Tantivy analysis

This is a collection of Tokenizers and TokenFilters that aims to replicate analysis features available in Lucene.

It relies on Google's rust_icu crate.

The word-break rules come from Lucene.
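
For example, the tokenizer can be run on its own to inspect how text is segmented. A minimal sketch, using tantivy's standard token_stream API (the input string is arbitrary):

use tantivy::tokenizer::{TextAnalyzer, TokenStream};
use tantivy_analysis_contrib::ICUTokenizer;

fn main() {
    // Segment mixed-script text with the ICU word-break rules.
    let analyzer = TextAnalyzer::from(ICUTokenizer);
    let mut stream = analyzer.token_stream("東京都 Tokyo");
    while stream.advance() {
        println!("{}", stream.token().text);
    }
}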

Features

  • tokenizer: enables ICUTokenizer.
  • normalizer: enables ICUNormalizer2TokenFilter.
  • transform: enables ICUTransformTokenFilter.
  • icu: all of the above.
  • commons: common token filters and tokenizers:
    • LengthTokenFilter
    • TrimTokenFilter
    • LimitTokenCountFilter
    • PathTokenizer

By default, all features are included; a subset can be selected by disabling default features, as sketched below.
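
A minimal Cargo.toml sketch enabling only the ICU tokenizer and the common filters (the feature names are those listed above; pick whichever you need):

[dependencies]
tantivy-analysis-contrib = { version = "0.2.0", default-features = false, features = ["tokenizer", "commons"] }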

Example

use tantivy::{doc, Index, ReloadPolicy};
use tantivy::collector::TopDocs;
use tantivy::query::QueryParser;
use tantivy::schema::{IndexRecordOption, SchemaBuilder, TextFieldIndexing, TextOptions};
use tantivy::tokenizer::TextAnalyzer;
use tantivy_analysis_contrib::{Direction, ICUTokenizer, ICUTransformTokenFilter};

const ANALYSIS_NAME: &str = "test";

fn main() {
    let options = TextOptions::default()
        .set_indexing_options(
            TextFieldIndexing::default()
                .set_tokenizer(ANALYSIS_NAME)
                .set_index_option(IndexRecordOption::WithFreqsAndPositions),
        )
        .set_stored();
    let mut schema = SchemaBuilder::new();
    schema.add_text_field("field", options);
    let schema = schema.build();

    // Transliterate to Latin, strip diacritics, and lowercase.
    let transform = ICUTransformTokenFilter {
        compound_id: "Any-Latin; NFD; [:Nonspacing Mark:] Remove; Lower; NFC".to_string(),
        rules: None,
        direction: Direction::Forward,
    };
    let icu_analyzer = TextAnalyzer::from(ICUTokenizer).filter(transform);

    let field = schema.get_field("field").unwrap();

    let index = Index::create_in_ram(schema);
    index.tokenizers().register(ANALYSIS_NAME, icu_analyzer);

    let mut index_writer = index.writer(3_000_000).expect("Error getting index writer");

    index_writer.add_document(doc!(
        field => "中国"
    )).expect("Error adding document");
    index_writer.add_document(doc!(
        field => "Another Document"
    )).expect("Error adding document");

    index_writer.commit().expect("Error committing index");

    let reader = index
        .reader_builder()
        .reload_policy(ReloadPolicy::OnCommit)
        .try_into().expect("Error getting index reader");

    let searcher = reader.searcher();

    let query_parser = QueryParser::for_index(&index, vec![field]);

    let query = query_parser.parse_query("zhong").expect("Can't create query parser.");
    let top_docs = searcher.search(&query, &TopDocs::with_limit(10)).expect("Error running search");
    let mut result: Vec<String> = Vec::new();
    for (_, doc_address) in top_docs {
        let retrieved_doc = searcher.doc(doc_address).expect("Can't retrieve document");
        let values: Vec<&str> = retrieved_doc.get_all(field).map(|v| v.as_text().unwrap()).collect();
        for v in values {
            result.push(v.to_string());
        }
    }
    let expected: Vec<String> = vec!["中国".to_string()];
    assert_eq!(expected, result);

    let query = query_parser.parse_query("").expect("Can't create query parser.");
    let top_docs = searcher.search(&query, &TopDocs::with_limit(10)).expect("Error running search");
    let mut result: Vec<String> = Vec::new();
    for (_, doc_address) in top_docs {
        let retrieved_doc = searcher.doc(doc_address).expect("Can't retrieve document");
        let values: Vec<&str> = retrieved_doc.get_all(field).map(|v| v.as_text().unwrap()).collect();
        for v in values {
            result.push(v.to_string());
        }
    }
    let expected: Vec<String> = vec!["中国".to_string()];
    assert_eq!(expected, result);

    let query = query_parser.parse_query("document").expect("Can't create query parser.");
    let top_docs = searcher.search(&query, &TopDocs::with_limit(10)).expect("Error running search");
    let mut result: Vec<String> = Vec::new();
    for (_, doc_address) in top_docs {
        let retrieved_doc = searcher.doc(doc_address).expect("Can't retrieve document");
        let values: Vec<&str> = retrieved_doc.get_all(field).map(|v| v.as_text().unwrap()).collect();
        for v in values {
            result.push(v.to_string());
        }
    }
    let expected: Vec<String> = vec!["Another Document".to_string()];
    assert_eq!(expected, result);
}
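
To see what the analyzer actually emits, the same pipeline can be run directly over a string. A minimal sketch; given the transform above, 中国 should come out as lowercase Latin syllables, which is why the "zhong" query matches:

use tantivy::tokenizer::{TextAnalyzer, TokenStream};
use tantivy_analysis_contrib::{Direction, ICUTokenizer, ICUTransformTokenFilter};

fn main() {
    let transform = ICUTransformTokenFilter {
        compound_id: "Any-Latin; NFD; [:Nonspacing Mark:] Remove; Lower; NFC".to_string(),
        rules: None,
        direction: Direction::Forward,
    };
    let analyzer = TextAnalyzer::from(ICUTokenizer).filter(transform);

    // Print every token produced for the input; expected to be the
    // transliterated, lowercased forms of the CJK characters.
    let mut stream = analyzer.token_stream("中国");
    while stream.advance() {
        println!("{}", stream.token().text);
    }
}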

TODO

  • Phonetic
  • Reverse

License

Licensed under either of

  • Apache License, Version 2.0
  • MIT license

at your option.

Contribution

Unless you explicitly state otherwise, any contribution intentionally submitted for inclusion in the work by you, as defined in the Apache-2.0 license, shall be dual licensed as above, without any additional terms or conditions.
