#icu #tantivy

tantivy-analysis-contrib

A set of analysis components for Tantivy

8 releases (breaking)

Uses new Rust 2021

0.7.0 Dec 13, 2022
0.6.1 Nov 14, 2022
0.5.0 Aug 19, 2022
0.4.0 Jun 18, 2022
0.1.0 Mar 22, 2022

#32 in Internationalization (i18n)

MIT/Apache

280KB
7.5K SLoC


Tantivy analysis

This is a collection of Tokenizers and TokenFilters for Tantivy that aims to replicate features available in Lucene.

It relies on Google's Rust ICU.

The word-breaking rules are taken from Lucene.

Features

  • The icu feature includes the following components (each is also a feature of its own):
    • ICUTokenizer
    • ICUNormalizer2TokenFilter
    • ICUTransformTokenFilter
  • The commons feature includes the following components:
    • LengthTokenFilter
    • TrimTokenFilter
    • LimitTokenCountFilter
    • PathTokenizer
    • ReverseTokenFilter
    • ElisionTokenFilter
    • EdgeNgramTokenFilter
  • The phonetic feature includes several phonetic algorithms (Beider-Morse, Soundex, Metaphone, ...; see the crate documentation):
    • PhoneticTokenFilter
  • The embedded feature enables the embedded rules of the rphonetic crate. It is not included by default. It has two sub-features: embedded-bm, which enables only the embedded Beider-Morse rules, and embedded-dm, which enables only the Daitch-Mokotoff rules.
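To give a feel for what two of the commons components compute, here is a minimal standard-library sketch of edge n-grams and path-hierarchy tokens as they are usually defined in Lucene. The function names `edge_ngrams` and `path_hierarchy` are hypothetical and are not part of this crate's API; the crate's own filters may differ in options and edge cases.

```rust
/// Edge n-grams: every prefix of `token` whose length in characters
/// lies between `min_gram` and `max_gram` (inclusive).
/// Hypothetical helper for illustration, not the crate's EdgeNgramTokenFilter.
fn edge_ngrams(token: &str, min_gram: usize, max_gram: usize) -> Vec<String> {
    token
        .char_indices()
        // Byte offset just past each character, so slices stay on UTF-8 boundaries.
        .map(|(start, ch)| start + ch.len_utf8())
        .enumerate()
        // `i + 1` is the prefix length in characters.
        .filter(|&(i, _)| (min_gram..=max_gram).contains(&(i + 1)))
        .map(|(_, end)| token[..end].to_string())
        .collect()
}

/// Path hierarchy: one token per ancestor of `path`, then the full path.
/// Hypothetical helper for illustration, not the crate's PathTokenizer.
fn path_hierarchy(path: &str, delimiter: char) -> Vec<String> {
    let mut tokens: Vec<String> = path
        .char_indices()
        // A delimiter at offset 0 would yield an empty prefix, so skip it.
        .filter(|&(idx, ch)| ch == delimiter && idx > 0)
        .map(|(idx, _)| path[..idx].to_string())
        .collect();
    if !path.is_empty() {
        tokens.push(path.to_string());
    }
    tokens
}

fn main() {
    println!("{:?}", edge_ngrams("hello", 1, 3)); // ["h", "he", "hel"]
    println!("{:?}", path_hierarchy("/a/b/c", '/')); // ["/a", "/a/b", "/a/b/c"]
}
```

Emitting every ancestor as its own token is what lets a query for a directory match all documents stored beneath it.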

Note that phonetic support probably needs improvements.
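For readers unfamiliar with phonetic encoders, the sketch below implements classic American Soundex, one of the algorithms named in the feature list, using only the standard library. It is an independent illustration: the `soundex` function here is hypothetical, is not the rphonetic implementation, and may differ from it in edge cases.

```rust
/// Soundex digit for an uppercase ASCII letter, or `None` for letters
/// that carry no code (vowels, H, W, Y).
fn soundex_code(c: char) -> Option<char> {
    match c {
        'B' | 'F' | 'P' | 'V' => Some('1'),
        'C' | 'G' | 'J' | 'K' | 'Q' | 'S' | 'X' | 'Z' => Some('2'),
        'D' | 'T' => Some('3'),
        'L' => Some('4'),
        'M' | 'N' => Some('5'),
        'R' => Some('6'),
        _ => None,
    }
}

/// Classic American Soundex: first letter plus three digits, zero-padded.
/// Returns `None` when the input contains no ASCII letter.
fn soundex(word: &str) -> Option<String> {
    let mut letters = word
        .chars()
        .filter(|c| c.is_ascii_alphabetic())
        .map(|c| c.to_ascii_uppercase());
    let first = letters.next()?;
    let mut out = String::with_capacity(4);
    out.push(first);
    // The first letter's own code takes part in duplicate suppression.
    let mut last = soundex_code(first);
    for c in letters {
        if out.len() == 4 {
            break;
        }
        // H and W are transparent: they do not separate equal codes.
        if c == 'H' || c == 'W' {
            continue;
        }
        let code = soundex_code(c);
        if let Some(digit) = code {
            if code != last {
                out.push(digit);
            }
        }
        // A vowel resets `last` to None, so a repeated code after it is kept.
        last = code;
    }
    while out.len() < 4 {
        out.push('0');
    }
    Some(out)
}

fn main() {
    for name in ["Robert", "Rupert", "Ashcraft", "Pfister"] {
        println!("{} -> {:?}", name, soundex(name));
    }
}
```

Names that sound alike, such as "Robert" and "Rupert", receive the same code, which is what allows a phonetic filter to match them at search time.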

By default, icu, commons and phonetic are included.
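Since these are ordinary Cargo features, a project that needs only part of the crate can presumably opt out of the defaults. The fragment below is a sketch that assumes the feature names listed above and the 0.7 release; check the crate documentation before relying on it.

```toml
[dependencies]
# Only the commons filters: no ICU and no phonetic support.
tantivy-analysis-contrib = { version = "0.7", default-features = false, features = ["commons"] }

# Alternative: keep the defaults and add the embedded Beider-Morse rules.
# tantivy-analysis-contrib = { version = "0.7", features = ["embedded-bm"] }
```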

Example

use tantivy::{doc, Index, ReloadPolicy};
use tantivy::collector::TopDocs;
use tantivy::query::QueryParser;
use tantivy::schema::{IndexRecordOption, SchemaBuilder, TextFieldIndexing, TextOptions};
use tantivy::tokenizer::TextAnalyzer;
use tantivy_analysis_contrib::icu::{Direction, ICUTokenizer, ICUTransformTokenFilter};

const ANALYSIS_NAME: &str = "test";

fn main() -> Result<(), Box<dyn std::error::Error>> {
    let options = TextOptions::default()
        .set_indexing_options(
            TextFieldIndexing::default()
                .set_tokenizer(ANALYSIS_NAME)
                .set_index_option(IndexRecordOption::WithFreqsAndPositions),
        )
        .set_stored();
    let mut schema = SchemaBuilder::new();
    schema.add_text_field("field", options);
    let schema = schema.build();

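    // Transliterate to Latin, decompose (NFD), strip nonspacing marks (diacritics),
    // lowercase, then recompose (NFC).
    let transform = ICUTransformTokenFilter {
        compound_id: "Any-Latin; NFD; [:Nonspacing Mark:] Remove; Lower;  NFC".to_string(),
        rules: None,
        direction: Direction::Forward,
    };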
    let icu_analyzer = TextAnalyzer::from(ICUTokenizer).filter(transform);

    let field = schema.get_field("field").expect("Can't get field.");

    let index = Index::create_in_ram(schema);
    index.tokenizers().register(ANALYSIS_NAME, icu_analyzer);

    let mut index_writer = index.writer(3_000_000)?;

    index_writer.add_document(doc!(
        field => "中国"
    ))?;
    index_writer.add_document(doc!(
        field => "Another Document"
    ))?;

    index_writer.commit()?;

    let reader = index
        .reader_builder()
        .reload_policy(ReloadPolicy::OnCommit)
        .try_into()?;

    let searcher = reader.searcher();

    let query_parser = QueryParser::for_index(&index, vec![field]);

    let query = query_parser.parse_query("zhong")?;
    let top_docs = searcher.search(&query, &TopDocs::with_limit(10))?;
    let mut result: Vec<String> = Vec::new();
    for (_, doc_address) in top_docs {
        let retrieved_doc = searcher.doc(doc_address)?;
        let values: Vec<&str> = retrieved_doc.get_all(field).map(|v| v.as_text().unwrap()).collect();
        for v in values {
            result.push(v.to_string());
        }
    }
    let expected: Vec<String> = vec!["中国".to_string()];
    assert_eq!(expected, result);

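    // The query is analyzed with the same tokenizer, so the Han character is transliterated before matching.
    let query = query_parser.parse_query("国")?;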
    let top_docs = searcher.search(&query, &TopDocs::with_limit(10))?;
    let mut result: Vec<String> = Vec::new();
    for (_, doc_address) in top_docs {
        let retrieved_doc = searcher.doc(doc_address)?;
        let values: Vec<&str> = retrieved_doc.get_all(field).map(|v| v.as_text().unwrap()).collect();
        for v in values {
            result.push(v.to_string());
        }
    }
    let expected: Vec<String> = vec!["中国".to_string()];
    assert_eq!(expected, result);

    let query = query_parser.parse_query("document")?;
    let top_docs = searcher.search(&query, &TopDocs::with_limit(10))?;
    let mut result: Vec<String> = Vec::new();
    for (_, doc_address) in top_docs {
        let retrieved_doc = searcher.doc(doc_address)?;
        let values: Vec<&str> = retrieved_doc.get_all(field).map(|v| v.as_text().unwrap()).collect();
        for v in values {
            result.push(v.to_string());
        }
    }
    let expected: Vec<String> = vec!["Another Document".to_string()];
    assert_eq!(expected, result);
    Ok(())
}

License

Licensed under either of the Apache License, Version 2.0, or the MIT license, at your option.

Contribution

Unless you explicitly state otherwise, any contribution intentionally submitted for inclusion in the work by you, as defined in the Apache-2.0 license, shall be dual licensed as above, without any additional terms or conditions.

Dependencies

~14–44MB
~721K SLoC