23 unstable releases

0.12.6	Apr 11, 2025
0.12.5	Dec 4, 2024
0.12.4	Nov 7, 2024
0.12.1	Jun 21, 2024
0.1.0	Mar 22, 2022

#32 in Internationalization (i18n)

123 downloads per month

MIT/Apache

290KB
7.5K SLoC

Tantivy analysis

This is a collection of Tokenizers and TokenFilters for Tantivy that aims to replicate features available in Lucene.

It relies on Google's Rust ICU. libicu-dev and clang need to be installed in order to compile.

The word-breaking rules are taken from Lucene.

Features

The icu feature includes the following components (each is also a feature of its own):
- ICUTokenizer
- ICUNormalizer2TokenFilter
- ICUTransformTokenFilter
The commons feature includes the following components:
- LengthTokenFilter
- LimitTokenCountFilter
- PathTokenizer
- ReverseTokenFilter
- ElisionTokenFilter
- EdgeNgramTokenFilter
The phonetic feature provides several phonetic algorithms (Beider-Morse, Soundex, Metaphone, and others; see the crate documentation):
- PhoneticTokenFilter
The embedded feature enables the embedded rules of the rphonetic crate. It is not included by default. It has two sub-features: embedded-bm, which enables only the embedded Beider-Morse rules, and embedded-dm, which enables only the Daitch-Mokotoff rules.

Note that phonetic support probably needs improvements.
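
To give a sense of what a phonetic filter computes, here is a minimal sketch of classic American Soundex in plain Rust. It is an illustration only: it does not use PhoneticTokenFilter or the rphonetic crate, and the names `soundex` and `soundex_code` are hypothetical helpers, not part of this crate's API.

```rust
/// Maps a consonant to its Soundex digit; vowels, 'h', 'w', 'y' and
/// anything else have no code.
/// (Hypothetical helper for illustration, not part of the crate.)
fn soundex_code(c: char) -> Option<char> {
    match c.to_ascii_lowercase() {
        'b' | 'f' | 'p' | 'v' => Some('1'),
        'c' | 'g' | 'j' | 'k' | 'q' | 's' | 'x' | 'z' => Some('2'),
        'd' | 't' => Some('3'),
        'l' => Some('4'),
        'm' | 'n' => Some('5'),
        'r' => Some('6'),
        _ => None,
    }
}

/// Encodes a word as a four-character American Soundex key.
/// Non-ASCII characters are ignored; returns `None` if no ASCII letter is left.
fn soundex(word: &str) -> Option<String> {
    let mut letters = word.chars().filter(|c| c.is_ascii_alphabetic());
    let first = letters.next()?;

    let mut key = String::with_capacity(4);
    key.push(first.to_ascii_uppercase());

    // The first letter's own code also suppresses an immediate duplicate.
    let mut last = soundex_code(first);
    for c in letters {
        if key.len() == 4 {
            break;
        }
        // 'h' and 'w' are transparent: they do not separate equal codes.
        if matches!(c.to_ascii_lowercase(), 'h' | 'w') {
            continue;
        }
        let code = soundex_code(c);
        if let Some(digit) = code {
            if code != last {
                key.push(digit);
            }
        }
        // A vowel resets `last` to None, so the same code may be emitted again.
        last = code;
    }

    // Pad short keys with zeros.
    while key.len() < 4 {
        key.push('0');
    }
    Some(key)
}
```

With this sketch, `soundex("Robert")` and `soundex("Rupert")` both produce `R163`, which is the kind of collision a phonetic filter relies on to match differently spelled names.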

By default, icu, commons and phonetic are included.
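
As a rough illustration of what two of the commons filters listed above do to a single token, the following plain-Rust sketch mimics edge n-gram generation and elision stripping. It assumes the usual Lucene semantics for these filters and deliberately does not call EdgeNgramTokenFilter or ElisionTokenFilter; `edge_ngrams` and `strip_elision` are hypothetical names used only here.

```rust
/// Returns the prefixes of `token` whose length (in characters) lies between
/// `min_gram` and `max_gram`, shortest first.
/// (Illustrative helper, not the crate's EdgeNgramTokenFilter.)
fn edge_ngrams(token: &str, min_gram: usize, max_gram: usize) -> Vec<String> {
    // Work on chars so multi-byte characters are never split.
    let chars: Vec<char> = token.chars().collect();
    let upper = max_gram.min(chars.len());
    (min_gram.max(1)..=upper)
        .map(|n| chars[..n].iter().collect())
        .collect()
}

/// Removes a leading elided article such as "l'" or "d'" when the part
/// before the apostrophe is in `articles` (compared case-insensitively).
/// (Illustrative helper, not the crate's ElisionTokenFilter.)
fn strip_elision<'a>(token: &'a str, articles: &[&str]) -> &'a str {
    // Accept both the ASCII apostrophe and the typographic one.
    if let Some((prefix, rest)) = token.split_once(|c: char| c == '\'' || c == '\u{2019}') {
        if articles.iter().any(|a| a.eq_ignore_ascii_case(prefix)) {
            return rest;
        }
    }
    token
}
```

For example, `edge_ngrams("tantivy", 1, 3)` yields `t`, `ta` and `tan`, and `strip_elision("l'avion", &["l", "d"])` yields `avion`. In a real analyzer the crate's filters would be chained after a tokenizer, as in the example below.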

Example

use tantivy::collector::TopDocs;
use tantivy::query::QueryParser;
use tantivy::schema::{IndexRecordOption, SchemaBuilder, TextFieldIndexing, TextOptions, Value};
use tantivy::tokenizer::TextAnalyzer;
use tantivy::{doc, Index, ReloadPolicy, TantivyDocument};
use tantivy_analysis_contrib::icu::{Direction, ICUTokenizer, ICUTransformTokenFilter};

const ANALYSIS_NAME: &str = "test";

fn main() -> Result<(), Box<dyn std::error::Error>> {
    let options = TextOptions::default()
        .set_indexing_options(
            TextFieldIndexing::default()
                .set_tokenizer(ANALYSIS_NAME)
                .set_index_option(IndexRecordOption::WithFreqsAndPositions),
        )
        .set_stored();
    let mut schema = SchemaBuilder::new();
    schema.add_text_field("field", options);
    let schema = schema.build();

    // Transliterate to Latin, strip diacritics, then lowercase.
    let transform = ICUTransformTokenFilter::new(
        "Any-Latin; NFD; [:Nonspacing Mark:] Remove; Lower; NFC".to_string(),
        None,
        Direction::Forward,
    )?;
    let icu_analyzer = TextAnalyzer::builder(ICUTokenizer)
        .filter(transform)
        .build();

    let field = schema.get_field("field").expect("Can't get field.");

    let index = Index::create_in_ram(schema);
    index.tokenizers().register(ANALYSIS_NAME, icu_analyzer);

    let mut index_writer = index.writer(15_000_000)?;

    index_writer.add_document(doc!(
        field => "中国"
    ))?;
    index_writer.add_document(doc!(
        field => "Another Document"
    ))?;

    index_writer.commit()?;

    let reader = index
        .reader_builder()
        .reload_policy(ReloadPolicy::Manual)
        .try_into()?;

    let searcher = reader.searcher();

    let query_parser = QueryParser::for_index(&index, vec![field]);

    // Han characters are indexed as Latin text, so a pinyin query matches.
    let query = query_parser.parse_query("zhong")?;
    let top_docs = searcher.search(&query, &TopDocs::with_limit(10))?;
    let mut result: Vec<String> = Vec::new();
    for (_, doc_address) in top_docs {
        let retrieved_doc = searcher.doc::<TantivyDocument>(doc_address)?;
        result = retrieved_doc
            .get_all(field)
            .map(|v| v.as_str().unwrap().to_string())
            .collect();
    }
    let expected: Vec<String> = vec!["中国".to_string()];
    assert_eq!(expected, result);

    // The same analyzer runs at query time, so the original character matches too.
    let query = query_parser.parse_query("国")?;
    let top_docs = searcher.search(&query, &TopDocs::with_limit(10))?;
    let mut result: Vec<String> = Vec::new();
    for (_, doc_address) in top_docs {
        let retrieved_doc = searcher.doc::<TantivyDocument>(doc_address)?;
        result = retrieved_doc
            .get_all(field)
            .map(|v| v.as_str().unwrap().to_string())
            .collect();
    }
    let expected: Vec<String> = vec!["中国".to_string()];
    assert_eq!(expected, result);

    // `Lower` in the transform makes matching case-insensitive.
    let query = query_parser.parse_query("document")?;
    let top_docs = searcher.search(&query, &TopDocs::with_limit(10))?;
    let mut result: Vec<String> = Vec::new();
    for (_, doc_address) in top_docs {
        let retrieved_doc = searcher.doc::<TantivyDocument>(doc_address)?;
        result = retrieved_doc
            .get_all(field)
            .map(|v| v.as_str().unwrap().to_string())
            .collect();
    }
    let expected: Vec<String> = vec!["Another Document".to_string()];
    assert_eq!(expected, result);
    Ok(())
}

License

Licensed under either of

- Apache License, Version 2.0 (LICENSE-APACHE or http://www.apache.org/licenses/LICENSE-2.0)
- MIT license (LICENSE-MIT or http://opensource.org/licenses/MIT)

at your option.

Contribution

Unless you explicitly state otherwise, any contribution intentionally submitted for inclusion in the work by you, as defined in the Apache-2.0 license, shall be dual licensed as above, without any additional terms or conditions.

Dependencies

~0.5–2.7MB
~52K SLoC