#search #tokenizer #chinese #tantivy

cang-jie

A Chinese tokenizer for tantivy

16 breaking releases

0.18.0 Nov 4, 2023
0.16.0 Jun 11, 2023
0.15.0 Feb 12, 2023
0.14.0 Jul 13, 2022
0.1.0 Sep 18, 2018

#3 in #full-text-search

745 downloads per month
Used in 8 crates (5 directly)

MIT license

7KB
101 lines

cang-jie(仓颉)

A Chinese tokenizer for tantivy, based on jieba-rs.

Currently, only UTF-8 input is supported.
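To illustrate what the UTF-8 requirement means in practice, here is a minimal standard-library sketch. It assumes the text arrives as raw bytes that may come from a legacy encoding such as GBK; the helper name `as_utf8` is hypothetical and not part of cang-jie. Rust's `&str` is always UTF-8, so bytes in another encoding have to be converted before they can be handed to a tokenizer.

```rust
/// Hypothetical helper (not part of cang-jie): returns the text as `&str`
/// only if the bytes are valid UTF-8.
fn as_utf8(bytes: &[u8]) -> Option<&str> {
    std::str::from_utf8(bytes).ok()
}

fn main() {
    // "中文" encoded as UTF-8 (three bytes per character here).
    let utf8_bytes = "中文".as_bytes();
    // The same two characters encoded as GBK (two bytes per character).
    let gbk_bytes: [u8; 4] = [0xD6, 0xD0, 0xCE, 0xC4];

    println!("{:?}", as_utf8(utf8_bytes));
    println!("{:?}", as_utf8(&gbk_bytes));
}
```

Input that fails this check would need to be transcoded to UTF-8 by some other means before indexing.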

Example

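    use std::sync::Arc;

    use cang_jie::{CangJieTokenizer, TokenizerOption, CANG_JIE};
    use jieba_rs::Jieba;
    use tantivy::schema::{IndexRecordOption, SchemaBuilder, TextFieldIndexing, TextOptions};
    use tantivy::Index;
    // RAMDirectory also needs importing; its path and spelling depend on the tantivy version.

    let mut schema_builder = SchemaBuilder::default();
    let text_indexing = TextFieldIndexing::default()
        .set_tokenizer(CANG_JIE) // Use the tokenizer registered under the CANG_JIE name
        .set_index_option(IndexRecordOption::WithFreqsAndPositions); // Record frequencies and positions
    let text_options = TextOptions::default()
        .set_indexing_options(text_indexing)
        .set_stored(); // Keep the original text retrievable
    // ... Some code
    let index = Index::create(RAMDirectory::create(), schema.clone())?;
    let tokenizer = CangJieTokenizer {
        worker: Arc::new(Jieba::empty()), // empty dictionary
        option: TokenizerOption::Unicode,
    };
    index.tokenizers().register(CANG_JIE, tokenizer);
    // ... Some code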
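
The example above registers a tokenizer with an empty dictionary and `TokenizerOption::Unicode`. As a rough illustration of what such a configuration might emit, the standard-library sketch below assumes that this option yields one token per Unicode character (an assumption inferred from the option name, not confirmed here). `SimpleToken` and `char_tokens` are hypothetical names, loosely modeled on the text, byte offsets, and position that a search-engine token usually carries; this is not the crate's implementation.

```rust
/// Hypothetical simplified token: the text plus its byte range and
/// ordinal position in the input.
#[derive(Debug, PartialEq)]
struct SimpleToken {
    text: String,
    offset_from: usize, // byte offset where the token starts
    offset_to: usize,   // byte offset just past the token's end
    position: usize,    // index of the token in the stream
}

/// Splits `text` into one token per Unicode scalar value.
fn char_tokens(text: &str) -> Vec<SimpleToken> {
    text.char_indices()
        .enumerate()
        .map(|(position, (offset_from, ch))| SimpleToken {
            text: ch.to_string(),
            offset_from,
            offset_to: offset_from + ch.len_utf8(),
            position,
        })
        .collect()
}

fn main() {
    for token in char_tokens("仓颉a") {
        println!("{:?}", token);
    }
}
```

Offsets here are counted in bytes rather than characters, which is why each Chinese character in the sample advances the offset by three while the ASCII letter advances it by one.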

Full example

Dependencies

~24–57MB
~789K SLoC