10 breaking releases

new 0.12.1 Jul 26, 2021
0.11.1 Feb 20, 2021
0.10.0 Aug 26, 2020
0.8.0 May 15, 2020
0.1.0 Sep 18, 2018

#6 in #chinese


118 downloads per month
Used in 6 crates (2 directly)

MIT license

6KB
101 lines

cang-jie(仓颉)


A Chinese tokenizer for tantivy, based on jieba-rs.

Currently, only UTF-8 input is supported.

Example

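    use std::sync::Arc;

    use cang_jie::{CangJieTokenizer, TokenizerOption, CANG_JIE};
    use jieba_rs::Jieba;
    use tantivy::{directory::RAMDirectory, schema::*, Index};

    let mut schema_builder = SchemaBuilder::default();
    // Index the field with the cang-jie tokenizer, recording frequencies and positions.
    let text_indexing = TextFieldIndexing::default()
        .set_tokenizer(CANG_JIE) // Set custom tokenizer
        .set_index_option(IndexRecordOption::WithFreqsAndPositions);
    let text_options = TextOptions::default()
        .set_indexing_options(text_indexing)
        .set_stored();
    // ... Some code
    let index = Index::create(RAMDirectory::create(), schema.clone())?;
    // Register the tokenizer under the same name that the schema refers to.
    let tokenizer = CangJieTokenizer {
        worker: Arc::new(Jieba::empty()), // empty dictionary
        option: TokenizerOption::Unicode,
    };
    index.tokenizers().register(CANG_JIE, tokenizer);
    // ... Some code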
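
The example above pairs an empty dictionary with `TokenizerOption::Unicode`, which suggests character-level splitting rather than dictionary-based segmentation. As a rough illustration only, here is a std-only sketch of what a per-character tokenizer could emit, assuming one token per Unicode scalar value with byte offsets into the UTF-8 source. `CharToken` and `split_by_char` are hypothetical names for this sketch; they are not part of cang-jie, jieba-rs, or tantivy, and the crate's actual behavior may differ.

```rust
/// One token: its text, byte offsets into the source, and ordinal position.
/// (Hypothetical type for illustration; not a cang-jie or tantivy type.)
#[derive(Debug, PartialEq)]
struct CharToken<'a> {
    text: &'a str,
    offset_from: usize,
    offset_to: usize,
    position: usize,
}

/// Split `text` into one token per Unicode scalar value.
/// Offsets are byte offsets, so a CJK character usually spans 3 bytes in UTF-8.
fn split_by_char(text: &str) -> Vec<CharToken<'_>> {
    text.char_indices()
        .enumerate()
        .map(|(position, (start, ch))| {
            let end = start + ch.len_utf8();
            CharToken {
                text: &text[start..end],
                offset_from: start,
                offset_to: end,
                position,
            }
        })
        .collect()
}

fn main() {
    for token in split_by_char("仓颉a") {
        println!("{:?}", token);
    }
}
```

Byte offsets matter here because search engines typically use them for highlighting, and slicing a `&str` at anything other than a character boundary panics in Rust.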

Full example

Dependencies

~19MB
~263K SLoC