13 breaking releases
Version | Released |
---|---|
0.15.0 | Feb 12, 2023 |
0.14.0 | Jul 13, 2022 |
0.13.0 | Sep 21, 2021 |
0.12.1 | Jul 26, 2021 |
0.1.0 | Sep 18, 2018 |
cang-jie(仓颉)
A Chinese tokenizer for tantivy, based on jieba-rs.
Currently, only UTF-8 input is supported.
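Because only UTF-8 is supported, text that arrives as raw bytes in another encoding (GBK, for instance) has to be transcoded before it is indexed. The following is a minimal sketch using only the standard library; `as_indexable_text` is a hypothetical helper for illustration and is not part of cang-jie.

```rust
use std::str;

/// Hypothetical helper (not part of cang-jie): returns the text only if
/// `raw` is valid UTF-8, so invalid input can be transcoded or rejected
/// before it reaches the tokenizer.
fn as_indexable_text(raw: &[u8]) -> Option<&str> {
    str::from_utf8(raw).ok()
}

fn main() {
    // Bytes taken from a Rust string literal are always valid UTF-8.
    let utf8 = "仓颉".as_bytes();
    // The character "中" encoded as GBK; this byte pair is not valid UTF-8.
    let gbk = [0xD6u8, 0xD0];

    println!("{:?}", as_indexable_text(utf8));
    println!("{:?}", as_indexable_text(&gbk));
}
```

In Rust, a `&str` is always valid UTF-8, so this check only matters at the boundary where bytes are read from files or the network.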
Example
let mut schema_builder = SchemaBuilder::default();
let text_indexing = TextFieldIndexing::default()
    .set_tokenizer(CANG_JIE) // Set custom tokenizer
    .set_index_option(IndexRecordOption::WithFreqsAndPositions);
let text_options = TextOptions::default()
    .set_indexing_options(text_indexing)
    .set_stored();
// ... Some code
let index = Index::create(RAMDirectory::create(), schema.clone())?;
let tokenizer = CangJieTokenizer {
    worker: Arc::new(Jieba::empty()), // empty dictionary
    option: TokenizerOption::Unicode,
};
index.tokenizers().register(CANG_JIE, tokenizer);
// ... Some code
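The example above pairs an empty dictionary with `TokenizerOption::Unicode`. Assuming that this option emits one token per Unicode character (an assumption here; check the crate source for the exact behavior), the splitting can be sketched with the standard library alone. `CharToken` and `split_per_char` are hypothetical names for illustration. The fields mirror the byte offsets and position that a tantivy token carries, which is why the offsets below advance by three bytes for each Chinese character.

```rust
/// Hypothetical stand-in for a search token: the text slice, its byte
/// offsets into the source string, and its ordinal position.
#[derive(Debug, PartialEq)]
struct CharToken<'a> {
    text: &'a str,
    offset_from: usize,
    offset_to: usize,
    position: usize,
}

/// Splits `text` into one token per Unicode scalar value.
/// This is a sketch of per-character tokenization, not cang-jie's code.
fn split_per_char(text: &str) -> Vec<CharToken<'_>> {
    text.char_indices()
        .enumerate()
        .map(|(position, (start, ch))| {
            // `len_utf8` gives the encoded width, so `end` is a byte offset.
            let end = start + ch.len_utf8();
            CharToken {
                text: &text[start..end],
                offset_from: start,
                offset_to: end,
                position,
            }
        })
        .collect()
}

fn main() {
    for token in split_per_char("仓颉a") {
        println!("{:?}", token);
    }
}
```

Slicing on the offsets from `char_indices` always lands on character boundaries, so the slices cannot panic on multi-byte input.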