2 unstable releases
| Version | Release date |
|---|---|
| 0.2.1 | May 1, 2024 |
| 0.1.1 | Apr 28, 2024 |
Tantivy tokenizer / Czech stemmer
This library bundles several open-source projects to provide a Czech-language stemmer as a Tantivy tokenizer. Tantivy is a full-text search engine library written in Rust. Because its default `Stemmer` tokenizer depends on the unmaintained `rust-stemmers` library, only a few languages are available by default. Nevertheless, Tantivy makes it easy to build custom tokenizers (see `tantivy-tokenizer-api` for details).
This repository bundles several open-source projects into one library:

- Algorithms

  Currently only a single algorithm, Dolamic, is available (in aggressive and light variants). It was developed by Ljiljana Dolamic & Jacques Savoy and published under the BSD license. It is written in the Snowball language and is available on the Snowball website.

  There is one more stemming algorithm for the Czech language: Hellebrand. It was developed by David Hellebrand & Petr Chmelař. It is also written in the Snowball language and was published as part of a Master's thesis. However, it was released under the GNU license and is therefore not included, because this library aims to stay BSD-licensed. (If you wish, you can compile the Hellebrand algorithm from Snowball to Rust and include it yourself.)

- `rust-stemmers` (used by Tantivy under the hood) implements a Rust interface for the Snowball algorithms of several languages. This library is inspired by `rust-stemmers`, and some source code is taken directly from it (namely `src/snowball/*`).

- The tokenizer implementation in this library is mostly a copy of the original `Stemmer` tokenizer from the Tantivy library. There are two differences: instead of a choice of languages it offers a choice of algorithms for the Czech language, and instead of importing from `tantivy` it imports from `tantivy-tokenizer-api`.
Usage
    use tantivy::Index;
    use tantivy::schema::{IndexRecordOption, Schema, TextFieldIndexing, TextOptions};
    // `TokenFilter` must be in scope for the `.transform(...)` calls below.
    use tantivy::tokenizer::{LowerCaser, SimpleTokenizer, TextAnalyzer, TokenFilter};
    use tantivy_czech_stemmer;

    fn main() {
        let mut schema_builder = Schema::builder();

        schema_builder.add_text_field(
            "title",
            TextOptions::default()
                .set_indexing_options(
                    TextFieldIndexing::default()
                        // Set the name of the tokenizer; we will register it shortly
                        .set_tokenizer("lang_cs")
                        .set_index_option(IndexRecordOption::WithFreqsAndPositions),
                )
                .set_stored(),
        );

        let schema = schema_builder.build();
        let index = Index::create_in_ram(schema.clone());

        // Create an instance of the Czech stemmer tokenizer
        // with the default algorithm (Dolamic aggressive)
        let stemmer_tokenizer = tantivy_czech_stemmer::tokenizer::Stemmer::default();

        // With a specific algorithm
        // let stemmer_tokenizer = tantivy_czech_stemmer::tokenizer::Stemmer::new(
        //     tantivy_czech_stemmer::tokenizer::Algorithm::DolamicLight,
        // );

        // Before we register it, we need to wrap it in an instance
        // of the TextAnalyzer tokenizer. We also have to transform
        // the text to lowercase, since our stemmer expects lowercase input.
        let czech_tokenizer = TextAnalyzer::builder(
            stemmer_tokenizer.transform(
                LowerCaser.transform(SimpleTokenizer::default())
            ),
        ).build();

        // Register the tokenizer with Tantivy
        index.tokenizers().register("lang_cs", czech_tokenizer);
    }
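The comment in the example above notes that the stemmer expects lowercase input. The following standard-library-only sketch illustrates why the order of the pipeline stages matters. It does not use Tantivy or this crate: `toy_stem`, `tokenize`, `analyze`, and the suffix list are hypothetical stand-ins invented for illustration, and `toy_stem` is not the Dolamic algorithm.

```rust
/// Toy stand-in for a stemmer: strips one of a few lowercase suffixes.
/// This is NOT the Dolamic algorithm; the suffix list is purely illustrative.
fn toy_stem(token: &str) -> String {
    const SUFFIXES: [&str; 3] = ["ami", "ech", "y"];
    for &suffix in SUFFIXES.iter() {
        if let Some(stem) = token.strip_suffix(suffix) {
            // Keep at least three characters so very short words stay intact.
            if stem.chars().count() >= 3 {
                return stem.to_string();
            }
        }
    }
    token.to_string()
}

/// Split on non-alphanumeric characters and drop empty pieces.
fn tokenize(text: &str) -> Vec<String> {
    text.split(|c: char| !c.is_alphanumeric())
        .filter(|t| !t.is_empty())
        .map(str::to_string)
        .collect()
}

/// Run the pipeline: tokenize, optionally lowercase, then stem.
fn analyze(text: &str, lowercase_first: bool) -> Vec<String> {
    tokenize(text)
        .into_iter()
        .map(|t| if lowercase_first { t.to_lowercase() } else { t })
        .map(|t| toy_stem(&t))
        .collect()
}

fn main() {
    let text = "HRADY hrady";
    // With lowercasing, both spellings reduce to the same stem.
    println!("{:?}", analyze(text, true));
    // Without it, the uppercase token passes through the stemmer unchanged.
    println!("{:?}", analyze(text, false));
}
```

Because the suffix rules only match lowercase text, skipping the lowercasing step leaves differently cased spellings of the same word as distinct terms in the index, so a query for one would not match the other.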