#tantivy #tokenizer #stemmer #czech

tantivy-czech-stemmer

Czech stemmer as Tantivy tokenizer

2 unstable releases

0.2.1 May 1, 2024
0.1.1 Apr 28, 2024

#605 in Text processing

Download history 229/week @ 2024-04-27 13/week @ 2024-05-04

71 downloads per month

BSD-3-Clause

58KB
1.5K SLoC

Tantivy tokenizer / Czech stemmer

This library bundles several OSS to provide Czech language stemmer as a Tantivy tokenizer. Tantivy is a full-text search engine library written in Rust. As its default Stemmer tokenizer depends on a dead library rust-stemmers, there are only a very few languages available by default. Nevertheless, Tantivy provides an easy way to build our own custom tokenizers (see the tantivy-tokenizer-api for details).

This repository bundles several OSS projects into 1 library:

  • Algorithms

    Currently only a single algorithm (in an aggressive and light variants) is available: Dolamic. This algorithm has been developed by Ljiljana Dolamic & Jacques Savoy and published under the BSD license. It's written in the Snowball language and is available on the Snowball website.

    There is 1 more stemming algorithm for the Czech language: Hellebrand. This algorithm has been developed by David Hellebrand & Petr Chmelař. It's also written in the Snowball language and is available as a Master's thesis here. However, this algorithm has been published under the GNU license and is therefore not included in this library as we'd like to keep the BSD license on this library. (If you wish, you can always compile the Hellebrand algorithm from Snowball to Rust and include it yourself.)

  • rust-stemmers

    This library (used by Tantivy under the hood) implements a Rust interface for a Snowball algorithms of several languages. This library is inspired by rust-stemmers and some source code is taken directly from rust-stemmers (namely src/snowball/*).

  • Tantivy

    Implementation of the tokenizer in this library is mostly a copy of the original implementation of the Stemmer tokenizer in the Tantivy library. Only instead of different languages, there are available different algorithms for the Czech language. And instead of importing from the tantivy lib, this library imports from tantivy-tokenizer-api.

Usage

use tantivy::Index;
use tantivy::schema::{Schema, TextFieldIndexing, TextOptions, IndexRecordOption};
use tantivy::tokenizer::{LowerCaser, SimpleTokenizer, TextAnalyzer};
use tantivy_czech_stemmer;

fn main() {
    let mut schema_builder = Schema::builder();

    schema_builder.add_text_field(
        "title",
        TextOptions::default()
            .set_indexing_options(
                TextFieldIndexing::default()
                    // Set name of the tokenizer, we will register it shortly
                    .set_tokenizer("lang_cs")
                    .set_index_option(IndexRecordOption::WithFreqsAndPositions),
            )
            .set_stored(),
    );

    let schema = schema_builder.build();
    let index = Index::create_in_ram(schema.clone());

    // Create an instance of the Czech stemmer tokenizer

    // With default algorithm (Dolamic aggressive)
    let stemmer_tokenizer = tantivy_czech_stemmer::tokenizer::Stemmer::default();

    // With a specific algorithm
    // let stemmer_tokenizer = tantivy_czech_stemmer::tokenizer::Stemmer::new(
    //     tantivy_czech_stemmer::tokenizer::Algorithm::DolamicLight,
    // );

    // Before we register it, we need to wrap in an instance
    // of the TextAnalyzer tokenizer. We also have to transform
    // the text to lowercase since our stemmer expects lowercase.
    let czech_tokenizer = TextAnalyzer::builder(
        stemmer_tokenizer.transform(
            LowerCaser.transform(SimpleTokenizer::default())
        ),
    ).build();

    // Register the tokenizer with Tantivy
    index.tokenizers().register("lang_cs", czech_tokenizer);
}

Dependencies

~0.8–1.2MB
~28K SLoC