#tantivy #tokenizer #stemmer

tantivy-stemmers

A collection of Tantivy stemmer tokenizers

4 releases (breaking)

0.4.0 Jun 27, 2024
0.3.0 Jun 1, 2024
0.2.0 May 4, 2024
0.1.0 May 2, 2024


BSD-3-Clause

1MB
26K SLoC

Tantivy stemmers (tokenizer)

This library bundles several OSS projects to provide a collection of stemming algorithms for various languages as a Tantivy tokenizer. Tantivy is a full-text search engine library written in Rust. Because its default Stemmer tokenizer depends on the largely unmaintained rust-stemmers library, only a few languages are available out of the box. Fortunately, Tantivy provides an easy way to build your own custom tokenizers (see tantivy-tokenizer-api for details).

This library combines several OSS projects into one library:

  • snowballstem/snowball

    All the raw algorithms in this library are written in the Snowball language and then compiled into Rust code using the Snowball compiler; all the generated algorithms are located at src/snowball/algorithms/*. A Snowball environment is then needed to execute the generated algorithms. This environment comprises the files src/snowball/among.rs and src/snowball/env.rs, both copied from the official Snowball repository: rust/src/snowball.

  • Tantivy

    The Stemmer implementation in this library is more or less a copy of the original Stemmer tokenizer from the Tantivy library, except that this library does not depend on the rust-stemmers package and instead bundles the various algorithms itself. It also imports from tantivy-tokenizer-api rather than from the tantivy crate.

  • Algorithms

    Most, if not all, stemming algorithms are obtained from the official Snowball website and compiled into Rust using the Snowball compiler. More information about individual algorithm licenses is noted below; most are published under the BSD license.

Cargo features

As this library bundles many algorithms and contains a lot of generated code, it would be wasteful to include all of it in the final build. For this reason, each algorithm is published as a Cargo feature: to use a specific algorithm, you must enable the corresponding feature. For example, to use the Dolamic algorithm for Czech in its aggressive variant, your Cargo.toml should look like this:

# ...
[dependencies]
tantivy-stemmers = { version = "0.4.0", features = ["default", "czech_dolamic_aggressive"] }
# ...

See the features table under Supported algorithms below.

Usage

use tantivy::Index;
use tantivy::schema::{Schema, TextFieldIndexing, TextOptions, IndexRecordOption};
use tantivy::tokenizer::{LowerCaser, SimpleTokenizer, TextAnalyzer};
use tantivy_tokenizer_api::TokenFilter;
use tantivy_stemmers;

fn main() {
    let mut schema_builder = Schema::builder();

    schema_builder.add_text_field(
        "title",
        TextOptions::default()
            .set_indexing_options(
                TextFieldIndexing::default()
                    // Set the name of the tokenizer; we will register it below
                    .set_tokenizer("lang_cs")
                    .set_index_option(IndexRecordOption::WithFreqsAndPositions),
            )
            .set_stored(),
    );

    let schema = schema_builder.build();
    let index = Index::create_in_ram(schema.clone());

    // Create an instance of the StemmerTokenizer

    // With default algorithm (default algorithm is [`tantivy_stemmers::algorithms::english_porter_2`])
    // let stemmer = tantivy_stemmers::StemmerTokenizer::default();

    // With a specific algorithm
    let stemmer = tantivy_stemmers::StemmerTokenizer::new(
        tantivy_stemmers::algorithms::czech_dolamic_aggressive,
    );

    // Before we register it, we need to wrap it in an instance
    // of the TextAnalyzer tokenizer.
    // ❗️ We also have to transform the text to lowercase since
    // the stemmer expects lowercase.
    let tokenizer = TextAnalyzer::builder(
        stemmer.transform(LowerCaser.transform(SimpleTokenizer::default())),
    ).build();

    // Register our tokenizer with Tantivy under a custom name
    index.tokenizers().register("lang_cs", tokenizer);
}

Supported algorithms

List of available Cargo features

Feature Default Language Notes
arabic - Arabic
armenian_mkrtchyan - Armenian
basque - Basque
catalan - Catalan
czech_dolamic_aggressive - Czech
czech_dolamic_light - Czech
danish - Danish
dutch - Dutch
english_lovins - English
english_porter - English Porter has been deprecated in favour of Porter 2
english_porter_2 ✓ (default) English
estonian_freienthal - Estonian
finnish - Finnish
french - French
german - German
greek - Greek
hindi_lightweight - Hindi
hungarian - Hungarian
indonesian_tala - Indonesian
irish_gaelic - Irish
italian - Italian
lithuanian_jocas - Lithuanian
nepali - Nepali
norwegian_bokmal - Norwegian
polish_yarovoy - Polish Non-Snowball alg.
polish_yarovoy_unaccented - Polish Non-Snowball alg.; besides stemming, this alg. also removes accents
portuguese - Portuguese
romanian_heidelberg - Romanian
romanian_tirdea - Romanian
romanian - Romanian
russian - Russian
spanish - Spanish
swedish - Swedish
turkish_cilden - Turkish
yiddish_urieli - Yiddish

Notes on individual algorithms and their sources

  • Arabic

    The Arabic Snowball algorithm was developed by Assem Chelli and Abdelkrim Aries. Its source code has been obtained under the BSD license from the official Snowball GitHub repository.

  • Armenian

    The Armenian Snowball algorithm was developed by Astghik Mkrtchyan and source code has been obtained under the BSD license from the official Snowball website.

  • Basque

    The Basque Snowball algorithm was obtained under the BSD license from the official Snowball website.

  • Catalan

    The Catalan Snowball algorithm was obtained under the BSD license from the official Snowball website.

  • Czech

    Currently only a single algorithm (in aggressive and light variants) is available: Dolamic. This algorithm was developed by Ljiljana Dolamic & Jacques Savoy and published under the BSD license. It is written in the Snowball language and is available on the Snowball website.

    There is one more stemming algorithm for the Czech language: Hellebrand, developed by David Hellebrand & Petr Chmelař. It is also written in the Snowball language and is available as a Master's thesis here. However, this algorithm has been published under the GNU license and is therefore not included, as we'd like to keep the BSD license on this library. (If you wish, you can always compile the Hellebrand algorithm from Snowball to Rust and include it yourself.)
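
    The difference between the light and aggressive variants can be sketched with a toy suffix-stripper. This is an illustration only, not the Dolamic algorithm, and the suffix lists below are made up for demonstration: a light stemmer strips a small set of inflectional suffixes, while an aggressive one also strips derivational suffixes, producing shorter stems.

```rust
// Toy illustration only — NOT the Dolamic algorithm. The suffix lists
// are hypothetical; real stemmers encode language-specific rules.
fn strip_suffixes(word: &str, suffixes: &[&str]) -> String {
    for s in suffixes {
        // Only strip when a reasonably long stem remains.
        if word.len() > s.len() + 2 {
            if let Some(stem) = word.strip_suffix(s) {
                return stem.to_string();
            }
        }
    }
    word.to_string()
}

fn main() {
    // Hypothetical "light" and "aggressive" suffix lists.
    let light = ["ing", "s"];
    let aggressive = ["ation", "ing", "er", "s"];

    println!("{}", strip_suffixes("stemming", &light));         // "stemm"
    println!("{}", strip_suffixes("computation", &light));      // unchanged
    println!("{}", strip_suffixes("computation", &aggressive)); // "comput"
}
```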

  • Danish

    The Danish Snowball algorithm was obtained under the BSD license from the official Snowball website.

  • Dutch

    The Dutch Snowball algorithm was obtained under the BSD license from the official Snowball website.

  • English

    Three English algorithms in Snowball are available from the official Snowball website: Porter, Porter 2 and Lovins. (At least) the first two algorithms were developed by Dr. Martin Porter. The newer Porter 2 algorithm is used as the default in this library. If you wish, you can use the original Porter algorithm (english_porter) or the Lovins algorithm (english_lovins) instead.

  • Estonian

    The Estonian Snowball algorithm was developed by Linda Freienthal in 2019 and obtained under the BSD license from the official Snowball website.

  • Finnish

    The Finnish Snowball algorithm was obtained under the BSD license from the official Snowball website.

  • French

    The French Snowball algorithm was obtained under the BSD license from the official Snowball website.

  • German

    The German Snowball algorithm was obtained under the BSD license from the official Snowball website.

  • Greek

    The Greek Snowball algorithm has been developed by Georgios Ntais in 2006 and later enhanced by Spyridon Saroukos in 2008. The source code has been obtained under the BSD license from the official Snowball website.

  • Hindi

    The Hindi (lightweight) Snowball algorithm was developed by A. Ramanathan and D. Rao in 2003. Its source code has been obtained under the BSD license from the official Snowball website.

  • Hungarian

    The Hungarian Snowball algorithm was obtained under the BSD license from the official Snowball website.

  • Indonesian

    The Indonesian Snowball algorithm was developed by Fadillah Z. Tala in 2003 and its source code has been obtained under the BSD license from the official Snowball website.

  • Irish (Gaelic)

    The Irish (Gaelic) Snowball algorithm was obtained under the BSD license from the official Snowball website.

  • Italian

    The Italian Snowball algorithm was obtained under the BSD license from the official Snowball website.

  • Lithuanian

    The Lithuanian Snowball algorithm (LithuanianJocas) was contributed by Dainius Jocas. Its source code has been obtained under the BSD license from the official Snowball website.

  • Nepali

    The Nepali Snowball algorithm was developed by Ingroj Shrestha, Oleg Bartunov and Shreeya Singh. Its source code has been obtained under the BSD license from the official Snowball GitHub repository.

  • Norwegian (Bokmål)

    The Norwegian (Bokmål variant) Snowball algorithm was obtained under the BSD license from the official Snowball website.

  • Polish

    While there are a few distinct stemming algorithms for the Polish language, there is not a single Polish (OSS) stemming algorithm implemented in the Snowball language. Notably, the most popular stemming algorithm, Stempel, is implemented in Java. There are also ports of it to Python and Go.

    This library includes one Polish stemming algorithm in two variants: polish_yarovoy and polish_yarovoy_unaccented. It has been ported to Rust from a Go implementation by Nikolay Yarovoy, which in turn was inspired by a Python implementation by Błażej Kubiński. The polish_yarovoy variant stems Polish words and leaves accents as they are, while polish_yarovoy_unaccented stems Polish words and also removes all accents.
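
    The accent-removal step of the unaccented variant can be sketched as a simple character mapping. This is a minimal illustration, not the actual yarovoy implementation:

```rust
// Minimal sketch of accent removal (NOT the actual yarovoy code):
// map Polish accented characters to their unaccented ASCII
// counterparts and leave everything else untouched.
fn remove_polish_accents(word: &str) -> String {
    word.chars()
        .map(|c| match c {
            'ą' => 'a', 'ć' => 'c', 'ę' => 'e', 'ł' => 'l',
            'ń' => 'n', 'ó' => 'o', 'ś' => 's', 'ź' => 'z',
            'ż' => 'z',
            'Ą' => 'A', 'Ć' => 'C', 'Ę' => 'E', 'Ł' => 'L',
            'Ń' => 'N', 'Ó' => 'O', 'Ś' => 'S', 'Ź' => 'Z',
            'Ż' => 'Z',
            other => other,
        })
        .collect()
}

fn main() {
    println!("{}", remove_polish_accents("żółć"));    // "zolc"
    println!("{}", remove_polish_accents("książka")); // "ksiazka"
}
```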

  • Portuguese

    The Portuguese Snowball algorithm was obtained under the BSD license from the official Snowball website.

  • Romanian

    Three Snowball algorithms for the Romanian language are available: Romanian, RomanianHeidelberg and RomanianTirdea. All algorithms were obtained under the BSD license from the official Snowball website.

    The RomanianHeidelberg algorithm has been developed in 2006 by Marina Stegarescu, Doina Gliga and Erwin Glockner at the Ruprecht-Karls-University of Heidelberg (Department of Computational Linguistics).

    The RomanianTirdea has been developed in 2006 by Irina Tirdea.

  • Russian

    The Russian Snowball algorithm was obtained under the BSD license from the official Snowball website.

  • Spanish

    The Spanish Snowball algorithm was obtained under the BSD license from the official Snowball website.

  • Swedish

    The Swedish Snowball algorithm was obtained under the BSD license from the official Snowball website.

  • Turkish

    The Turkish Snowball algorithm was developed by Evren (Kapusuz) Çilden in 2007. The source code has been obtained under the BSD license from the official Snowball website.

    Note from the Snowball website

    The Turkish stemming algorithm was provided by Evren Kapusuz Cilden. It stems only noun and nominal verb suffixes because noun stems are more important for information retrieval, and only handling these simplifies the algorithm significantly.

    In her paper (linked above), Çilden explains:

    The stemmer can be enhanced to stem all kinds of verb suffixes. In Turkish, there are over fifty suffixes that can be affixed to verbs [2]. The morphological structure of verb suffixes is more complicated than noun suffixes. Despite this, one can use the methodology presented in this paper to enhance the stemmer to find stems of all kinds of Turkish words.
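
    The idea of stripping only noun suffixes can be sketched with a toy example. This is not Çilden's algorithm — real Turkish stemming must handle vowel harmony and chains of multiple suffixes — but it shows the general shape of suffix stripping:

```rust
// Toy illustration of Turkish noun-suffix stripping — NOT Çilden's
// algorithm. Here we strip only the plural markers -lar / -ler
// (the real choice between them is governed by vowel harmony).
fn strip_plural(word: &str) -> &str {
    for suffix in ["lar", "ler"] {
        if let Some(stem) = word.strip_suffix(suffix) {
            // Keep a stem of at least two characters.
            if stem.chars().count() >= 2 {
                return stem;
            }
        }
    }
    word
}

fn main() {
    println!("{}", strip_plural("kitaplar")); // "books" -> "kitap"
    println!("{}", strip_plural("evler"));    // "houses" -> "ev"
    println!("{}", strip_plural("su"));       // no suffix, unchanged
}
```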

  • Yiddish

    The Yiddish Snowball algorithm was created by Assaf Urieli in 2020 and obtained under the BSD license from the official Snowball website.

Dependencies

~2.2–3.5MB
~80K SLoC