10 releases (breaking)

0.31.0 May 28, 2024
0.29.0 Mar 18, 2024
0.27.2 Dec 30, 2023
0.27.1 Aug 25, 2023
0.27.0 Jul 10, 2023

#390 in Text processing

12,742 downloads per month
Used in 15 crates (9 directly)

MIT license

170KB
4.5K SLoC

Lindera Tokenizer

Join the chat at https://gitter.im/lindera-morphology/lindera

A morphological analysis library in Rust. This project is a fork of kuromoji-rs.

Lindera aims to build a library which is easy to install and provides concise APIs for various Rust applications.

The following products are required to build:

  • Rust >= 1.46.0

Usage

Put the following in Cargo.toml:

[dependencies]
lindera-tokenizer = { version = "0.24.0", features = ["ipadic"] }

Basic example

This example covers the basic usage of Lindera.

It will:

  • Create a tokenizer in normal mode
  • Tokenize the input text
  • Output the tokens
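
use lindera_core::viterbi::Mode;
use lindera_core::LinderaResult;
// Assumption: the dictionary config types are exported by the lindera-dictionary crate;
// check the API reference for the version you depend on.
use lindera_dictionary::{DictionaryConfig, DictionaryKind};
use lindera_tokenizer::tokenizer::{Tokenizer, TokenizerConfig};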

fn main() -> LinderaResult<()> {
    let dictionary = DictionaryConfig {
        kind: Some(DictionaryKind::IPADIC),
        path: None,
    };

    let config = TokenizerConfig {
        dictionary,
        user_dictionary: None,
        mode: Mode::Normal,
    };

    // create tokenizer
    let tokenizer = Tokenizer::from_config(config)?;

    // tokenize the text
    let tokens = tokenizer.tokenize("関西国際空港限定トートバッグ")?;

    // output the tokens
    for token in tokens {
        println!("{}", token.text);
    }

    Ok(())
}

The above example can be run as follows:

% cargo run --features=ipadic --example=ipadic_basic

You can see the result as follows:

関西国際空港
限定
トートバッグ

User dictionary example

You can supply user dictionary entries in addition to the default system dictionary. The user dictionary must be a CSV file in the following format:

<surface>,<part_of_speech>,<reading>

For example:

% cat ./resources/ipadic_simple_userdic.csv
東京スカイツリー,カスタム名詞,トウキョウスカイツリー
東武スカイツリーライン,カスタム名詞,トウブスカイツリーライン
とうきょうスカイツリー駅,カスタム名詞,トウキョウスカイツリーエキ
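
Before handing a CSV file to the tokenizer, it can help to check that every row really has the three fields described above. The following is a minimal sketch using only the standard library. The names UserEntry, parse_entry and parse_user_dictionary are hypothetical helpers for illustration and are not part of Lindera's API. It also assumes fields contain no embedded commas or quoting.

```rust
/// One row of the simple user dictionary format:
/// <surface>,<part_of_speech>,<reading>
/// (Hypothetical helper type, not part of Lindera.)
#[derive(Debug, PartialEq)]
struct UserEntry {
    surface: String,
    part_of_speech: String,
    reading: String,
}

/// Parse one CSV line. Returns None unless the line has exactly
/// three non-empty, comma-separated fields.
fn parse_entry(line: &str) -> Option<UserEntry> {
    let fields: Vec<&str> = line.split(',').map(str::trim).collect();
    if fields.len() != 3 || fields.iter().any(|f| f.is_empty()) {
        return None;
    }
    Some(UserEntry {
        surface: fields[0].to_string(),
        part_of_speech: fields[1].to_string(),
        reading: fields[2].to_string(),
    })
}

/// Parse the contents of a whole file, skipping blank lines.
/// On failure, returns the 1-based line numbers of malformed rows.
fn parse_user_dictionary(contents: &str) -> Result<Vec<UserEntry>, Vec<usize>> {
    let mut entries = Vec::new();
    let mut bad_lines = Vec::new();
    for (idx, line) in contents.lines().enumerate() {
        if line.trim().is_empty() {
            continue;
        }
        match parse_entry(line) {
            Some(entry) => entries.push(entry),
            None => bad_lines.push(idx + 1),
        }
    }
    if bad_lines.is_empty() {
        Ok(entries)
    } else {
        Err(bad_lines)
    }
}

fn main() {
    let csv = "東京スカイツリー,カスタム名詞,トウキョウスカイツリー\n\
               東武スカイツリーライン,カスタム名詞,トウブスカイツリーライン\n\
               とうきょうスカイツリー駅,カスタム名詞,トウキョウスカイツリーエキ\n";
    match parse_user_dictionary(csv) {
        Ok(entries) => {
            for e in &entries {
                println!("{} [{}] {}", e.surface, e.part_of_speech, e.reading);
            }
        }
        Err(lines) => eprintln!("malformed rows at lines {:?}", lines),
    }
}
```

A check like this only validates the shape of the file. Whether a given part-of-speech label or reading is acceptable is decided by the dictionary builder, not by this sketch.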

With a user dictionary, the tokenizer is created as follows:

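use std::path::PathBuf;

use lindera_core::viterbi::Mode;
use lindera_core::LinderaResult;
// Assumption: the dictionary config types are exported by the lindera-dictionary crate;
// check the API reference for the version you depend on.
use lindera_dictionary::{DictionaryConfig, DictionaryKind, UserDictionaryConfig};
use lindera_tokenizer::tokenizer::{Tokenizer, TokenizerConfig};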

fn main() -> LinderaResult<()> {
    let dictionary = DictionaryConfig {
        kind: Some(DictionaryKind::IPADIC),
        path: None,
    };

    let user_dictionary = Some(UserDictionaryConfig {
        kind: DictionaryKind::IPADIC,
        path: PathBuf::from("./resources/ipadic_simple_userdic.csv"),
    });

    let config = TokenizerConfig {
        dictionary,
        user_dictionary,
        mode: Mode::Normal,
    };

    let tokenizer = Tokenizer::from_config(config)?;

    // tokenize the text
    let tokens = tokenizer.tokenize("東京スカイツリーの最寄り駅はとうきょうスカイツリー駅です")?;

    // output the tokens
    for token in tokens {
        println!("{}", token.text);
    }

    Ok(())
}

The above example can be run with cargo run --example:

% cargo run --features=ipadic --example=ipadic_userdic
東京スカイツリー
の
最寄り駅
は
とうきょうスカイツリー駅
です
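
To build intuition for why user dictionary entries change the result, here is a toy illustration using only the standard library. It is not Lindera's algorithm: Lindera scores candidate segmentations against a full system dictionary, whereas this sketch does naive greedy longest-match over a tiny, hypothetical set of surfaces. The segment function and the word lists are assumptions made for this example.

```rust
use std::collections::HashSet;

/// Greedy longest-match segmentation over a set of known surfaces.
/// Characters not covered by any surface are emitted one at a time.
/// (Toy illustration only; not how Lindera segments text.)
fn segment<'a>(text: &'a str, surfaces: &HashSet<&str>) -> Vec<&'a str> {
    // Byte offsets of every character boundary, including the end of the text.
    let bounds: Vec<usize> = text
        .char_indices()
        .map(|(i, _)| i)
        .chain(std::iter::once(text.len()))
        .collect();

    let mut tokens = Vec::new();
    let mut start = 0; // index into `bounds`
    while start + 1 < bounds.len() {
        // Try the longest candidate first; fall back to a single character.
        let end = (start + 1..bounds.len())
            .rev()
            .find(|&e| surfaces.contains(&text[bounds[start]..bounds[e]]))
            .unwrap_or(start + 1);
        tokens.push(&text[bounds[start]..bounds[end]]);
        start = end;
    }
    tokens
}

fn main() {
    let text = "東京スカイツリーの最寄り駅はとうきょうスカイツリー駅です";

    // Tiny stand-in for a system dictionary (hypothetical entries).
    let mut surfaces: HashSet<&str> = ["東京", "スカイ", "ツリー", "最寄り駅", "駅", "です"]
        .iter()
        .copied()
        .collect();
    println!("without user entries: {:?}", segment(text, &surfaces));

    // Adding the user dictionary surfaces keeps the compounds in one piece.
    surfaces.insert("東京スカイツリー");
    surfaces.insert("とうきょうスカイツリー駅");
    println!("with user entries:    {:?}", segment(text, &surfaces));
}
```

In this toy, the compound names are split into smaller pieces until their surfaces are added. Registering them makes the longest match cover the whole compound, which mirrors the effect seen in the output above.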

API reference

The API reference is available at the following URL:

Dependencies

~11MB
~236K SLoC