
bin+lib sudachiclone

sudachiclone-rs is a Rust version of Sudachi, a Japanese morphological analyzer

4 releases

0.2.1 Mar 17, 2020
0.2.0 Feb 20, 2020
0.1.1 Feb 16, 2020
0.1.0 Feb 15, 2020


Apache-2.0

240KB
8K SLoC

sudachiclone-rs - a SudachiPy clone in Rust


sudachiclone-rs is a Rust version of Sudachi, a Japanese morphological analyzer.

Install CLI

Setup 1. Install sudachiclone

sudachiclone is distributed via crates.io. Install it by running cargo install sudachiclone from the command line.

$ cargo install sudachiclone

Setup 2. Install dictionary

The default dictionary package, SudachiDict_core, is distributed from the WorksApplications download site. Install it with pip as shown below:

$ pip install https://object-storage.tyo2.conoha.io/v1/nc_2520839e1f9641b08211a5c85243124a/sudachi/SudachiDict_core-20200127.tar.gz

Usage CLI

After installing sudachiclone, you can use it in the terminal via the sudachiclone command.

You can execute sudachiclone with standard input like this:

$ sudachiclone
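The same standard-input invocation can be driven from another Rust program. The following is a minimal sketch using only the standard library; the helper name `run_with_stdin` is hypothetical and not part of sudachiclone, and it assumes the `sudachiclone` binary is on your `PATH`.

```rust
use std::io::{self, Write};
use std::process::{Command, Stdio};

/// Hypothetical helper: runs `program` with `args`, writes `input` to its
/// standard input, and returns whatever it printed to standard output.
/// Intended for small inputs; large inputs should be written from a
/// separate thread so the child's output pipe cannot fill up and block.
fn run_with_stdin(program: &str, args: &[&str], input: &str) -> io::Result<String> {
    let mut child = Command::new(program)
        .args(args)
        .stdin(Stdio::piped())
        .stdout(Stdio::piped())
        .spawn()?;

    // Write the text, then drop the handle so the child sees end-of-input.
    {
        let mut stdin = child.stdin.take().expect("stdin was requested as piped");
        stdin.write_all(input.as_bytes())?;
    }

    let output = child.wait_with_output()?;
    Ok(String::from_utf8_lossy(&output.stdout).into_owned())
}
```

Calling `run_with_stdin("sudachiclone", &[], "国家公務員\n")` would then return the tokenizer's output as a string, provided the command and a dictionary are installed.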

sudachiclone has 4 subcommands in addition to help (default: tokenize):

$ sudachiclone -h
Japanese Morphological Analyzer

USAGE:
    sudachiclone [SUBCOMMAND]

FLAGS:
    -h, --help       Prints help information
    -V, --version    Prints version information

SUBCOMMANDS:
    build       Build Sudachi Dictionary
    help        Prints this message or the help of the given subcommand(s)
    link        Link Default Dict Package
    tokenize    Tokenize Text
    ubuild      Build User Dictionary

$ sudachiclone tokenize -h
sudachiclone-tokenize
Tokenize Text

USAGE:
    sudachiclone tokenize [FLAGS] [OPTIONS] [in_files]...

FLAGS:
    -h, --help       (default) see `tokenize -h`
    -a               print all of the fields
    -d               print the debug information
    -V, --version    Prints version information
    -v               print sudachipy version

OPTIONS:
    -o <fpath_out>            the output file
    -r <fpath_setting>        the setting file in JSON format
    -m <mode>                 the mode of splitting [possible values: A, B, C]

ARGS:
    <in_files>...    text written in utf-8

$ sudachiclone link -h
sudachiclone-link
Link Default Dict Package

USAGE:
    sudachiclone link [OPTIONS]

FLAGS:
    -h, --help       see `link -h`
    -V, --version    Prints version information

OPTIONS:
    -t <dict_type>        dict dict [default: core]  [possible values: small, core, full]

$ sudachiclone build -h
sudachiclone-build
Build Sudachi Dictionary

USAGE:
    sudachiclone build [FLAGS] [OPTIONS] -m [in_files]

FLAGS:
    -h, --help       see `build -h`
    -m               connection matrix file with MeCab's matrix.def format
    -V, --version    Prints version information

OPTIONS:
    -d <description>        description comment to be embedded on dictionary [default: ]
    -o <out_file>           output file (default: system.dic) [default: system.dic]

ARGS:
    <in_files>    source files with CSV format (one of more)
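
When scripting the `tokenize` subcommand, it can help to assemble its arguments in one place. The sketch below builds an argument list from the flags shown in the help text above (`-a`, `-m`, `-o`, and the input files). The `Mode` enum and the `tokenize_args` function are illustrative names, not part of sudachiclone.

```rust
/// Split granularity accepted by `-m` (values taken from the help text).
#[allow(dead_code)]
#[derive(Clone, Copy)]
enum Mode {
    A,
    B,
    C,
}

/// Hypothetical helper: builds the argument list for `sudachiclone tokenize`.
/// Only flags listed in `sudachiclone tokenize -h` are used.
fn tokenize_args(
    mode: Option<Mode>,
    all_fields: bool,
    out_file: Option<&str>,
    in_files: &[&str],
) -> Vec<String> {
    let mut args = vec!["tokenize".to_string()];

    // -a: print all of the fields
    if all_fields {
        args.push("-a".to_string());
    }

    // -m <mode>: the mode of splitting (A, B, or C)
    if let Some(m) = mode {
        let value = match m {
            Mode::A => "A",
            Mode::B => "B",
            Mode::C => "C",
        };
        args.push("-m".to_string());
        args.push(value.to_string());
    }

    // -o <fpath_out>: the output file
    if let Some(path) = out_file {
        args.push("-o".to_string());
        args.push(path.to_string());
    }

    // Positional arguments: UTF-8 input files
    args.extend(in_files.iter().map(|s| s.to_string()));
    args
}
```

The resulting vector can be passed to `std::process::Command::new("sudachiclone").args(...)`, assuming the binary is installed.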

As a Rust package

Here is an example usage:

use sudachiclone::prelude::*;

let dictionary = Dictionary::new(None, None).unwrap();
let tokenizer = dictionary.create();

// Multi-granular tokenization
// using `system_core.dic` or `system_full.dic` version 20190781
// you may not be able to replicate this particular example, depending on the dictionary you use

for m in tokenizer.tokenize("国家公務員", &Some(SplitMode::C), None).unwrap() {
    println!("{}", m.surface());
}
// => 国家公務員

for m in tokenizer.tokenize("国家公務員", &Some(SplitMode::B), None).unwrap() {
    println!("{}", m.surface());
}
// => 国家
// => 公務員

for m in tokenizer.tokenize("国家公務員", &Some(SplitMode::A), None).unwrap() {
    println!("{}", m.surface());
}
// => 国家
// => 公務
// => 員

// Morpheme information

let m = tokenizer.tokenize("食べ", &Some(SplitMode::A), None).unwrap().get(0).unwrap();
println!("{}", m.surface());
// => 食べ
println!("{}", m.dictionary_form());
// => 食べる
println!("{}", m.reading_form());
// => タベ
println!("{:?}", m.part_of_speech());
// => ["動詞", "一般", "*", "*", "下一段-バ行", "連用形-一般"]

// Normalization

println!("{}", tokenizer.tokenize("附属", &Some(SplitMode::A), None).unwrap().get(0).unwrap().normalized_form());
// => 付属

println!("{}", tokenizer.tokenize("SUMMER", &Some(SplitMode::A), None).unwrap().get(0).unwrap().normalized_form());
// => サマー

println!("{}", tokenizer.tokenize("シュミレーション", &Some(SplitMode::A), None).unwrap().get(0).unwrap().normalized_form());
// => シミュレーション

License

Apache 2.0.

Dependencies

~8–11MB
~276K SLoC