bin+lib lindera-unidic-builder

A Japanese morphological dictionary builder for UniDic

7 releases

✓ Uses Rust 2018 edition

0.3.4 May 22, 2020
0.3.3 Apr 30, 2020
0.3.2 Feb 20, 2020
0.2.0 Feb 10, 2020
0.1.0 Feb 9, 2020

MIT license

22KB
462 lines

Lindera UniDic Builder

License: MIT
Join the chat at https://gitter.im/lindera-morphology/lindera

UniDic builder for Lindera. This project is forked from fulmicoton's kuromoji-rs.

Install

% cargo install lindera-unidic-builder

Build

The following products are required to build:

  • Rust >= 1.39.0
  • make >= 3.81

% cargo build --release

Dictionary version

This project supports UniDic 2.1.2. See the UniDic site for details.

Building a dictionary

Build a dictionary with the lindera-unidic command:

% UNIDIC_VERSION=2.1.2
% curl -L -O "https://unidic.ninjal.ac.jp/unidic_archive/cwj/${UNIDIC_VERSION}/unidic-mecab-${UNIDIC_VERSION}_src.zip"
% unzip ./unidic-mecab-${UNIDIC_VERSION}_src.zip
% lindera-unidic ./unidic-mecab-${UNIDIC_VERSION}_src ./lindera-unidic-${UNIDIC_VERSION}
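
The archive URL, source directory, and output directory in the commands above all follow from the version string. As a rough sketch, the naming scheme could be captured in a small Rust helper. `UnidicPaths` and `unidic_paths` are hypothetical names for illustration and are not part of this crate, and only version 2.1.2 is stated to be supported, so other versions may not follow the same layout.

```rust
/// Names derived from a UniDic version string.
/// Hypothetical helper type; not part of lindera-unidic-builder.
struct UnidicPaths {
    /// URL of the source archive, as used with curl above.
    archive_url: String,
    /// Directory produced by unzip, passed as the first argument to lindera-unidic.
    source_dir: String,
    /// Output directory, passed as the second argument to lindera-unidic.
    output_dir: String,
}

/// Build the names used in the shell commands above for a given version.
/// Assumption: the archive layout seen for 2.1.2 applies to `version`.
fn unidic_paths(version: &str) -> UnidicPaths {
    let stem = format!("unidic-mecab-{}_src", version);
    UnidicPaths {
        archive_url: format!(
            "https://unidic.ninjal.ac.jp/unidic_archive/cwj/{}/{}.zip",
            version, stem
        ),
        source_dir: format!("./{}", stem),
        output_dir: format!("./lindera-unidic-{}", version),
    }
}
```

Calling `unidic_paths("2.1.2")` reproduces the three names that appear in the commands above.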

Dictionary format

Refer to the manual for details on the unidic-mecab dictionary format and part-of-speech tags.

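Index  Name (Japanese)                     Name (English)                                  Notes
0      品詞大分類                          Part of speech, major category
1      品詞中分類                          Part of speech, middle category
2      品詞小分類                          Part of speech, minor category
3      品詞細分類                          Part of speech, fine category
4      活用型                              Conjugation type
5      活用形                              Conjugation form
6      語彙素読み                          Lexeme reading
7      語彙素(語彙素表記 + 語彙素細分類)   Lexeme (lexeme notation + lexeme subcategory)
8      書字形出現形                        Orthographic surface form
9      発音形出現形                        Pronunciation surface form
10     書字形基本形                        Orthographic base form
11     発音形基本形                        Pronunciation base form
12     語種                                Word origin type
13     語頭変化型                          Word-initial alternation type
14     語頭変化形                          Word-initial alternation form
15     語末変化型                          Word-final alternation type
16     語末変化形                          Word-final alternation form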
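
To make the index-to-name mapping concrete, here is a minimal sketch that labels the values of a comma-separated feature string with the field names from the table. `UNIDIC_FIELDS` and `label_features` are hypothetical names, not part of this crate, and the sketch assumes a plain comma split is enough (it does not handle quoted commas inside a value).

```rust
/// Field names of a unidic-mecab feature string, in index order,
/// copied from the table above. Hypothetical constant for illustration.
const UNIDIC_FIELDS: [&str; 17] = [
    "品詞大分類", "品詞中分類", "品詞小分類", "品詞細分類",
    "活用型", "活用形", "語彙素読み", "語彙素",
    "書字形出現形", "発音形出現形", "書字形基本形", "発音形基本形",
    "語種", "語頭変化型", "語頭変化形", "語末変化型", "語末変化形",
];

/// Pair each comma-separated value with the field name at the same index.
/// A feature string with fewer than 17 values yields fewer pairs, and
/// values beyond index 16 are dropped by `zip`.
fn label_features(features: &str) -> Vec<(&'static str, &str)> {
    UNIDIC_FIELDS
        .iter()
        .copied()
        .zip(features.split(','))
        .collect()
}
```

Each returned pair is `(field name, value)`, so the entry at position 0 carries the name 品詞大分類 together with the first value of the feature string.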

Tokenizing text using the produced dictionary

You can tokenize text using the produced dictionary with the lindera command:

% echo "羽田空港限定トートバッグ" | lindera -d ./lindera-unidic-2.1.2
羽田    名詞,固有名詞,人名,姓,*,*,羽田,ハタ,ハタ
空港    名詞,普通名詞,一般,*,*,*,空港,クーコー,クーコー
限定    名詞,普通名詞,サ変可能,*,*,*,限定,ゲンテー,ゲンテー
トート  名詞,普通名詞,一般,*,*,*,トート,トート,トート
バッグ  名詞,普通名詞,一般,*,*,*,バッグ,バッグ,バッグ
EOS
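
If the output needs further processing, it can be read back line by line. The following is a minimal sketch that assumes each line has the form shown above (a surface form, whitespace, then comma-separated features) and that the output ends with an `EOS` line. `Token` and `parse_tokens` are hypothetical names and are not Lindera's API.

```rust
/// One token read back from the tokenizer's text output.
/// Hypothetical type for illustration; not part of Lindera.
#[derive(Debug, PartialEq)]
struct Token {
    surface: String,
    features: Vec<String>,
}

/// Parse lines of the form `<surface><whitespace><features>` and stop at `EOS`.
/// Assumption: the surface form itself contains no whitespace.
fn parse_tokens(output: &str) -> Vec<Token> {
    let mut tokens = Vec::new();
    for line in output.lines() {
        let line = line.trim_end();
        if line == "EOS" {
            break; // end-of-sentence marker
        }
        if line.is_empty() {
            continue;
        }
        // Split once at the first whitespace character, then drop any
        // remaining padding in front of the feature string.
        let mut parts = line.splitn(2, char::is_whitespace);
        let surface = parts.next().unwrap_or("");
        let rest = parts.next().unwrap_or("").trim_start();
        let features = if rest.is_empty() {
            Vec::new()
        } else {
            rest.split(',').map(str::to_string).collect()
        };
        tokens.push(Token {
            surface: surface.to_string(),
            features,
        });
    }
    tokens
}
```

Each `Token` keeps the surface form and its feature values in order, so a value can be looked up by its position in the feature string.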

For more details about the lindera command, please refer to the following URL:

API reference

The API reference is available. Please see the following URL:
