#japanese #morphological #dictionary #builder #ipadic

bin+lib lindera-ipadic-builder

A Japanese morphological dictionary builder for IPADIC

5 unstable releases

✓ Uses Rust 2018 edition

0.3.2 Feb 20, 2020
0.3.1 Feb 17, 2020
0.3.0 Feb 14, 2020
0.2.0 Feb 10, 2020
0.1.0 Feb 7, 2020

MIT license

23KB
510 lines

Lindera IPADIC Builder

License: MIT. Join the chat at https://gitter.im/lindera-morphology/lindera

IPADIC dictionary builder for Lindera. This project is a fork of fulmicoton's kuromoji-rs.

Install

% cargo install lindera-ipadic-builder

Build

The following tools are required to build:

  • Rust >= 1.39.0
  • make >= 3.81

% make lindera-ipadic

Dictionary version

This repository contains mecab-ipadic-2.7.0-20070801.

Building a dictionary

Build a dictionary with the lindera-ipadic command, passing the IPADIC source directory followed by the output directory:

% ./bin/lindera-ipadic ./mecab-ipadic-2.7.0-20070801 ./lindera-ipadic-2.7.0-20070801
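
If the build needs to be driven from Rust (for example from a test or a build script), the same invocation can be assembled with `std::process::Command`. This is only a sketch: the helper name `builder_command` is hypothetical, and it assumes nothing about the builder beyond the two positional arguments shown above.

```rust
use std::path::Path;
use std::process::Command;

/// Assembles the command shown above: `<program> <source_dir> <output_dir>`.
/// `builder_command` is an illustrative helper, not part of the crate.
pub fn builder_command(program: &Path, source_dir: &Path, output_dir: &Path) -> Command {
    let mut cmd = Command::new(program);
    // First positional argument: the mecab-ipadic source directory.
    // Second positional argument: the directory the built dictionary is written to.
    cmd.arg(source_dir).arg(output_dir);
    cmd
}
```

Calling `.status()` on the returned command runs the builder and reports its exit status; the caller decides how to handle a non-zero exit.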

Dictionary format

Refer to the manual for details on the IPADIC dictionary format and part-of-speech tags.

Index  Name (Japanese)  Name (English)    Notes
0      品詞             part-of-speech
1      品詞細分類1      sub POS 1
2      品詞細分類2      sub POS 2
3      品詞細分類3      sub POS 3
4      活用型           conjugation type
5      活用形           conjugation form
6      原形             base form
7      読み             reading
8      発音             pronunciation
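
The index order in the table can be mirrored in a small struct. The following is a minimal sketch, not the crate's API: `IpadicFeatures` and `parse_features` are hypothetical names, and the parser assumes a plain comma-separated string with exactly nine fields and no quoted commas.

```rust
/// Feature fields of one IPADIC entry, in the index order of the table above.
/// This type is illustrative only; it is not defined by the crate.
#[derive(Debug, PartialEq)]
pub struct IpadicFeatures<'a> {
    pub pos: &'a str,              // index 0
    pub sub_pos: [&'a str; 3],     // indices 1 to 3
    pub conjugation_type: &'a str, // index 4
    pub conjugation_form: &'a str, // index 5
    pub base_form: &'a str,        // index 6
    pub reading: &'a str,          // index 7
    pub pronunciation: &'a str,    // index 8
}

/// Splits a comma-separated feature string into its nine fields.
/// Returns `None` when the string does not contain exactly nine fields.
/// Assumption: fields never contain quoted commas.
pub fn parse_features(s: &str) -> Option<IpadicFeatures<'_>> {
    let f: Vec<&str> = s.split(',').collect();
    if f.len() != 9 {
        return None;
    }
    Some(IpadicFeatures {
        pos: f[0],
        sub_pos: [f[1], f[2], f[3]],
        conjugation_type: f[4],
        conjugation_form: f[5],
        base_form: f[6],
        reading: f[7],
        pronunciation: f[8],
    })
}
```

Fields that do not apply to an entry hold `*`, so callers that want optional values can map `*` to `None` themselves.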

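Tokenizing text using the produced dictionary

You can tokenize text using the produced dictionary with the lindera command. Each output line holds the surface form followed by its features, words missing from the dictionary are tagged UNK, and EOS marks the end of the sentence: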

% echo "羽田空港限定トートバッグ" | lindera -d ./lindera-ipadic-2.7.0-20070801
羽田空港        名詞,固有名詞,一般,*,*,*,羽田空港,ハネダクウコウ,ハネダクーコー
限定    名詞,サ変接続,*,*,*,*,限定,ゲンテイ,ゲンテイ
トートバッグ    UNK,*,*,*,*,*,*,*,*
EOS
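
Output in this shape is straightforward to post-process. The sketch below is an assumption-laden illustration rather than part of lindera: `TokenLine`, `parse_output`, and `is_unknown` are hypothetical names, and it assumes the surface form is separated from the feature string by whitespace and contains no whitespace itself.

```rust
/// One line of tokenizer output: the surface form and its raw feature string.
/// Illustrative type only; it is not defined by lindera.
#[derive(Debug, PartialEq)]
pub struct TokenLine<'a> {
    pub surface: &'a str,
    pub features: &'a str,
}

/// Collects the token lines of the first sentence, stopping at the `EOS` marker.
/// Assumption: the surface form is followed by whitespace, then the features.
pub fn parse_output(output: &str) -> Vec<TokenLine<'_>> {
    let mut tokens = Vec::new();
    for line in output.lines() {
        let line = line.trim_end();
        if line == "EOS" {
            break;
        }
        if line.is_empty() {
            continue;
        }
        // Split at the first whitespace character; the remainder may start
        // with more padding, which is trimmed off.
        let mut parts = line.splitn(2, char::is_whitespace);
        let surface = parts.next().unwrap_or("");
        let features = parts.next().unwrap_or("").trim_start();
        tokens.push(TokenLine { surface, features });
    }
    tokens
}

/// Reports whether the first feature field is `UNK`, as in the トートバッグ line above.
pub fn is_unknown(token: &TokenLine<'_>) -> bool {
    token.features.split(',').next() == Some("UNK")
}
```

Filtering on `is_unknown` is one way to list the words that the dictionary does not cover.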

For more details about the lindera command, please refer to the following URL:

API reference

The API reference is available. Please see the following URL:

Dependencies

~7MB
~123K SLoC