Lindera Core

License: MIT

A morphological analysis core library for Lindera. This project is a fork of kuromoji-rs.

This package contains the dictionary structures and the Viterbi algorithm.
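
To make the role of the Viterbi algorithm concrete, here is a minimal sketch. It is not Lindera's actual implementation: the `Edge` type, the `viterbi` function, the toy lattice, and the connection-cost closure are all hypothetical, and the sketch assumes context ID 0 for the beginning and end of the sentence. Each candidate word carries a word cost and left/right context IDs (the same fields that appear in the dictionary formats described in this README), and the chosen segmentation is the path whose word costs plus connection costs sum to the lowest total.

```rust
/// A candidate word covering positions `start..end` of the input (hypothetical type).
#[derive(Debug, Clone)]
struct Edge {
    start: usize,
    end: usize,
    left_id: u16,
    right_id: u16,
    cost: i32,
}

/// Returns the lowest total cost and the edge indices on that path, or None if
/// no sequence of edges covers `0..len`. Assumes every edge has `start < end`.
fn viterbi(edges: &[Edge], len: usize, conn: impl Fn(u16, u16) -> i32) -> Option<(i32, Vec<usize>)> {
    // best[i] = (cost of the cheapest path ending with edge i, previous edge on it).
    let mut best: Vec<Option<(i32, Option<usize>)>> = vec![None; edges.len()];
    // Visit edges by start position so that all predecessors are already final.
    let mut order: Vec<usize> = (0..edges.len()).collect();
    order.sort_by_key(|&i| edges[i].start);
    for &i in &order {
        let e = &edges[i];
        if e.start == 0 {
            // Connect from the sentence start (context ID 0 by assumption).
            best[i] = Some((conn(0, e.left_id) + e.cost, None));
            continue;
        }
        for j in 0..edges.len() {
            if edges[j].end != e.start {
                continue;
            }
            if let Some((prev_cost, _)) = best[j] {
                let total = prev_cost + conn(edges[j].right_id, e.left_id) + e.cost;
                if best[i].map_or(true, |(c, _)| total < c) {
                    best[i] = Some((total, Some(j)));
                }
            }
        }
    }
    // Pick the cheapest edge that reaches the end, including the end-of-sentence connection.
    let mut last: Option<(i32, usize)> = None;
    for i in 0..edges.len() {
        if edges[i].end != len {
            continue;
        }
        if let Some((c, _)) = best[i] {
            let total = c + conn(edges[i].right_id, 0);
            if last.map_or(true, |(b, _)| total < b) {
                last = Some((total, i));
            }
        }
    }
    // Walk the back-pointers to recover the path.
    let (total, mut cur) = last?;
    let mut path = vec![cur];
    while let Some((_, Some(prev))) = best[cur] {
        path.push(prev);
        cur = prev;
    }
    path.reverse();
    Some((total, path))
}

fn main() {
    // Toy lattice over a 3-character input: "a" + "bc" versus "abc" (made-up costs).
    let edges = vec![
        Edge { start: 0, end: 1, left_id: 1, right_id: 1, cost: 10 },
        Edge { start: 1, end: 3, left_id: 1, right_id: 1, cost: 10 },
        Edge { start: 0, end: 3, left_id: 2, right_id: 2, cost: 15 },
    ];
    // Made-up connection cost: free between equal context IDs, 5 otherwise.
    let conn = |right: u16, left: u16| if right == left { 0 } else { 5 };
    println!("{:?}", viterbi(&edges, 3, conn));
}
```

In a real dictionary the connection costs come from a precomputed matrix indexed by the right context ID of the previous word and the left context ID of the next word; the closure above only stands in for that lookup.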

Dictionary format

IPADIC

This repository uses mecab-ipadic.

IPADIC dictionary format

Refer to the manual for details on the IPADIC dictionary format and part-of-speech tags.

| Index | Name (Japanese) | Name (English) | Notes |
| --- | --- | --- | --- |
| 0 | 表層形 | Surface | |
| 1 | 左文脈ID | Left context ID | |
| 2 | 右文脈ID | Right context ID | |
| 3 | コスト | Cost | |
| 4 | 品詞 | Major POS classification | |
| 5 | 品詞細分類1 | Middle POS classification | |
| 6 | 品詞細分類2 | Small POS classification | |
| 7 | 品詞細分類3 | Fine POS classification | |
| 8 | 活用型 | Conjugation type | |
| 9 | 活用形 | Conjugation form | |
| 10 | 原形 | Base form | |
| 11 | 読み | Reading | |
| 12 | 発音 | Pronunciation | |
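
As an illustration of the index layout above, the following sketch splits one dictionary row into named fields. It is a simplified assumption rather than Lindera's parser: the `IpadicEntry` struct and the `parse_ipadic_row` function are hypothetical, the sample row uses made-up IDs and cost, and the naive `split(',')` does not handle quoted fields that contain commas.

```rust
/// One row of an IPADIC-format dictionary CSV (hypothetical type for illustration).
#[derive(Debug, PartialEq)]
struct IpadicEntry {
    surface: String,       // index 0
    left_id: u16,          // index 1
    right_id: u16,         // index 2
    cost: i16,             // index 3
    features: Vec<String>, // indices 4..=12, so features[k] is column k + 4
}

/// Parses a single 13-column row; returns a description of the problem on failure.
fn parse_ipadic_row(row: &str) -> Result<IpadicEntry, String> {
    let cols: Vec<&str> = row.trim_end().split(',').collect();
    if cols.len() != 13 {
        return Err(format!("expected 13 columns, found {}", cols.len()));
    }
    Ok(IpadicEntry {
        surface: cols[0].to_string(),
        left_id: cols[1]
            .parse::<u16>()
            .map_err(|e| format!("left context ID: {}", e))?,
        right_id: cols[2]
            .parse::<u16>()
            .map_err(|e| format!("right context ID: {}", e))?,
        cost: cols[3].parse::<i16>().map_err(|e| format!("cost: {}", e))?,
        features: cols[4..].iter().map(|s| s.to_string()).collect(),
    })
}

fn main() {
    // Illustrative row; the context IDs and cost are made up.
    let row = "東京,1293,1293,3003,名詞,固有名詞,地域,一般,*,*,東京,トウキョウ,トーキョー";
    match parse_ipadic_row(row) {
        // features[7] is column 11, the reading.
        Ok(entry) => println!("{} -> reading {}", entry.surface, entry.features[7]),
        Err(msg) => eprintln!("invalid row: {}", msg),
    }
}
```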

IPADIC user dictionary format (CSV)

IPADIC user dictionary simple version

| Index | Name (Japanese) | Name (English) | Notes |
| --- | --- | --- | --- |
| 0 | 表層形 | Surface | |
| 1 | 品詞 | Major POS classification | |
| 2 | 読み | Reading | |

IPADIC user dictionary detailed version

| Index | Name (Japanese) | Name (English) | Notes |
| --- | --- | --- | --- |
| 0 | 表層形 | Surface | |
| 1 | 左文脈ID | Left context ID | |
| 2 | 右文脈ID | Right context ID | |
| 3 | コスト | Cost | |
| 4 | 品詞 | POS | |
| 5 | 品詞細分類1 | POS subcategory 1 | |
| 6 | 品詞細分類2 | POS subcategory 2 | |
| 7 | 品詞細分類3 | POS subcategory 3 | |
| 8 | 活用型 | Conjugation type | |
| 9 | 活用形 | Conjugation form | |
| 10 | 原形 | Base form | |
| 11 | 読み | Reading | |
| 12 | 発音 | Pronunciation | |
| 13 | - | - | Columns from index 13 onward can be freely added. |
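
The two user dictionary layouts can be told apart by column count: the simple version has exactly 3 columns, and the detailed version has at least 13. The sketch below shows that check. `UserEntryKind` and `classify_user_row` are hypothetical names, the sample rows use made-up values, and this is not necessarily how Lindera itself distinguishes the two layouts.

```rust
/// Which user dictionary layout a CSV row follows (hypothetical type).
#[derive(Debug, PartialEq)]
enum UserEntryKind {
    /// Surface, major POS classification, reading.
    Simple,
    /// The full 13 columns, optionally followed by free-form extra columns.
    Detailed,
    /// Any other column count; carries the count that was found.
    Invalid(usize),
}

/// Classifies a row by counting comma-separated columns (no quoted-field handling).
fn classify_user_row(row: &str) -> UserEntryKind {
    match row.trim_end().split(',').count() {
        3 => UserEntryKind::Simple,
        n if n >= 13 => UserEntryKind::Detailed,
        n => UserEntryKind::Invalid(n),
    }
}

fn main() {
    let simple = "東京スカイツリー,カスタム名詞,トウキョウスカイツリー";
    let detailed = "東京スカイツリー,1288,1288,4000,名詞,固有名詞,一般,*,*,*,東京スカイツリー,トウキョウスカイツリー,トウキョウスカイツリー";
    println!("{:?}", classify_user_row(simple));
    println!("{:?}", classify_user_row(detailed));
    println!("{:?}", classify_user_row("only,two"));
}
```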

IPADIC NEologd

This repository uses mecab-ipadic-neologd.

IPADIC NEologd dictionary format

Refer to the manual for details on the IPADIC dictionary format and part-of-speech tags.

| Index | Name (Japanese) | Name (English) | Notes |
| --- | --- | --- | --- |
| 0 | 表層形 | Surface | |
| 1 | 左文脈ID | Left context ID | |
| 2 | 右文脈ID | Right context ID | |
| 3 | コスト | Cost | |
| 4 | 品詞 | Major POS classification | |
| 5 | 品詞細分類1 | Middle POS classification | |
| 6 | 品詞細分類2 | Small POS classification | |
| 7 | 品詞細分類3 | Fine POS classification | |
| 8 | 活用型 | Conjugation type | |
| 9 | 活用形 | Conjugation form | |
| 10 | 原形 | Base form | |
| 11 | 読み | Reading | |
| 12 | 発音 | Pronunciation | |

IPADIC NEologd user dictionary format (CSV)

IPADIC NEologd user dictionary simple version

| Index | Name (Japanese) | Name (English) | Notes |
| --- | --- | --- | --- |
| 0 | 表層形 | Surface | |
| 1 | 品詞 | Major POS classification | |
| 2 | 読み | Reading | |

IPADIC NEologd user dictionary detailed version

| Index | Name (Japanese) | Name (English) | Notes |
| --- | --- | --- | --- |
| 0 | 表層形 | Surface | |
| 1 | 左文脈ID | Left context ID | |
| 2 | 右文脈ID | Right context ID | |
| 3 | コスト | Cost | |
| 4 | 品詞 | POS | |
| 5 | 品詞細分類1 | POS subcategory 1 | |
| 6 | 品詞細分類2 | POS subcategory 2 | |
| 7 | 品詞細分類3 | POS subcategory 3 | |
| 8 | 活用型 | Conjugation type | |
| 9 | 活用形 | Conjugation form | |
| 10 | 原形 | Base form | |
| 11 | 読み | Reading | |
| 12 | 発音 | Pronunciation | |
| 13 | - | - | Columns from index 13 onward can be freely added. |

UniDic

This repository uses unidic-mecab.

UniDic dictionary format

Refer to the manual for details on the unidic-mecab dictionary format and part-of-speech tags.

| Index | Name (Japanese) | Name (English) | Notes |
| --- | --- | --- | --- |
| 0 | 表層形 | Surface | |
| 1 | 左文脈ID | Left context ID | |
| 2 | 右文脈ID | Right context ID | |
| 3 | コスト | Cost | |
| 4 | 品詞大分類 | Major POS classification | |
| 5 | 品詞中分類 | Middle POS classification | |
| 6 | 品詞小分類 | Small POS classification | |
| 7 | 品詞細分類 | Fine POS classification | |
| 8 | 活用型 | Conjugation type | |
| 9 | 活用形 | Conjugation form | |
| 10 | 語彙素読み | Lexeme reading | |
| 11 | 語彙素(語彙素表記 + 語彙素細分類) | Lexeme | |
| 12 | 書字形出現形 | Orthography appearance type | |
| 13 | 発音形出現形 | Pronunciation appearance type | |
| 14 | 書字形基本形 | Orthography basic type | |
| 15 | 発音形基本形 | Pronunciation basic type | |
| 16 | 語種 | Word type | |
| 17 | 語頭変化型 | Prefix of a word type | |
| 18 | 語頭変化形 | Prefix of a word form | |
| 19 | 語末変化型 | Suffix of a word type | |
| 20 | 語末変化形 | Suffix of a word form | |

UniDic user dictionary format (CSV)

UniDic user dictionary simple version

| Index | Name (Japanese) | Name (English) | Notes |
| --- | --- | --- | --- |
| 0 | 表層形 | Surface | |
| 1 | 品詞大分類 | Major POS classification | |
| 2 | 語彙素読み | Lexeme reading | |

UniDic user dictionary detailed version

| Index | Name (Japanese) | Name (English) | Notes |
| --- | --- | --- | --- |
| 0 | 表層形 | Surface | |
| 1 | 左文脈ID | Left context ID | |
| 2 | 右文脈ID | Right context ID | |
| 3 | コスト | Cost | |
| 4 | 品詞大分類 | Major POS classification | |
| 5 | 品詞中分類 | Middle POS classification | |
| 6 | 品詞小分類 | Small POS classification | |
| 7 | 品詞細分類 | Fine POS classification | |
| 8 | 活用型 | Conjugation type | |
| 9 | 活用形 | Conjugation form | |
| 10 | 語彙素読み | Lexeme reading | |
| 11 | 語彙素(語彙素表記 + 語彙素細分類) | Lexeme | |
| 12 | 書字形出現形 | Orthography appearance type | |
| 13 | 発音形出現形 | Pronunciation appearance type | |
| 14 | 書字形基本形 | Orthography basic type | |
| 15 | 発音形基本形 | Pronunciation basic type | |
| 16 | 語種 | Word type | |
| 17 | 語頭変化型 | Prefix of a word type | |
| 18 | 語頭変化形 | Prefix of a word form | |
| 19 | 語末変化型 | Suffix of a word type | |
| 20 | 語末変化形 | Suffix of a word form | |
| 21 | - | - | Columns from index 21 onward can be freely added. |

ko-dic

This repository uses mecab-ko-dic.

ko-dic dictionary format

Information about the dictionary format and part-of-speech tags used by mecab-ko-dic is documented in this Google Spreadsheet, linked from mecab-ko-dic's repository README.

Note that ko-dic has one fewer feature column than NAIST JDIC and carries an altogether different set of information (e.g. it does not provide the "original form" of the word).

The tags are a slight modification of those specified by the 세종 (Sejong) corpus tag set. The mappings from Sejong to mecab-ko-dic's tag names are given in tab 태그 v2.0 on the above-linked spreadsheet.

The dictionary format is specified fully (in Korean) in tab 사전 형식 v2.0 of the spreadsheet. Any blank values default to *.

| Index | Name (Korean) | Name (English) | Notes |
| --- | --- | --- | --- |
| 0 | 표면 | Surface | |
| 1 | 왼쪽 문맥 ID | Left context ID | |
| 2 | 오른쪽 문맥 ID | Right context ID | |
| 3 | 비용 | Cost | |
| 4 | 품사 태그 | Part-of-speech tag | See 태그 v2.0 tab on spreadsheet |
| 5 | 의미 부류 | Meaning (semantic class) | Uncertain; too few examples to confirm |
| 6 | 종성 유무 | Presence or absence of a final consonant | T for true; F for false; else * |
| 7 | 읽기 | Reading | Usually matches the surface, but may differ for foreign words, e.g. Chinese character words |
| 8 | 타입 | Type | One of: Inflect (활용); Compound (복합명사); or Preanalysis (기분석) |
| 9 | 첫번째 품사 | First part-of-speech | e.g. for the part-of-speech tag "VV+EM+VX+EP", this is VV |
| 10 | 마지막 품사 | Last part-of-speech | e.g. for the part-of-speech tag "VV+EM+VX+EP", this is EP |
| 11 | 표현 | Expression | Describes how an inflection, compound noun, or pre-analyzed entry is composed |
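
The Notes for columns 9 and 10 describe them in terms of the compound tag in column 4. The sketch below reproduces that example with a hypothetical helper, `first_and_last_pos`, which assumes the parts of a compound tag are joined with `+` as in "VV+EM+VX+EP". A second hypothetical helper, `or_star`, illustrates the rule that blank values default to `*`. Neither is taken from Lindera or mecab-ko-dic.

```rust
/// Splits a compound part-of-speech tag into its first and last parts.
/// For a tag without `+`, both results are the tag itself.
fn first_and_last_pos(tag: &str) -> (&str, &str) {
    let first = tag.split('+').next().unwrap_or(tag);
    let last = tag.rsplit('+').next().unwrap_or(tag);
    (first, last)
}

/// Applies the "blank values default to *" rule to a single column value.
fn or_star(value: &str) -> &str {
    if value.trim().is_empty() {
        "*"
    } else {
        value
    }
}

fn main() {
    let (first, last) = first_and_last_pos("VV+EM+VX+EP");
    println!("first = {}, last = {}", first, last);
    println!("blank column becomes {}", or_star(""));
}
```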

ko-dic user dictionary format (CSV)

ko-dic user dictionary simple version

| Index | Name (Korean) | Name (English) | Notes |
| --- | --- | --- | --- |
| 0 | 표면 | Surface | |
| 1 | 품사 태그 | Part-of-speech tag | See 태그 v2.0 tab on spreadsheet |
| 2 | 읽기 | Reading | Usually matches the surface, but may differ for foreign words, e.g. Chinese character words |

ko-dic user dictionary detailed version

| Index | Name (Korean) | Name (English) | Notes |
| --- | --- | --- | --- |
| 0 | 표면 | Surface | |
| 1 | 왼쪽 문맥 ID | Left context ID | |
| 2 | 오른쪽 문맥 ID | Right context ID | |
| 3 | 비용 | Cost | |
| 4 | 품사 태그 | Part-of-speech tag | See 태그 v2.0 tab on spreadsheet |
| 5 | 의미 부류 | Meaning (semantic class) | Uncertain; too few examples to confirm |
| 6 | 종성 유무 | Presence or absence of a final consonant | T for true; F for false; else * |
| 7 | 읽기 | Reading | Usually matches the surface, but may differ for foreign words, e.g. Chinese character words |
| 8 | 타입 | Type | One of: Inflect (활용); Compound (복합명사); or Preanalysis (기분석) |
| 9 | 첫번째 품사 | First part-of-speech | e.g. for the part-of-speech tag "VV+EM+VX+EP", this is VV |
| 10 | 마지막 품사 | Last part-of-speech | e.g. for the part-of-speech tag "VV+EM+VX+EP", this is EP |
| 11 | 표현 | Expression | Describes how an inflection, compound noun, or pre-analyzed entry is composed |
| 12 | - | - | Columns from index 12 onward can be freely added. |

CC-CEDICT

This repository uses CC-CEDICT-MeCab.

CC-CEDICT dictionary format

Refer to the manual for details on the CC-CEDICT-MeCab dictionary format and part-of-speech tags.

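| Index | Name (Chinese) | Name (English) | Notes |
| --- | --- | --- | --- |
| 0 | 表面形式 | Surface | |
| 1 | 左语境ID | Left context ID | |
| 2 | 右语境ID | Right context ID | |
| 3 | 成本 | Cost | |
| 4 | 词类 | Major POS classification | |
| 5 | 词类1 | Middle POS classification | |
| 6 | 词类2 | Small POS classification | |
| 7 | 词类3 | Fine POS classification | |
| 8 | 拼音 | Pinyin | |
| 9 | 繁体字 | Traditional | |
| 10 | 简体字 | Simplified | |
| 11 | 定义 | Definition | |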

CC-CEDICT user dictionary format (CSV)

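CC-CEDICT user dictionary simple version

| Index | Name (Chinese) | Name (English) | Notes |
| --- | --- | --- | --- |
| 0 | 表面形式 | Surface | |
| 1 | 词类 | Major POS classification | |
| 2 | 拼音 | Pinyin | |

CC-CEDICT user dictionary detailed version

| Index | Name (Chinese) | Name (English) | Notes |
| --- | --- | --- | --- |
| 0 | 表面形式 | Surface | |
| 1 | 左语境ID | Left context ID | |
| 2 | 右语境ID | Right context ID | |
| 3 | 成本 | Cost | |
| 4 | 词类 | POS | |
| 5 | 词类1 | POS subcategory 1 | |
| 6 | 词类2 | POS subcategory 2 | |
| 7 | 词类3 | POS subcategory 3 | |
| 8 | 拼音 | Pinyin | |
| 9 | 繁体字 | Traditional | |
| 10 | 简体字 | Simplified | |
| 11 | 定义 | Definition | |
| 12 | - | - | Columns from index 12 onward can be freely added. |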

API reference

The API reference is available at the following URL:
