#korean #morphological #dictionary #builder #ko-dic

lindera-ko-dic-builder

A Korean morphological dictionary builder for ko-dic

33 releases (14 breaking)

0.23.0 Feb 23, 2023
0.21.0 Jan 22, 2023
0.19.2 Dec 27, 2022
0.18.0 Oct 26, 2022
0.1.0 Feb 20, 2020

#356 in Text processing


11,298 downloads per month
Used in 14 crates (2 directly)

MIT license

70KB
1.5K SLoC

Lindera ko-dic Builder

License: MIT Join the chat at https://gitter.im/lindera-morphology/lindera

ko-dic dictionary builder for Lindera.

Dictionary version

This repository contains mecab-ko-dic-2.1.1-20180720.

Dictionary format

Information about the dictionary format and part-of-speech tags used by mecab-ko-dic is documented in this Google Spreadsheet, linked to from mecab-ko-dic's repository readme.

Note that ko-dic has one fewer feature column than NAIST JDIC and carries an altogether different set of information (e.g. it doesn't provide the "original form" of the word).

The tags are a slight modification of those specified by the 세종 (Sejong) part-of-speech tag set. The mappings from Sejong to mecab-ko-dic's tag names are given in tab 태그 v2.0 on the above-linked spreadsheet.

The dictionary format is specified fully (in Korean) in tab 사전 형식 v2.0 of the spreadsheet. Any blank values default to *.

| Index | Name (Korean) | Name (English) | Notes |
| --- | --- | --- | --- |
| 0 | 표면 | Surface | |
| 1 | 왼쪽 문맥 ID | Left context ID | |
| 2 | 오른쪽 문맥 ID | Right context ID | |
| 3 | 비용 | Cost | |
| 4 | 품사 태그 | Part-of-speech tag | See 태그 v2.0 tab on spreadsheet |
| 5 | 의미 부류 | Semantic class | Too few examples to be sure of the exact meaning |
| 6 | 종성 유무 | Presence or absence of a final consonant | T for true; F for false; else * |
| 7 | 읽기 | Reading | Usually matches the surface, but may differ for foreign words, e.g. Chinese character words |
| 8 | 타입 | Type | One of: Inflect (활용); Compound (복합명사); or Preanalysis (기분석) |
| 9 | 첫번째 품사 | First part-of-speech | e.g. given a part-of-speech tag of "VV+EM+VX+EP", would return VV |
| 10 | 마지막 품사 | Last part-of-speech | e.g. given a part-of-speech tag of "VV+EM+VX+EP", would return EP |
| 11 | 표현 | Expression | Field describing how an inflection, compound noun, or pre-analysis is composed |
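The column layout above can be read into a small struct. The following is a minimal sketch, not code from this crate: the names `KoDicEntry` and `parse_row` are hypothetical, the integer widths are assumptions, and the split on `,` assumes no quoted fields containing commas. Blank feature values are replaced by `*`, as described above.

```rust
/// One row of a mecab-ko-dic style CSV, following the 12-column layout above.
/// (Hypothetical type for illustration; not part of this crate's API.)
#[derive(Debug, PartialEq)]
struct KoDicEntry {
    surface: String,
    left_context_id: u16,
    right_context_id: u16,
    cost: i32,
    /// Columns 4 through 11, with blank values replaced by "*".
    features: Vec<String>,
}

/// Parses one CSV line. This is a naive split on ',' and assumes no quoted
/// fields that themselves contain commas.
fn parse_row(line: &str) -> Result<KoDicEntry, String> {
    let cols: Vec<&str> = line.trim_end().split(',').collect();
    if cols.len() < 12 {
        return Err(format!("expected at least 12 columns, got {}", cols.len()));
    }
    let left_context_id = cols[1]
        .trim()
        .parse::<u16>()
        .map_err(|e| format!("left context ID: {}", e))?;
    let right_context_id = cols[2]
        .trim()
        .parse::<u16>()
        .map_err(|e| format!("right context ID: {}", e))?;
    let cost = cols[3]
        .trim()
        .parse::<i32>()
        .map_err(|e| format!("cost: {}", e))?;
    // Blank feature values default to "*".
    let features = cols[4..12]
        .iter()
        .map(|c| {
            let c = c.trim();
            if c.is_empty() {
                "*".to_string()
            } else {
                c.to_string()
            }
        })
        .collect();
    Ok(KoDicEntry {
        surface: cols[0].to_string(),
        left_context_id,
        right_context_id,
        cost,
        features,
    })
}
```

For example, `parse_row("하늘,1,2,100,NNG,,T,하늘,,,,")` yields an entry whose blank columns are all `*`. The context IDs and cost in that row are made-up values, not taken from the real dictionary.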

User dictionary format (CSV)

Simple version

| Index | Name (Korean) | Name (English) | Notes |
| --- | --- | --- | --- |
| 0 | 표면 | Surface | |
| 1 | 품사 태그 | Part-of-speech tag | See 태그 v2.0 tab on spreadsheet |
| 2 | 읽기 | Reading | Usually matches the surface, but may differ for foreign words, e.g. Chinese character words |
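A simple-version row has exactly three columns. As a rough sketch of reading one (the names `SimpleUserEntry` and `parse_simple_row` are hypothetical, and this is not how Lindera itself necessarily loads user dictionaries):

```rust
/// A simple-version user dictionary row: surface, part-of-speech tag, reading.
/// (Hypothetical type for illustration; not part of this crate's API.)
#[derive(Debug, PartialEq)]
struct SimpleUserEntry {
    surface: String,
    pos_tag: String,
    reading: String,
}

/// Returns `Some` only when the line has exactly three comma-separated
/// columns and a non-empty surface. Assumes no quoted fields with commas.
fn parse_simple_row(line: &str) -> Option<SimpleUserEntry> {
    let cols: Vec<&str> = line.trim().split(',').map(str::trim).collect();
    match cols.as_slice() {
        [surface, pos_tag, reading] if !surface.is_empty() => Some(SimpleUserEntry {
            surface: surface.to_string(),
            pos_tag: pos_tag.to_string(),
            reading: reading.to_string(),
        }),
        _ => None,
    }
}
```

A line such as `린데라,NNP,린데라` parses into one entry, while a row with any other column count is rejected.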

Detailed version

| Index | Name (Korean) | Name (English) | Notes |
| --- | --- | --- | --- |
| 0 | 표면 | Surface | |
| 1 | 왼쪽 문맥 ID | Left context ID | |
| 2 | 오른쪽 문맥 ID | Right context ID | |
| 3 | 비용 | Cost | |
| 4 | 품사 태그 | Part-of-speech tag | See 태그 v2.0 tab on spreadsheet |
| 5 | 의미 부류 | Semantic class | Too few examples to be sure of the exact meaning |
| 6 | 종성 유무 | Presence or absence of a final consonant | T for true; F for false; else * |
| 7 | 읽기 | Reading | Usually matches the surface, but may differ for foreign words, e.g. Chinese character words |
| 8 | 타입 | Type | One of: Inflect (활용); Compound (복합명사); or Preanalysis (기분석) |
| 9 | 첫번째 품사 | First part-of-speech | e.g. given a part-of-speech tag of "VV+EM+VX+EP", would return VV |
| 10 | 마지막 품사 | Last part-of-speech | e.g. given a part-of-speech tag of "VV+EM+VX+EP", would return EP |
| 11 | 표현 | Expression | Field describing how an inflection, compound noun, or pre-analysis is composed |
| 12 | - | - | Columns from index 12 onward may be added freely. |
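When writing detailed rows by hand, columns 6, 9 and 10 can be derived from other columns. The sketch below makes two assumptions that the spreadsheet does not confirm: that the final consonant flag is taken from the last character of the reading, and that a tag without `+` uses itself as both first and last part-of-speech (the real dictionary may leave those columns as `*`). The function names are hypothetical.

```rust
/// Column 6: "T" if the last character is a Hangul syllable ending in a
/// final consonant, "F" if it is a Hangul syllable without one, else "*".
fn final_consonant_flag(reading: &str) -> &'static str {
    match reading.chars().last() {
        // Precomposed Hangul syllables occupy U+AC00..=U+D7A3; each block of
        // 28 code points cycles through the possible final consonants, with
        // offset 0 meaning "no final consonant".
        Some(c) if ('\u{AC00}'..='\u{D7A3}').contains(&c) => {
            if (c as u32 - 0xAC00) % 28 == 0 {
                "F"
            } else {
                "T"
            }
        }
        _ => "*",
    }
}

/// Columns 9 and 10: the first and last components of a '+'-joined tag.
/// Assumption: a tag without '+' is returned as both first and last.
fn first_and_last_pos(tag: &str) -> (&str, &str) {
    let mut parts = tag.split('+');
    let first = parts.next().unwrap_or(tag);
    let last = parts.next_back().unwrap_or(first);
    (first, last)
}
```

With these helpers, `first_and_last_pos("VV+EM+VX+EP")` gives `("VV", "EP")`, matching the examples in the table.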

How to use ko-dic dictionary

For more details about the lindera command, please refer to the following URL:

API reference

The API reference is available at the following URL:

Dependencies

~10MB
~252K SLoC