9 releases

✓ Uses Rust 2018 edition

new 0.3.0 Feb 14, 2020
0.2.1 Feb 12, 2020
0.1.6 Feb 7, 2020

#56 in Text processing

48 downloads per month

MIT license

15KB
142 lines

Lindera CLI

License: MIT Join the chat at https://gitter.im/bayard-search/lindera

A Japanese morphological analysis command-line interface for Lindera. This project is a fork of fulmicoton's kuromoji-rs.

Install

% cargo install lindera-cli

Build

The following products are required to build:

  • Rust >= 1.39.0
  • make >= 3.81

% make build

Usage

Basic usage

The CLI already includes IPADIC as the default Japanese dictionary.
You can easily tokenize the text and see the results as follows:

% echo "関西国際空港限定トートバッグ" | ./bin/lindera
関西国際空港    名詞,固有名詞,組織,*,*,*,関西国際空港,カンサイコクサイクウコウ,カンサイコクサイクーコー
限定    名詞,サ変接続,*,*,*,*,限定,ゲンテイ,ゲンテイ
トートバッグ    UNK,*,*,*,*,*,*,*,*
EOS

Switching dictionary

You can also tokenize with pre-built dictionary data instead of the default dictionary by passing its directory with the -d option. The following example uses a pre-built UniDic dictionary:

% echo "関西国際空港限定トートバッグ" | ./bin/lindera -d ../lindera-unidic-builder/lindera-unidic-2.1.2
関西    名詞,固有名詞,地名,一般,*,*,関西,カンサイ,カンサイ
国際    名詞,普通名詞,一般,*,*,*,国際,コクサイ,コクサイ
空港    名詞,普通名詞,一般,*,*,*,空港,クーコー,クーコー
限定    名詞,普通名詞,サ変可能,*,*,*,限定,ゲンテー,ゲンテー
トート  名詞,普通名詞,一般,*,*,*,トート,トート,トート
バッグ  名詞,普通名詞,一般,*,*,*,バッグ,バッグ,バッグ
EOS

Please refer to the following repository for building a dictionary:

Tokenize mode

Lindera provides two tokenization modes: normal and decompose.

normal mode (the default) tokenizes strictly according to the words registered in the dictionary:

% echo "関西国際空港限定トートバッグ" | ./bin/lindera --mode=normal
関西国際空港    名詞,固有名詞,組織,*,*,*,関西国際空港,カンサイコクサイクウコウ,カンサイコクサイクーコー
限定    名詞,サ変接続,*,*,*,*,限定,ゲンテイ,ゲンテイ
トートバッグ    UNK,*,*,*,*,*,*,*,*
EOS

decompose mode additionally splits compound nouns into their component words:

% echo "関西国際空港限定トートバッグ" | ./bin/lindera --mode=decompose
関西    名詞,固有名詞,地域,一般,*,*,関西,カンサイ,カンサイ
国際    名詞,一般,*,*,*,*,国際,コクサイ,コクサイ
空港    名詞,一般,*,*,*,*,空港,クウコウ,クーコー
限定    名詞,サ変接続,*,*,*,*,限定,ゲンテイ,ゲンテイ
トートバッグ    UNK,*,*,*,*,*,*,*,*
EOS
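
To see what decompose mode changes, the two token sequences above can be aligned against each other. The following is a minimal Python sketch, assuming that decompose mode only refines the boundaries produced by normal mode (as it does in the example above). The function name group_decomposed is hypothetical and not part of Lindera:

```python
from typing import List, Tuple


def group_decomposed(normal: List[str], decomposed: List[str]) -> List[Tuple[str, List[str]]]:
    """Map each normal-mode token to the decompose-mode tokens that cover it."""
    remaining = iter(decomposed)
    groups = []
    for token in normal:
        parts: List[str] = []
        # Consume decomposed tokens until they cover the normal-mode token.
        while sum(len(part) for part in parts) < len(token):
            part = next(remaining, None)
            if part is None:
                raise ValueError("decomposed tokens ended before covering: " + token)
            parts.append(part)
        if "".join(parts) != token:
            raise ValueError("token boundaries do not line up for: " + token)
        groups.append((token, parts))
    return groups


# Token texts copied from the normal and decompose examples above.
normal_tokens = ["関西国際空港", "限定", "トートバッグ"]
decomposed_tokens = ["関西", "国際", "空港", "限定", "トートバッグ"]

for token, parts in group_decomposed(normal_tokens, decomposed_tokens):
    print(token, "->", " / ".join(parts))
```

Only 関西国際空港 maps to more than one part here; the other tokens are unchanged between the two modes.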

Output format

Lindera provides three output formats: mecab, wakati and json.

mecab outputs results in a format like MeCab:

% echo "お待ちしております。" | ./bin/lindera --output=mecab
お待ち	名詞,サ変接続,*,*,*,*,お待ち,オマチ,オマチ
し	動詞,自立,*,*,サ変・スル,連用形,する,シ,シ
て	助詞,接続助詞,*,*,*,*,て,テ,テ
おり	動詞,非自立,*,*,五段・ラ行,連用形,おる,オリ,オリ
ます	助動詞,*,*,*,特殊・マス,基本形,ます,マス,マス
。	記号,句点,*,*,*,*,。,。,。
EOS
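
For downstream processing, this format can be parsed line by line. The following is a minimal Python sketch, assuming each token line is the surface form, a tab, and then nine comma-separated features in the same order as the fields of the json output, with EOS closing the result. The function name parse_mecab_output is hypothetical and not part of Lindera:

```python
from typing import Dict, List

# Feature order assumed to match the field order of the json output format.
FEATURE_NAMES = [
    "pos_level1", "pos_level2", "pos_level3", "pos_level4",
    "conjugation_type", "conjugate_form",
    "base_form", "reading", "pronunciation",
]


def parse_mecab_output(output: str) -> List[Dict[str, str]]:
    """Parse MeCab-style lines into dictionaries, stopping at EOS."""
    tokens = []
    for line in output.splitlines():
        if line == "EOS":
            break
        if not line:
            continue
        # The surface form and the feature list are separated by one tab.
        surface, _, features = line.partition("\t")
        token = {"text": surface}
        token.update(zip(FEATURE_NAMES, features.split(",")))
        tokens.append(token)
    return tokens


# Two lines copied from the output above.
sample = (
    "おり\t動詞,非自立,*,*,五段・ラ行,連用形,おる,オリ,オリ\n"
    "ます\t助動詞,*,*,*,特殊・マス,基本形,ます,マス,マス\n"
    "EOS\n"
)

for token in parse_mecab_output(sample):
    print(token["text"], token["pos_level1"], token["base_form"])
```

The sketch splits features on every comma, so it would need adjusting for a dictionary whose feature values can themselves contain commas.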

wakati outputs the token text separated by spaces:

% echo "お待ちしております。" | ./bin/lindera --output=wakati
お待ち し て おり ます 。

json outputs the token information in JSON format:

% echo "お待ちしております。" | ./bin/lindera --output=json
[
  {
    "text": "お待ち",
    "detail": {
      "left_id": 1283,
      "right_id": 1283,
      "word_cost": 6376,
      "pos_level1": "名詞",
      "pos_level2": "サ変接続",
      "pos_level3": "*",
      "pos_level4": "*",
      "conjugation_type": "*",
      "conjugate_form": "*",
      "base_form": "お待ち",
      "reading": "オマチ",
      "pronunciation": "オマチ"
    }
  },
  {
    "text": "し",
    "detail": {
      "left_id": 610,
      "right_id": 610,
      "word_cost": 8718,
      "pos_level1": "動詞",
      "pos_level2": "自立",
      "pos_level3": "*",
      "pos_level4": "*",
      "conjugation_type": "サ変・スル",
      "conjugate_form": "連用形",
      "base_form": "する",
      "reading": "シ",
      "pronunciation": "シ"
    }
  },
  {
    "text": "て",
    "detail": {
      "left_id": 307,
      "right_id": 307,
      "word_cost": 5170,
      "pos_level1": "助詞",
      "pos_level2": "接続助詞",
      "pos_level3": "*",
      "pos_level4": "*",
      "conjugation_type": "*",
      "conjugate_form": "*",
      "base_form": "て",
      "reading": "テ",
      "pronunciation": "テ"
    }
  },
  {
    "text": "おり",
    "detail": {
      "left_id": 1197,
      "right_id": 1197,
      "word_cost": 8773,
      "pos_level1": "動詞",
      "pos_level2": "非自立",
      "pos_level3": "*",
      "pos_level4": "*",
      "conjugation_type": "五段・ラ行",
      "conjugate_form": "連用形",
      "base_form": "おる",
      "reading": "オリ",
      "pronunciation": "オリ"
    }
  },
  {
    "text": "ます",
    "detail": {
      "left_id": 491,
      "right_id": 491,
      "word_cost": 5537,
      "pos_level1": "助動詞",
      "pos_level2": "*",
      "pos_level3": "*",
      "pos_level4": "*",
      "conjugation_type": "特殊・マス",
      "conjugate_form": "基本形",
      "base_form": "ます",
      "reading": "マス",
      "pronunciation": "マス"
    }
  },
  {
    "text": "。",
    "detail": {
      "left_id": 8,
      "right_id": 8,
      "word_cost": 215,
      "pos_level1": "記号",
      "pos_level2": "句点",
      "pos_level3": "*",
      "pos_level4": "*",
      "conjugation_type": "*",
      "conjugate_form": "*",
      "base_form": "。",
      "reading": "。",
      "pronunciation": "。"
    }
  }
]
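
The fields under detail correspond, in order, to the comma-separated features of the mecab format. The following minimal Python sketch illustrates that correspondence by rebuilding a MeCab-style line from one JSON token. It assumes the key names shown above; the function name to_mecab_line is hypothetical and not part of Lindera:

```python
import json

# Keys of "detail" that appear as features in the mecab format, in order.
FEATURE_KEYS = [
    "pos_level1", "pos_level2", "pos_level3", "pos_level4",
    "conjugation_type", "conjugate_form",
    "base_form", "reading", "pronunciation",
]


def to_mecab_line(token: dict) -> str:
    """Format one JSON token as 'surface<TAB>feature,feature,...'."""
    detail = token["detail"]
    features = ",".join(str(detail[key]) for key in FEATURE_KEYS)
    return token["text"] + "\t" + features


# One token copied from the JSON output above.
sample_json = """
[
  {
    "text": "おり",
    "detail": {
      "left_id": 1197,
      "right_id": 1197,
      "word_cost": 8773,
      "pos_level1": "動詞",
      "pos_level2": "非自立",
      "pos_level3": "*",
      "pos_level4": "*",
      "conjugation_type": "五段・ラ行",
      "conjugate_form": "連用形",
      "base_form": "おる",
      "reading": "オリ",
      "pronunciation": "オリ"
    }
  }
]
"""

for token in json.loads(sample_json):
    print(to_mecab_line(token))
```

The left_id, right_id and word_cost values appear only in the json format, so they are left out of the rebuilt line.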

When the result is output in JSON format, tokens can easily be filtered with the jq command.
For example, the following command:

  1. Tokenizes the text
  2. Filters the tokens by part of speech (名詞, noun)
  3. Joins the token texts with a space

% echo "すもももももももものうち" | ./bin/lindera --output=json |
    jq -r '.[] | select (.detail.pos_level1 =="名詞")' |
    jq -s -r '. | map(.text) | join(" ")'
すもも もも もも うち
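
If jq is not available, the same filter can be written in a few lines of a scripting language. The following is a minimal Python sketch of steps 2 and 3. The sample data is a hand-written stand-in for the CLI output, trimmed to the two fields the filter reads, and the function name join_by_pos is hypothetical:

```python
import json


def join_by_pos(json_text: str, pos: str = "名詞", sep: str = " ") -> str:
    """Keep tokens whose pos_level1 equals pos and join their texts."""
    tokens = json.loads(json_text)
    return sep.join(
        token["text"]
        for token in tokens
        if token["detail"]["pos_level1"] == pos
    )


# Stand-in for the output of `lindera --output=json`, reduced to
# "text" and "detail.pos_level1" for brevity.
sample_json = """
[
  {"text": "すもも", "detail": {"pos_level1": "名詞"}},
  {"text": "も",     "detail": {"pos_level1": "助詞"}},
  {"text": "もも",   "detail": {"pos_level1": "名詞"}},
  {"text": "も",     "detail": {"pos_level1": "助詞"}},
  {"text": "もも",   "detail": {"pos_level1": "名詞"}},
  {"text": "の",     "detail": {"pos_level1": "助詞"}},
  {"text": "うち",   "detail": {"pos_level1": "名詞"}}
]
"""

print(join_by_pos(sample_json))
```

In real use the JSON text would be read from the CLI's standard output rather than from a literal string.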

Project links

Lindera consists of several projects. They are listed below:
