13 releases (8 breaking)

new 0.30.0 Apr 13, 2024
0.29.0 Mar 18, 2024
0.28.0 Feb 23, 2024
0.27.2 Dec 30, 2023
0.23.0 Feb 23, 2023


MIT license

450KB
10K SLoC

Lindera Filter

License: MIT. Join the chat at https://gitter.im/lindera-morphology/lindera

Character and token filters for Lindera.

Character filters

Japanese iteration mark filter

Normalizes Japanese horizontal iteration marks (odoriji) to their expanded form. Sequences of iteration marks are supported. In case an illegal sequence of iteration marks is encountered, the implementation emits the illegal source character as-is without considering its script. For example, with input "?ゝ", we get "??" even though the question mark isn't hiragana.
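The expansion rule can be illustrated with a small standalone sketch. This is not Lindera's implementation: the function names are hypothetical, voiced marks (ゞ, ヾ) and the resulting voicing changes are left out, and keeping a mark unchanged when too little text precedes it is an assumption made here.

```rust
/// Returns true for the iteration marks handled by this simplified sketch:
/// kanji (々), hiragana (ゝ), and katakana (ヽ). Voiced marks are ignored.
fn is_iteration_mark(c: char) -> bool {
    matches!(c, '々' | 'ゝ' | 'ヽ')
}

/// Expands runs of iteration marks by copying the characters before the run.
fn expand_iteration_marks(text: &str) -> String {
    let chars: Vec<char> = text.chars().collect();
    let mut out = String::with_capacity(text.len());
    let mut i = 0;
    while i < chars.len() {
        if !is_iteration_mark(chars[i]) {
            out.push(chars[i]);
            i += 1;
            continue;
        }
        // Measure the run of consecutive marks starting at position i.
        let mut end = i;
        while end < chars.len() && is_iteration_mark(chars[end]) {
            end += 1;
        }
        let run = end - i;
        for k in 0..run {
            if i >= run {
                // The k-th mark repeats the character `run` positions before it,
                // whatever its script.
                out.push(chars[i + k - run]);
            } else {
                // Not enough preceding text: keep the mark as-is (assumption).
                out.push(chars[i + k]);
            }
        }
        i = end;
    }
    out
}
```

With this sketch, a run of two marks repeats the two preceding characters, so 馬鹿々々しい becomes 馬鹿馬鹿しい, and the "?ゝ" case from the description yields "??".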

Mapping filter

Replaces characters according to the specified character mappings and corrects the offsets for the resulting changes. Matching is greedy: the longest pattern that matches at a given position wins. The replacement may be the empty string.
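As a rough illustration of greedy matching with offset bookkeeping, the sketch below scans the text, picks the longest matching pattern at each position, and records which input byte range became which output byte range. The names `apply_mappings` and `Replaced` are hypothetical, and the linear scan over patterns is a simplification, not how Lindera does it.

```rust
/// One replacement: the byte range in the input and the range it became in the output.
#[derive(Debug, PartialEq)]
struct Replaced {
    input: (usize, usize),
    output: (usize, usize),
}

/// Applies (pattern, replacement) pairs to `text`, longest pattern first.
fn apply_mappings(text: &str, mappings: &[(&str, &str)]) -> (String, Vec<Replaced>) {
    let mut out = String::new();
    let mut spans = Vec::new();
    let mut pos = 0; // byte offset into `text`
    while pos < text.len() {
        let rest = &text[pos..];
        // Greedy choice: among the patterns that match here, take the longest.
        let best = mappings
            .iter()
            .filter(|(pat, _)| !pat.is_empty() && rest.starts_with(*pat))
            .max_by_key(|(pat, _)| pat.len());
        match best {
            Some((pat, rep)) => {
                let start = out.len();
                out.push_str(rep); // may be empty, which deletes the match
                spans.push(Replaced {
                    input: (pos, pos + pat.len()),
                    output: (start, out.len()),
                });
                pos += pat.len();
            }
            None => {
                // No pattern matches: copy one character through unchanged.
                let c = rest.chars().next().unwrap();
                out.push(c);
                pos += c.len_utf8();
            }
        }
    }
    (out, spans)
}
```

Given the mappings `a -> 1`, `ab -> 2`, and `x -> ""`, the input `abxa` becomes `21`: `ab` wins over `a` at the start, and `x` is deleted. The recorded spans are what a caller would use to map output positions back to the original text.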

Regex filter

Replaces text that matches a regular expression with the specified replacement string.

Unicode normalize filter

Normalizes the input text using the specified Unicode normalization form: NFC, NFD, NFKC, or NFKD.

Token filters

Japanese base form filter

Replace the term text with the base form registered in the morphological dictionary. This acts as a lemmatizer for verbs and adjectives.

Japanese compound word filter

Combines consecutive tokens that have the specified part-of-speech tags into a single token. This is useful for handling compound words that are not registered in the morphological dictionary.
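A minimal sketch of the merging step, under stated assumptions: the `Token` struct, the `compound_tokens` function, and the `new_tag` parameter are hypothetical stand-ins rather than Lindera's types, the tags are simplified, and leaving a lone matching token unchanged is a choice made for this example.

```rust
use std::collections::HashSet;

/// Hypothetical token type for illustration: surface text plus one tag.
#[derive(Debug, Clone, PartialEq)]
struct Token {
    text: String,
    tag: String,
}

/// Merges each run of consecutive tokens whose tag is in `tags`.
fn compound_tokens(tokens: Vec<Token>, tags: &HashSet<&str>, new_tag: &str) -> Vec<Token> {
    // Flushes the pending run: a single token is kept as-is, several are merged.
    fn flush(run: &mut Vec<Token>, out: &mut Vec<Token>, new_tag: &str) {
        if run.len() > 1 {
            let text: String = run.iter().map(|t| t.text.as_str()).collect();
            out.push(Token { text, tag: new_tag.to_string() });
            run.clear();
        } else {
            out.append(run);
        }
    }

    let mut out: Vec<Token> = Vec::new();
    let mut run: Vec<Token> = Vec::new();
    for token in tokens {
        if tags.contains(token.tag.as_str()) {
            run.push(token);
        } else {
            flush(&mut run, &mut out, new_tag);
            out.push(token);
        }
    }
    flush(&mut run, &mut out, new_tag);
    out
}
```

For example, three consecutive noun tokens 関西, 国際, and 空港 would come out as the single token 関西国際空港 carrying `new_tag`, while a following particle is passed through untouched.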

Japanese katakana stem filter

Normalizes common katakana spelling variations that end with a prolonged sound mark (ー, U+30FC) by removing that character. Only katakana words longer than the minimum length are stemmed.
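The stemming rule is small enough to sketch directly. The function name and the `min_len` parameter are hypothetical, and whether a word of exactly the minimum length is stemmed is an assumption here (this sketch treats it as eligible).

```rust
const PROLONGED_SOUND_MARK: char = '\u{30FC}'; // ー

/// True if `c` lies in the Unicode Katakana block (U+30A0..=U+30FF).
fn is_katakana(c: char) -> bool {
    ('\u{30A0}'..='\u{30FF}').contains(&c)
}

/// Drops a trailing prolonged sound mark from an all-katakana word.
/// Length is counted in characters, not bytes.
fn stem_katakana(word: &str, min_len: usize) -> &str {
    let len = word.chars().count();
    let all_katakana = word.chars().all(is_katakana);
    // Assumption: words with exactly `min_len` characters are also stemmed.
    if all_katakana && len >= min_len && word.ends_with(PROLONGED_SOUND_MARK) {
        &word[..word.len() - PROLONGED_SOUND_MARK.len_utf8()]
    } else {
        word
    }
}
```

With `min_len` set to 4, コンピューター is stemmed to コンピュータ, while the three-character コピー is left alone so that short words are not mangled.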

Japanese keep tags filter

Keep only tokens with the specified part-of-speech tag.
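As an illustration only, tag-based keeping amounts to a filter over (text, tag) pairs. The tuple representation, the function name, and the simplified single-level tags below are assumptions for this sketch; real dictionary tags are typically more detailed.

```rust
/// Keeps only the (text, tag) pairs whose tag appears in `tags`.
fn keep_tags<'a>(tokens: Vec<(&'a str, &'a str)>, tags: &[&str]) -> Vec<(&'a str, &'a str)> {
    tokens
        .into_iter()
        // A token survives if its tag equals any of the allowed tags.
        .filter(|(_, tag)| tags.iter().any(|t| *t == *tag))
        .collect()
}
```

A stop-tags filter is the same idea with the predicate negated.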

Japanese number filter

Convert tokens representing Japanese numerals, including Kanji numerals, to Arabic numerals.
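To make the conversion concrete, here is a simplified sketch of parsing Kanji numerals. It is not Lindera's algorithm: the function names are hypothetical, only the digits 〇 to 九 and the units 十, 百, 千, 万, and 億 are handled, and malformed unit orderings are not rejected.

```rust
/// Value of a Kanji digit (〇 to 九), if `c` is one.
fn digit_value(c: char) -> Option<u64> {
    "〇一二三四五六七八九".chars().position(|d| d == c).map(|i| i as u64)
}

/// Parses a Kanji numeral such as 三千二百十五; returns None for other text
/// or on overflow.
fn kanji_to_number(text: &str) -> Option<u64> {
    if text.is_empty() {
        return None;
    }
    let mut total: u64 = 0; // completed 万 / 億 groups
    let mut section: u64 = 0; // value accumulated below the next large unit
    let mut digits: Option<u64> = None; // digits seen since the last unit
    for c in text.chars() {
        if let Some(d) = digit_value(c) {
            // Consecutive digits are read positionally, as in 二〇二四.
            digits = Some(digits.unwrap_or(0).checked_mul(10)?.checked_add(d)?);
            continue;
        }
        match c {
            '十' | '百' | '千' => {
                let unit = match c { '十' => 10, '百' => 100, _ => 1_000 };
                // A bare unit such as 十 counts as one of that unit.
                section = section.checked_add(digits.unwrap_or(1).checked_mul(unit)?)?;
            }
            '万' | '億' => {
                let unit: u64 = if c == '万' { 10_000 } else { 100_000_000 };
                let group = section.checked_add(digits.unwrap_or(0))?;
                let group = if group == 0 { 1 } else { group };
                total = total.checked_add(group.checked_mul(unit)?)?;
                section = 0;
            }
            _ => return None, // not a numeral
        }
        digits = None;
    }
    total.checked_add(section)?.checked_add(digits.unwrap_or(0))
}

/// Rewrites numeral tokens as Arabic numerals and leaves other tokens alone.
fn convert_numerals(tokens: &[&str]) -> Vec<String> {
    tokens
        .iter()
        .map(|t| match kanji_to_number(t) {
            Some(n) => n.to_string(),
            None => t.to_string(),
        })
        .collect()
}
```

Under these rules 三千二百十五 parses as 3215 and 一億二千万 as 120000000, while a non-numeral token such as 円 passes through unchanged.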

Japanese reading form filter

Replace the text of a token with the reading of the text as registered in the morphological dictionary. The reading is in katakana.

Japanese stop tags filter

Remove tokens with the specified part-of-speech tag.

Keep words filter

Keep only tokens whose text matches one of the specified words.

Korean keep tags filter

Keep only tokens with the specified part-of-speech tag.

Korean reading form filter

Replace the text of a token with the reading of the text as registered in the morphological dictionary.

Korean stop tags filter

Remove tokens with the specified part-of-speech tag.

Length filter

Keep only tokens whose text length, counted in characters, meets the specified limits.
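The point worth illustrating is that length is measured in characters rather than bytes, which matters for Japanese and Korean text. In the sketch below the function name and the optional `min` and `max` bounds are hypothetical parameters chosen for the example.

```rust
/// Keeps tokens whose character count lies within the optional bounds (inclusive).
fn length_filter<'a>(tokens: &[&'a str], min: Option<usize>, max: Option<usize>) -> Vec<&'a str> {
    tokens
        .iter()
        .copied()
        .filter(|t| {
            // chars().count() counts Unicode scalar values, not UTF-8 bytes.
            let n = t.chars().count();
            min.map_or(true, |m| n >= m) && max.map_or(true, |m| n <= m)
        })
        .collect()
}
```

With bounds of 2 and 3, the two-character 東京 is kept even though it occupies six bytes in UTF-8, while に and 東京都庁 are dropped.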

Lowercase filter

Normalizes token text to lowercase.

Mapping filter

Replace characters with the specified character mappings.

Stop words filter

Remove tokens whose text matches one of the specified words.
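Word-based filtering can be sketched as a set lookup on the token text. The function below is a hypothetical illustration, not the crate's API; its `keep` flag switches between keep-words behavior and stop-words behavior.

```rust
use std::collections::HashSet;

/// Filters tokens by exact text match against `words`.
/// With `keep == true` only listed words survive; otherwise listed words are removed.
fn filter_words<'a>(tokens: &[&'a str], words: &HashSet<&str>, keep: bool) -> Vec<&'a str> {
    tokens
        .iter()
        .copied()
        // Membership must equal `keep` for the token to pass.
        .filter(|t| words.contains(*t) == keep)
        .collect()
}
```

Calling it with `keep` set to `false` and a set containing は and の removes those particles from a token list and leaves the remaining tokens in their original order.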

Uppercase filter

Normalizes token text to uppercase.

API reference

The API reference is available. Please see the following URL:
