#text #nlp #library

wordcut-engine

Word segmentation/breaking library

12 releases (5 stable)

1.1.4 Sep 14, 2022
1.0.0 Dec 22, 2021
0.6.0 Dec 22, 2021
0.5.0 Jul 27, 2021
0.1.1 Dec 27, 2017

#268 in Text processing

Download history 57/week @ 2022-06-04 10/week @ 2022-06-11 28/week @ 2022-06-18 21/week @ 2022-06-25 23/week @ 2022-07-02 90/week @ 2022-07-09 71/week @ 2022-07-16 30/week @ 2022-07-23 29/week @ 2022-07-30 63/week @ 2022-08-06 22/week @ 2022-08-13 28/week @ 2022-08-20 18/week @ 2022-08-27 17/week @ 2022-09-03 107/week @ 2022-09-10 40/week @ 2022-09-17

191 downloads per month
Used in 2 crates

Apache-2.0

320KB
1K SLoC

wordcut-engine

Word segmentation library in Rust

Example

use wordcut_engine::load_dict;
use wordcut_engine::Wordcut;
use std::path::Path;

fn main() {
    let dict_path = Path::new(concat!(
        env!("CARGO_MANIFEST_DIR"),
        "/dict.txt"
    ));
    let dict = load_dict(dict_path).unwrap();
    let wordcut = Wordcut::new(dict);
    println!("{}", wordcut.put_delimiters("หมากินไก่", "|"));
}

Algorithm

wordcut-engine has three steps:

  1. Identifying clusters, which are substrings that must not be split
  2. Identifying edges of split directed acyclic graph (split-DAG); The program does not add edges that break any cluster to the graph.
  3. Tokenizing a string by finding the shortest path in the split-DAG

Identifying clusters

Identifying clusters identify which substrings that must not be split.

  1. Wrapping regular expressions with parentheses

For example,

[-][-][-]
[-][-][ะาำ]

The above rules are wrapped with parentheses as shown below:

([-])
([-][-])
([-][-][ะาำ])
  1. Joining regular expressions with vertical bars (|)

for example,

([-])|([-][-])|([-][-][ะาำ])
  1. Building a DFA from the joined regular expression using regex-automata

  2. Creating a directed acyclic graph (DAG) by adding edges using the DFA

  3. Identifying clusters following a shortest path of a DAG from step above

Note: wordcut-engine does not allow a context sensitive rule, since it hurts the performance too much. Moreover, instead of longest matching, we use a DAG, and its shortest path to contraint cluster boundary by another cluster, therefore newmm-style context sensitive rules are not required.

Identifying split-DAG edges

In contrary to identifying clusters, identifying split-DAG edges identify what must be split. Split-DAG edge makers, wordcut-engine has three types of split-DAG edge maker, that are:

  1. Dictionary-based maker
  2. Rule-based maker
  3. Default maker (Unk edge builder)

The dictionary-based maker traverses a prefix tree, which is particularly a trie in wordcut-engine and create an edge that matched word in the prefix tree. Rule-based maker uses regex-automata's Regex matcher built from split rules to find longest matched substrings, and add corresponding edges to the graph. wordcut-engine removes edges that break clusters. The example of split rules are shown below:

[\r\t\n ]+
[A-Za-z]+
[0-9]+
[-]+
[\(\)"'`\[\]{}\\/]

If there is no edge for each of character indice yet, a default maker create a edge that connected the known rightmost boundary.

Dependencies

~3.5–4.5MB
~95K SLoC