#text #nlp #library


Word segmentation/breaking library

13 releases (6 stable)

1.1.5 Nov 5, 2022
1.1.4 Sep 14, 2022
1.0.0 Dec 22, 2021
0.6.0 Dec 22, 2021
0.1.1 Dec 27, 2017

#164 in Text processing

Download history 10/week @ 2022-11-30 29/week @ 2022-12-07 37/week @ 2022-12-14 21/week @ 2022-12-21 19/week @ 2022-12-28 2/week @ 2023-01-04 19/week @ 2023-01-11 16/week @ 2023-01-18 22/week @ 2023-01-25 28/week @ 2023-02-01 23/week @ 2023-02-08 60/week @ 2023-02-15 21/week @ 2023-02-22 41/week @ 2023-03-01 14/week @ 2023-03-08 13/week @ 2023-03-15

108 downloads per month
Used in 2 crates




Word segmentation library in Rust


use wordcut_engine::load_dict;
use wordcut_engine::Wordcut;
use std::path::Path;

fn main() {
    let dict_path = Path::new(concat!(
    let dict = load_dict(dict_path).unwrap();
    let wordcut = Wordcut::new(dict);
    println!("{}", wordcut.put_delimiters("หมากินไก่", "|"));


wordcut-engine has three steps:

  1. Identifying clusters, which are substrings that must not be split
  2. Identifying edges of split directed acyclic graph (split-DAG); The program does not add edges that break any cluster to the graph.
  3. Tokenizing a string by finding the shortest path in the split-DAG

Identifying clusters

Identifying clusters identify which substrings that must not be split.

  1. Wrapping regular expressions with parentheses

For example,


The above rules are wrapped with parentheses as shown below:

  1. Joining regular expressions with vertical bars (|)

for example,

  1. Building a DFA from the joined regular expression using regex-automata

  2. Creating a directed acyclic graph (DAG) by adding edges using the DFA

  3. Identifying clusters following a shortest path of a DAG from step above

Note: wordcut-engine does not allow a context sensitive rule, since it hurts the performance too much. Moreover, instead of longest matching, we use a DAG, and its shortest path to contraint cluster boundary by another cluster, therefore newmm-style context sensitive rules are not required.

Identifying split-DAG edges

In contrary to identifying clusters, identifying split-DAG edges identify what must be split. Split-DAG edge makers, wordcut-engine has three types of split-DAG edge maker, that are:

  1. Dictionary-based maker
  2. Rule-based maker
  3. Default maker (Unk edge builder)

The dictionary-based maker traverses a prefix tree, which is particularly a trie in wordcut-engine and create an edge that matched word in the prefix tree. Rule-based maker uses regex-automata's Regex matcher built from split rules to find longest matched substrings, and add corresponding edges to the graph. wordcut-engine removes edges that break clusters. The example of split rules are shown below:

[\r\t\n ]+

If there is no edge for each of character indice yet, a default maker create a edge that connected the known rightmost boundary.


~104K SLoC