7 stable releases
Version | Released |
---|---|
1.3.2 | Nov 12, 2021 |
1.3.1 | Nov 8, 2021 |
1.2.0 | Aug 2, 2021 |
1.1.2 | Jul 20, 2021 |
#739 in Text processing
Used in nlpo3-cli
65KB, 1.5K SLoC
nlpO3
Thai Natural Language Processing library in Rust, with Python and Node bindings. Formerly oxidized-thainlp.
Features
- Thai word tokenizer
  - uses a maximal-matching, dictionary-based tokenization algorithm and honors Thai Character Cluster boundaries
  - 2.5x faster than a similar pure-Python implementation (PyThaiNLP's newmm)
  - loads a dictionary from a plain text file (one word per line) or from Vec<String>
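To make the idea of dictionary-based matching concrete, here is a minimal sketch of greedy longest-match segmentation. It is a simplification written for illustration, not nlpO3's implementation: it does not search for the best overall path and it ignores Thai Character Cluster boundaries. The function name `longest_match_segment` is hypothetical and is not part of the nlpO3 API.

```rust
use std::collections::HashSet;

/// Greedy longest-match segmentation over a word set.
/// Hypothetical helper for illustration; not part of nlpO3.
fn longest_match_segment(text: &str, dict: &HashSet<String>) -> Vec<String> {
    // Work on chars, not bytes, because Thai characters are multi-byte in UTF-8.
    let chars: Vec<char> = text.chars().collect();
    // The longest dictionary entry (in chars) bounds how far we look ahead.
    let max_len = dict
        .iter()
        .map(|w| w.chars().count())
        .max()
        .unwrap_or(1)
        .max(1);

    let mut tokens: Vec<String> = Vec::new();
    let mut pos = 0;
    while pos < chars.len() {
        let limit = max_len.min(chars.len() - pos);
        // Try the longest candidate first; fall back to a single char
        // when nothing in the dictionary matches at this position.
        let len = (1..=limit)
            .rev()
            .find(|&n| {
                let candidate: String = chars[pos..pos + n].iter().collect();
                dict.contains(&candidate)
            })
            .unwrap_or(1);
        tokens.push(chars[pos..pos + len].iter().collect());
        pos += len;
    }
    tokens
}

fn main() {
    let dict: HashSet<String> = ["ฉัน", "กิน", "ข้าว"]
        .iter()
        .map(|w| w.to_string())
        .collect();
    let tokens = longest_match_segment("ฉันกินข้าว", &dict);
    println!("{:?}", tokens);
}
```

A greedy matcher like this can pick a long word early and leave an unmatchable remainder, which is one reason a real tokenizer weighs alternative segmentations and cluster boundaries.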
Dictionary file
- To keep the library small, nlpO3 does not bundle a dictionary and does not assume which dictionary the developer would like to use. The dictionary-based word tokenizer needs one, so you must supply it.
- For a tokenization dictionary, try:
  - words_th.txt from PyThaiNLP - around 62,000 words (CC0)
  - word break dictionary from libthai - consists of dictionaries in different categories, with a make script (LGPL-2.1)
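The plain-text dictionary format is one word per line. As a rough sketch of what that format implies, the following turns such text into a Vec<String>, the same shape that the word-list constructor in the Usage section accepts. The helper name `parse_word_list` is hypothetical, and trimming whitespace and skipping blank lines are assumptions made here, not documented nlpO3 behavior.

```rust
/// Parse dictionary text with one word per line into a word list.
/// Hypothetical helper for illustration; not part of nlpO3.
fn parse_word_list(text: &str) -> Vec<String> {
    text.lines()
        .map(str::trim) // assumption: surrounding whitespace is not significant
        .filter(|line| !line.is_empty()) // assumption: blank lines are ignored
        .map(String::from)
        .collect()
}

fn main() {
    let raw = "ฉัน\nกิน\n\n  ข้าว  \n";
    let words = parse_word_list(raw);
    println!("{} words loaded", words.len());
}
```

Preparing the list yourself like this can be handy when merging several dictionary files or filtering entries before building a tokenizer.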
Usage
Command-line interface
echo "ฉันกินข้าว" | nlpo3 segment
Bindings
Python:
from nlpo3 import load_dict, segment
load_dict("path/to/dict.file", "dict_name")
segment("สวัสดีครับ", "dict_name")
As Rust library
In Cargo.toml:
[dependencies]
# ...
nlpo3 = "1.3.2"
Create a tokenizer using a dictionary from file, then use it to tokenize a string (safe mode = true, and parallel mode = false):
use nlpo3::tokenizer::newmm::NewmmTokenizer;
use nlpo3::tokenizer::tokenizer_trait::Tokenizer;
let tokenizer = NewmmTokenizer::new("path/to/dict.file");
let tokens = tokenizer.segment("ห้องสมุดประชาชน", true, false).unwrap();
Create a tokenizer using a dictionary from a vector of Strings (declared mut so that words can be added or removed later):
let words = vec!["ปาลิเมนต์".to_string(), "คอนสติติวชั่น".to_string()];
let mut tokenizer = NewmmTokenizer::from_word_list(words);
Add words to an existing tokenizer:
tokenizer.add_word(&["มิวเซียม"]);
Remove words from an existing tokenizer:
tokenizer.remove_word(&["กระเพรา", "ชานชลา"]);
Build
Requirements
Steps
Generic test:
cargo test
Build API document and open it to check:
cargo doc --open
Build (remove --release to keep debug information):
cargo build --release
Check target/ for build artifacts.
Development documents
Issues
Please report issues at https://github.com/PyThaiNLP/nlpo3/issues
Dependencies: ~5MB, ~84K SLoC