6 releases

| Version | Date |
|---|---|
| 0.1.5 | Feb 18, 2025 |
| 0.1.4 | Feb 8, 2025 |
| 0.1.3 | Jan 21, 2025 |
#807 in Text processing
875 downloads per month
Used in yake-rust
63KB, 1.5K SLoC
segtok

A rule-based sentence segmenter (splitter) and word tokenizer based on orthographic features. Ported from the Python package of the same name (no longer maintained), with its contractions bug fixed.
```rust
use segtok::{segmenter::*, tokenizer::*};

fn main() {
    let input = include_str!("../tests/test_google.txt");

    // Split the text into sentences, then tokenize each sentence,
    // splitting contractions into separate tokens.
    let sentences: Vec<Vec<_>> = split_multi(input, SegmentConfig::default())
        .into_iter()
        .map(|span| split_contractions(web_tokenizer(&span)).collect())
        .collect();
}
```
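A minimal sketch of inspecting the nested result: the inline input string and the printed counts are illustrative assumptions; only `split_multi`, `SegmentConfig`, `web_tokenizer`, and `split_contractions` come from the example above.

```rust
use segtok::{segmenter::*, tokenizer::*};

fn main() {
    // Illustrative input; any &str works (assumption for this sketch).
    let input = "Good muffins cost $3.88 in New York. It isn't that bad, is it?";

    // Same pipeline as above: sentences first, then tokens per sentence.
    let sentences: Vec<Vec<_>> = split_multi(input, SegmentConfig::default())
        .into_iter()
        .map(|span| split_contractions(web_tokenizer(&span)).collect())
        .collect();

    // Each inner Vec holds the tokens of one sentence.
    for (i, tokens) in sentences.iter().enumerate() {
        println!("sentence {}: {} tokens", i + 1, tokens.len());
    }
}
```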
Dependencies: ~3–4MB, ~76K SLoC