#tokenizer #segmenter #word #split

segtok

Sentence segmentation and word tokenization tools

6 releases

0.1.5 Feb 18, 2025
0.1.4 Feb 8, 2025
0.1.3 Jan 21, 2025

#837 in Text processing

[Download history chart: weekly downloads, 2025-01-07 through 2025-03-11]

289 downloads per month
Used in yake-rust

MIT license

63KB
1.5K SLoC

segtok

A rule-based sentence segmenter (splitter) and a word tokenizer based on orthographic features. Ported from the Python package of the same name (no longer maintained), with the contractions bug fixed.

use segtok::{segmenter::*, tokenizer::*};

fn main() {
    // Load the raw text to segment and tokenize.
    let input = include_str!("../tests/test_google.txt");

    // Split the text into sentences, then tokenize each sentence,
    // splitting contractions (e.g. "don't") into separate tokens.
    let sentences: Vec<Vec<_>> = split_multi(input, SegmentConfig::default())
        .into_iter()
        .map(|span| split_contractions(web_tokenizer(&span)).collect())
        .collect();
}
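
The snippet above reads a bundled test file; the same pipeline works on any inline string. A minimal sketch with a made-up sample sentence, using only the calls from the example above and assuming the collected token type implements Debug for printing:

use segtok::{segmenter::*, tokenizer::*};

fn main() {
    // Hypothetical sample text; any &str works as input.
    let input = "It's a test. Dr. Smith arrived at 3 p.m. yesterday.";

    // Segment into sentences, then tokenize each one.
    for sentence in split_multi(input, SegmentConfig::default()) {
        let tokens: Vec<_> = split_contractions(web_tokenizer(&sentence)).collect();
        // Assumes the token type implements Debug.
        println!("{:?}", tokens);
    }
}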


Dependencies

~2.9–4MB
~74K SLoC