#word #tokenizer #segmenter #split #python-packages

segtok

Sentence segmentation and word tokenization tools

3 releases

new 0.1.2 Jan 19, 2025
0.1.1 Jan 17, 2025
0.1.0 Jan 10, 2025

#1052 in Text processing

Download history: 97/week @ 2025-01-05, 114/week @ 2025-01-12

211 downloads per month

MIT license

59KB
1K SLoC

segtok

A rule-based sentence segmenter (splitter) and word tokenizer that rely on orthographic features. Ported from the (unmaintained) Python package, with a fix for its contractions bug.
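
To illustrate what "rule-based" and "orthographic features" can mean in practice, here is a minimal Python sketch. It is not segtok's actual rule set or API: the names `ABBREVIATIONS`, `split_sentences`, and `tokenize` are hypothetical, and the rules are deliberately simplistic. The splitter treats terminal punctuation followed by whitespace and an uppercase letter as a candidate boundary, then vetoes it when the preceding word is a known abbreviation. The tokenizer keeps apostrophe contractions such as "isn't" as single tokens.

```python
import re

# Hypothetical abbreviation list for illustration only;
# a real segmenter would need a far larger, curated set.
ABBREVIATIONS = {"dr", "mr", "mrs", "ms", "prof", "e.g", "i.e", "etc"}

# Candidate boundary: terminal punctuation, whitespace,
# then an uppercase letter (an orthographic cue).
_BOUNDARY = re.compile(r"([.!?])\s+(?=[A-Z])")

# A token is a word (optionally with apostrophe-joined parts,
# so contractions stay whole) or a single punctuation mark.
_TOKEN = re.compile(r"\w+(?:'\w+)*|[^\w\s]")


def split_sentences(text):
    """Split text into sentences using the toy rules above."""
    sentences = []
    start = 0
    for match in _BOUNDARY.finditer(text):
        end = match.end(1)  # index just past the punctuation mark
        words = text[start:end - 1].split()
        last_word = words[-1].lower() if words else ""
        # Veto the boundary if a period follows a known abbreviation.
        if match.group(1) == "." and last_word in ABBREVIATIONS:
            continue
        sentences.append(text[start:end].strip())
        start = match.end()
    tail = text[start:].strip()
    if tail:
        sentences.append(tail)
    return sentences


def tokenize(sentence):
    """Split a sentence into word and punctuation tokens."""
    return _TOKEN.findall(sentence)
```

With these toy rules, `split_sentences("Dr. Smith isn't here. He left at noon!")` yields two sentences, and the period after "Dr" is not treated as a boundary. A production segmenter has to handle many more cases (initials, numbers, quotes, ellipses) than this sketch covers.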

Dependencies

~2.9–4MB
~75K SLoC