29 releases

0.5.9	Mar 5, 2023
0.5.8	Jul 23, 2021
0.5.7	Mar 16, 2021
0.5.2	Nov 1, 2020
0.2.2	Feb 26, 2020

#934 in Machine learning

58 downloads per month

MIT license

33KB
665 lines

NNSplit

A tool to split text using a neural network. The main application is sentence boundary detection, but other tasks, such as compound splitting for German, are also supported.

Features

Robust: Not reliant on proper punctuation, spelling and case. See the metrics.
Small: NNSplit uses a byte-level LSTM, so weights are small (< 4MB) and models can be trained for every language that can be encoded in Unicode.
Portable: NNSplit is written in Rust with bindings for Rust, Python, and JavaScript (Browser and Node.js). See how to get started in the usage section.
Fast: Up to 2x faster than spaCy sentencization. See the benchmark.
Multilingual: NNSplit currently has models for 9 different languages (German, English, French, Norwegian, Swedish, Simplified Chinese, Turkish, Russian and Ukrainian). Try them in the demo.
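
To make the byte-level idea from the feature list concrete, here is a minimal Python sketch. It is not NNSplit's implementation or API: the function names `to_byte_ids` and `split_at_boundaries` are hypothetical, and the boundary probabilities are hand-written stand-ins for what a trained model would predict. It only illustrates why a byte-level input alphabet (256 values) covers any language, and how per-byte boundary scores could be turned into splits without relying on punctuation.

```python
from typing import List, Sequence


def to_byte_ids(text: str) -> List[int]:
    """Encode text as UTF-8 byte values (0-255), the input alphabet of a byte-level model."""
    return list(text.encode("utf-8"))


def split_at_boundaries(
    text: str, boundary_probs: Sequence[float], threshold: float = 0.5
) -> List[str]:
    """Split text after every byte whose boundary score exceeds the threshold.

    boundary_probs holds one score per UTF-8 byte. In a real system these
    would come from the model; here they are supplied by the caller
    (an assumption made for illustration).
    """
    data = text.encode("utf-8")
    if len(boundary_probs) != len(data):
        raise ValueError("expected one probability per UTF-8 byte")

    pieces: List[str] = []
    start = 0
    for i, prob in enumerate(boundary_probs):
        # Never cut inside a multi-byte character: the next byte must not
        # be a UTF-8 continuation byte (bit pattern 10xxxxxx).
        at_char_end = i + 1 == len(data) or (data[i + 1] & 0xC0) != 0x80
        if prob > threshold and at_char_end:
            pieces.append(data[start : i + 1].decode("utf-8"))
            start = i + 1
    if start < len(data):
        # Keep any trailing text that was not closed by a boundary.
        pieces.append(data[start:].decode("utf-8"))
    return pieces


if __name__ == "__main__":
    text = "This is a test This is another test."
    probs = [0.0] * len(text.encode("utf-8"))
    probs[14] = 0.9  # hand-picked score after "This is a test "
    probs[-1] = 0.9  # hand-picked score at the end of the text
    for piece in split_at_boundaries(text, probs):
        print(repr(piece))
```

Because the split positions come from scores rather than from punctuation rules, the example input is split correctly even though no period separates the two sentences. The pieces concatenate back to the original text, since nothing is dropped at a boundary.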

Documentation has moved to the NNSplit website: https://bminixhofer.github.io/nnsplit.

License

NNSplit is licensed under the MIT license.

Dependencies

~2–13MB
~180K SLoC