#pytorch #deep-learning #machine-learning #tokenizer #sentencizer

nnsplit

A tool to split text using a neural network. For sentence boundary detection, compound splitting and more.

27 releases

0.5.7 Mar 16, 2021
0.5.2 Nov 1, 2020
0.3.1 Jul 17, 2020
0.2.2 Feb 26, 2020

#46 in Machine learning

Download history (downloads per week):

  • 2020-12-28: 1
  • 2021-01-04: 24
  • 2021-01-18: 23
  • 2021-02-01: 27
  • 2021-02-08: 50
  • 2021-02-15: 357
  • 2021-02-22: 55
  • 2021-03-01: 43
  • 2021-03-08: 55
  • 2021-03-15: 20
  • 2021-03-22: 62
  • 2021-03-29: 71
  • 2021-04-05: 78

217 downloads per month

MIT license

30KB
675 lines

NNSplit

A tool to split text using a neural network. The main application is sentence boundary detection, but other tasks, e.g. compound splitting for German, are also supported.

Features

  • Robust: Does not rely on proper punctuation, spelling, or case. See the metrics.
  • Small: NNSplit uses a byte-level LSTM, so the weights are small (< 4 MB) and models can be trained for any language that can be encoded in Unicode.
  • Portable: NNSplit is written in Rust, with bindings for Rust, Python, and JavaScript (browser and Node.js). See the usage section for how to get started.
  • Fast: Up to 2x faster than spaCy sentencization. See the benchmark.
  • Multilingual: NNSplit currently has models for 7 languages (German, English, French, Norwegian, Swedish, Simplified Chinese, Turkish). Try them in the demo.
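
The "Small" point above rests on the model working at the byte level: any Unicode text becomes a sequence of byte values in the range 0 to 255, so the input vocabulary stays fixed regardless of language. The following is only a minimal illustration of that idea, assuming the input is UTF-8 encoded. The helper `byte_ids` is a hypothetical name and is not part of the NNSplit API, and this is not NNSplit's actual preprocessing code.

```rust
/// Hypothetical helper (not part of NNSplit): return the UTF-8 byte values
/// of `text`, i.e. the kind of input a byte-level model could consume.
fn byte_ids(text: &str) -> Vec<u8> {
    text.as_bytes().to_vec()
}

fn main() {
    // Sample strings in several scripts; every one maps onto the same
    // 256-value byte alphabet, so no language-specific vocabulary is needed.
    let samples = ["Hello world", "Grüß Gott", "你好世界", "Merhaba dünya"];

    for sample in samples.iter() {
        let ids = byte_ids(sample);
        println!(
            "{:?}: {} characters -> {} bytes",
            sample,
            sample.chars().count(),
            ids.len()
        );
    }
}
```

Non-ASCII characters take more than one byte each, so sequences get longer for some scripts, but the set of possible input values never grows.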

Documentation has moved to the NNSplit website: https://bminixhofer.github.io/nnsplit.

License

NNSplit is licensed under the MIT license.

Dependencies

~2.2–4.5MB
~91K SLoC