

Fast, robust sentence splitting with bindings for Python, Rust and Javascript

20 unstable releases (3 breaking)

new 0.4.12 Sep 22, 2020
0.4.11 Sep 21, 2020
0.3.4 Sep 5, 2020
0.3.1 Jul 17, 2020
0.1.0 Feb 10, 2020

MIT license




  • Robust: Does not depend on proper punctuation and casing to split text into sentences.
  • Small: NNSplit uses a byte-level LSTM, so the weights are very small, which makes it easy to run in the browser.
  • Portable: Models are trained in Python, but inference can be done from Javascript, Rust and Python.
  • Fast: Can run on your GPU to split 10k short texts in less than 400ms in Colab. See train.ipynb.

Pretrained models

NNSplit comes with pretrained models. They were evaluated on the OPUS Open Subtitles dataset by concatenating 2 to 4 sentences and measuring the fraction of concatenations that are split completely correctly.

See evaluate.ipynb for details.
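
The metric described above can be approximated in a few lines of Python. This is a minimal sketch, not the code from evaluate.ipynb: the helper names (make_concatenations, exact_split_accuracy, naive_period_splitter) are hypothetical, joining sentences with a single space is an assumption, and the period-based splitter is only a stand-in for a real model.

```python
import random
from typing import Callable, List, Sequence


def make_concatenations(
    sentences: Sequence[str],
    rng: random.Random,
    min_len: int = 2,
    max_len: int = 4,
) -> List[List[str]]:
    """Group consecutive sentences into chunks of min_len..max_len sentences."""
    groups = []
    i = 0
    while i < len(sentences):
        n = rng.randint(min_len, max_len)
        group = list(sentences[i:i + n])
        # Drop a trailing chunk that is too short to count as a concatenation.
        if len(group) >= min_len:
            groups.append(group)
        i += n
    return groups


def exact_split_accuracy(
    groups: Sequence[Sequence[str]],
    split_fn: Callable[[str], List[str]],
) -> float:
    """Fraction of concatenations whose predicted split equals the reference."""
    if not groups:
        return 0.0
    correct = 0
    for group in groups:
        text = " ".join(group)  # assumption: sentences are joined by one space
        predicted = [part.strip() for part in split_fn(text)]
        if predicted == list(group):
            correct += 1
    return correct / len(groups)


def naive_period_splitter(text: str) -> List[str]:
    """Stand-in baseline (not NNSplit): end a sentence after every period."""
    parts = []
    current = ""
    for ch in text:
        current += ch
        if ch == ".":
            parts.append(current)
            current = ""
    if current.strip():
        parts.append(current)
    return parts
```

To score NNSplit itself, split_fn could wrap the Python API shown in the usage section, for example a function that returns [str(s) for s in splitter.split([text])[0]]. A concatenation counts as correct only if every boundary is right, so a single missed or extra split makes the whole example wrong.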


Condition                     NNSplit   Spacy (Tagger)  Spacy (Sentencizer)
Clean                         0.754371  0.853603        0.820934
Partial punctuation           0.485907  0.517829        0.249753
Partial case                  0.761754  0.825119        0.819679
Partial punctuation and case  0.443704  0.458619        0.249873
No punctuation and case       0.166273  0.180859        0.00463281


Condition                     NNSplit   Spacy (Tagger)  Spacy (Sentencizer)
Clean                         0.818902  0.833368        0.878471
Partial punctuation           0.463999  0.426458        0.266312
Partial case                  0.823565  0.792839        0.876678
Partial punctuation and case  0.447231  0.377201        0.26697
No punctuation and case       0.198165  0.0952267       0.00756195
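
The row labels above are not defined in this README; evaluate.ipynb is the authoritative source. One plausible reading is that punctuation marks are dropped and uppercase letters are lowercased, each with some probability. The sketch below illustrates that reading only: the function name perturb and both probability parameters are hypothetical and not part of NNSplit.

```python
import random
import string


def perturb(
    text: str,
    rng: random.Random,
    p_drop_punct: float = 0.0,
    p_lower: float = 0.0,
) -> str:
    """Randomly drop punctuation and lowercase capitals (illustrative only)."""
    out = []
    for ch in text:
        # Drop ASCII punctuation with probability p_drop_punct.
        if ch in string.punctuation and rng.random() < p_drop_punct:
            continue
        # Lowercase an uppercase letter with probability p_lower.
        if ch.isupper() and rng.random() < p_lower:
            ch = ch.lower()
        out.append(ch)
    return "".join(out)
```

Under this reading, "Clean" would correspond to both probabilities set to 0, the "Partial" rows to values strictly between 0 and 1, and "No punctuation and case" to both set to 1.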

Python Usage


NNSplit has onnxruntime as the only dependency.

Install NNSplit with pip: pip install nnsplit

To enable GPU support, install onnxruntime-gpu: pip install onnxruntime-gpu.


from nnsplit import NNSplit
splitter = NNSplit.load("en")

# returns `Split` objects
splits = splitter.split(["This is a test This is another test."])[0]

# a `Split` can be iterated over to yield smaller splits or stringified with `str(...)`.
for sentence in splits:
    print(sentence)

Javascript Usage


The Javascript bindings for NNSplit have tractjs as the only dependency.

Install them with npm: npm install nnsplit


The Javascript API has no .load(model_name) method to load a pretrained model. Instead, the path to a model in your file system (in Node.js) or accessible via fetch (in the browser) has to be given as the first argument to NNSplit.new. See models to download the model.onnx files for the pretrained models.


const nnsplit = require("nnsplit");

async function run() {
    const splitter = await nnsplit.NNSplit.new("path/to/model.onnx");

    let splits = (await splitter.split(["This is a test This is another test."]))[0];
    console.log(splits.parts.map((x) => x.text)); // to log sentences, or x.parts to get the smaller subsplits
}

run();

NNSplit in the browser currently only works with a bundler and has to be imported asynchronously. The API is the same as in Node.js. See bindings/javascript/dev_server for a full example.

Rust Usage


Add NNSplit as a dependency to your Cargo.toml:

# ...

[dependencies.nnsplit]
version = "<version>"
features = ["model-loader", "tract-backend"] # to automatically download pretrained models and to use tract for inference, respectively

# ...


fn main() -> Result<(), Box<dyn std::error::Error>> {
    let splitter =
        nnsplit::NNSplit::load("en", nnsplit::NNSplitOptions::default())?;

    let input: Vec<&str> = vec!["This is a test This is another test."];
    let splits = &splitter.split(&input)[0];

    for sentence in splits.iter() {
        println!("{}", sentence.text());
    }

    Ok(())
}

