#pytorch #deep-learning #machine-learning #tokenizer #sentencizer

nnsplit

Fast, robust sentence splitting with bindings for Python, Rust and JavaScript

20 unstable releases (3 breaking)

new 0.4.12 Sep 22, 2020
0.4.11 Sep 21, 2020
0.3.4 Sep 5, 2020
0.3.1 Jul 17, 2020
0.1.0 Feb 10, 2020

MIT license

26KB
569 lines

NNSplit

Fast, robust sentence splitting with bindings for Python, Rust and JavaScript.

Features

  • Robust: Does not depend on proper punctuation and casing to split text into sentences.
  • Small: NNSplit uses a byte-level LSTM, so the weights are very small, which makes it easy to run in the browser.
  • Portable: Models are trained in Python, but inference can be done from JavaScript, Rust and Python.
  • Fast: Can run on your GPU to split 10k short texts in less than 400 ms in Colab. See train.ipynb.
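
As a rough illustration of why a byte-level model can stay small: when text is fed in as UTF-8 bytes, the input vocabulary never exceeds 256 values, whatever the language. The sketch below only shows that general idea. The helper name `to_byte_ids` is hypothetical, and this README does not specify how NNSplit itself encodes its input.

```python
from typing import List


def to_byte_ids(text: str) -> List[int]:
    """Encode text as UTF-8 and return one integer in 0..255 per byte."""
    return list(text.encode("utf-8"))


# ASCII characters map to a single byte each.
print(to_byte_ids("Hi."))  # [72, 105, 46]

# Non-ASCII characters become several bytes, but each byte is still < 256.
print(to_byte_ids("ä"))    # [195, 164]
```

Because every possible input maps into the same 256 byte values, the input side of such a model needs no per-language vocabulary table.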

Pretrained models

NNSplit comes with pretrained models. They were evaluated on the OPUS Open Subtitles dataset by concatenating 2 to 4 sentences at a time. The reported score is the number of concatenations that are split completely correctly divided by the total number of concatenations.

See evaluate.ipynb for details.
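
The metric described above can be sketched as follows. This is a minimal illustration, not the code from evaluate.ipynb: the names `exact_split_accuracy` and `naive_split` are hypothetical, joining with a single space is an assumption, and the toy splitter stands in for a real model.

```python
from typing import Callable, List


def exact_split_accuracy(
    groups: List[List[str]],
    split_fn: Callable[[str], List[str]],
    sep: str = " ",  # assumption: sentences are joined with a single space
) -> float:
    """Fraction of concatenations that split_fn splits back into exactly
    the original sentences (partially correct splits count as wrong)."""
    if not groups:
        return 0.0
    correct = 0
    for group in groups:
        predicted = [s.strip() for s in split_fn(sep.join(group))]
        if predicted == group:
            correct += 1
    return correct / len(groups)


def naive_split(text: str) -> List[str]:
    """Toy baseline (not NNSplit): split after every period."""
    return [part.strip() + "." for part in text.split(".") if part.strip()]


groups = [
    ["This is a test.", "This is another test."],  # recovered exactly
    ["Is it?", "Yes."],                            # "?" boundary is missed
]
print(exact_split_accuracy(groups, naive_split))  # 0.5
```

Since a concatenation only counts when every boundary is right, this measure is strict: one missed or spurious boundary makes the whole concatenation count as wrong.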

en

                              NNSplit    spaCy (Tagger)  spaCy (Sentencizer)
Clean                         0.754371   0.853603        0.820934
Partial punctuation           0.485907   0.517829        0.249753
Partial case                  0.761754   0.825119        0.819679
Partial punctuation and case  0.443704   0.458619        0.249873
No punctuation and case       0.166273   0.180859        0.00463281

de

                              NNSplit    spaCy (Tagger)  spaCy (Sentencizer)
Clean                         0.818902   0.833368        0.878471
Partial punctuation           0.463999   0.426458        0.266312
Partial case                  0.823565   0.792839        0.876678
Partial punctuation and case  0.447231   0.377201        0.26697
No punctuation and case       0.198165   0.0952267       0.00756195

Python Usage

Installation

NNSplit has onnxruntime as the only dependency.

Install NNSplit with pip: pip install nnsplit

To enable GPU support, install onnxruntime-gpu: pip install onnxruntime-gpu.

Usage

from nnsplit import NNSplit
splitter = NNSplit.load("en")

# returns `Split` objects
splits = splitter.split(["This is a test This is another test."])[0]

# a `Split` can be iterated over to yield smaller splits or stringified with `str(...)`.
for sentence in splits:
    print(sentence)

JavaScript Usage

Installation

The JavaScript bindings for NNSplit have tractjs as the only dependency.

Install them with npm: npm install nnsplit

Usage

The JavaScript API has no .load(model_name) method for loading a pretrained model. Instead, the path to a model on your file system (in Node.js) or accessible via fetch (in the browser) has to be passed as the first argument to NNSplit.new. See models to download the model.onnx files for the pretrained models.

Node.js

const nnsplit = require("nnsplit");

async function run() {
    const splitter = await nnsplit.NNSplit.new("path/to/model.onnx");

    // `split` takes an array of texts and resolves to one result per text
    const splits = (await splitter.split(["This is a test This is another test."]))[0];
    // log the sentences; use x.parts instead of x.text to get the smaller subsplits
    console.log(splits.parts.map((x) => x.text));
}

run().catch(console.error); // report a failed model load or split

Browser

NNSplit in the browser currently only works with a bundler and has to be imported asynchronously. The API is the same as in Node.js. See bindings/javascript/dev_server for a full example.

Rust Usage

Installation

Add NNSplit as a dependency to your Cargo.toml:

# ...

[dependencies.nnsplit]
version = "<version>"
features = ["model-loader", "tract-backend"] # to automatically download pretrained models and to use tract for inference, respectively

# ...

Usage

fn main() -> Result<(), Box<dyn std::error::Error>> {
    let splitter =
        nnsplit::NNSplit::load("en", nnsplit::NNSplitOptions::default())?;

    let input: Vec<&str> = vec!["This is a test This is another test."];
    let splits = &splitter.split(&input)[0];

    for sentence in splits.iter() {
        println!("{}", sentence.text());
    }

    Ok(())
}
