4 releases

0.1.4 Jul 17, 2023
0.1.3 Mar 1, 2022
0.1.2 Feb 11, 2022
0.1.1 Feb 11, 2022
0.1.0 Feb 10, 2022

#1067 in Text processing

22 downloads per month

MIT license

26KB
464 lines

cutters

A rule-based sentence segmentation library.


🚧 This library is experimental. 🚧

Features

  • Full UTF-8 support.
  • Robust parsing.
  • Language-specific rules (each defined by its own PEG).
  • Fast and memory-efficient parsing via the pest library.
  • Sentences can contain quotes which can contain subsentences.

Bindings

Besides native Rust, bindings for the following programming languages are available:

Supported languages

  • Croatian (standard)
  • English (standard)

There is also an additional Baseline "language" that simply splits the text on sentence terminals as defined by Unicode. It is intended for benchmarking.
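To illustrate why language-specific rules matter, here is a minimal sketch of terminal-only splitting in the spirit of the Baseline mode. It is not the crate's implementation: the `baseline_split` function and the three-character `TERMINALS` set are assumptions made for this example, and the real Baseline rules may recognise a different set of characters.

```rust
/// Hypothetical terminal set, for illustration only. The crate's Baseline
/// rules may recognise a different (larger) set of characters.
const TERMINALS: [char; 3] = ['.', '!', '?'];

/// Splits `text` right after every terminal character and returns the
/// trimmed, non-empty pieces as slices borrowed from the input.
fn baseline_split(text: &str) -> Vec<&str> {
    let mut sentences = Vec::new();
    let mut start = 0;
    for (idx, ch) in text.char_indices() {
        if TERMINALS.contains(&ch) {
            // `idx` is a byte offset, so advance by the char's UTF-8 length.
            let end = idx + ch.len_utf8();
            let piece = text[start..end].trim();
            if !piece.is_empty() {
                sentences.push(piece);
            }
            start = end;
        }
    }
    // Keep any trailing text that has no terminal.
    let rest = text[start..].trim();
    if !rest.is_empty() {
        sentences.push(rest);
    }
    sentences
}

fn main() {
    let text = "Volim rock, punk, funk, pop itd. To je prof.dr.sc. Ivan Horvat.";
    for sentence in baseline_split(text) {
        println!("{sentence}");
    }
}
```

With this naive approach the abbreviation "prof.dr.sc." is cut into three fragments, which is the kind of over-segmentation that per-language rules are meant to avoid.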

Example

After adding the cutters dependency to your Cargo.toml file, usage is simple.

fn main() {
    let text = r#"Petar Krešimir IV. je vladao od 1058. do 1074. St. Louis 9LX je događaj u svijetu šaha. To je prof.dr.sc. Ivan Horvat. Volim rock, punk, funk, pop itd. Tolstoj je napisao: "Sve sretne obitelji nalik su jedna na drugu. Svaka nesretna obitelj nesretna je na svoj način.""#;

    // Segment the text using the Croatian rule set.
    let sentences = cutters::cut(text, cutters::Language::Croatian);

    println!("{:#?}", sentences);
}

This results in the following output (note that each str field is a &str, not an owned String).

[
    Sentence {
        str: "Petar Krešimir IV. je vladao od 1058. do 1074. ",
        quotes: [],
    },
    Sentence {
        str: "St. Louis 9LX je događaj u svijetu šaha.",
        quotes: [],
    },
    Sentence {
        str: "To je prof.dr.sc. Ivan Horvat.",
        quotes: [],
    },
    Sentence {
        str: "Volim rock, punk, funk, pop itd.",
        quotes: [],
    },
    Sentence {
        str: "Tolstoj je napisao: \"Sve sretne obitelji nalik su jedna na drugu. Svaka nesretna obitelj nesretna je na svoj način.\"",
        quotes: [
            Quote {
                str: "Sve sretne obitelji nalik su jedna na drugu. Svaka nesretna obitelj nesretna je na svoj način.",
                sentences: [
                    "Sve sretne obitelji nalik su jedna na drugu.",
                    "Svaka nesretna obitelj nesretna je na svoj način.",
                ],
            },
        ],
    },
]
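The nested result can be walked with ordinary loops. The sketch below is self-contained and therefore uses local stand-in types that only mirror the field names visible in the debug output (`str`, `quotes`, `sentences`). These stand-ins and the `flatten` helper are assumptions for illustration, not the crate's own definitions. With the real crate you would iterate over the value returned by `cutters::cut` instead.

```rust
// Local stand-ins mirroring the field names shown in the debug output.
// They are NOT the crate's definitions.
#[derive(Debug)]
struct Quote<'a> {
    str: &'a str,
    sentences: Vec<&'a str>,
}

#[derive(Debug)]
struct Sentence<'a> {
    str: &'a str,
    quotes: Vec<Quote<'a>>,
}

/// Hypothetical helper: lists every sentence in order, each followed
/// directly by the subsentences of the quotes it contains.
fn flatten<'a>(sentences: &[Sentence<'a>]) -> Vec<&'a str> {
    let mut out = Vec::new();
    for sentence in sentences {
        // Trim because a sentence may carry trailing whitespace.
        out.push(sentence.str.trim());
        for quote in &sentence.quotes {
            out.extend(quote.sentences.iter().copied());
        }
    }
    out
}

fn main() {
    let sentences = vec![
        Sentence {
            str: "To je prof.dr.sc. Ivan Horvat.",
            quotes: vec![],
        },
        Sentence {
            str: "Tolstoj je napisao: \"Sve sretne obitelji nalik su jedna na drugu.\"",
            quotes: vec![Quote {
                str: "Sve sretne obitelji nalik su jedna na drugu.",
                sentences: vec!["Sve sretne obitelji nalik su jedna na drugu."],
            }],
        },
    ];

    for sentence in &sentences {
        println!("{}", sentence.str);
        for quote in &sentence.quotes {
            println!("  quote: {}", quote.str);
        }
    }
    println!("{:#?}", flatten(&sentences));
}
```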

Dependencies

~3MB
~57K SLoC