#tokenizer #normalize #language #segmenter

charabia

A simple library to detect the language, tokenize the text and normalize the tokens

7 releases (4 breaking)

0.7.1 Feb 20, 2023
0.7.0 Dec 15, 2022
0.6.0 Aug 22, 2022
0.5.1 Jul 5, 2022
0.3.0 Apr 28, 2022

#249 in Text processing

Download history

1870/week @ 2022-12-01
1593/week @ 2022-12-08
1624/week @ 2022-12-15
1446/week @ 2022-12-22
1375/week @ 2022-12-29
1497/week @ 2023-01-05
1584/week @ 2023-01-12
1627/week @ 2023-01-19
1622/week @ 2023-01-26
1535/week @ 2023-02-02
1890/week @ 2023-02-09
1866/week @ 2023-02-16
1848/week @ 2023-02-23
2054/week @ 2023-03-02
2276/week @ 2023-03-09
2236/week @ 2023-03-16

8,710 downloads per month
Used in 2 crates

MIT license

1MB
3.5K SLoC

Rust 3K SLoC // 0.1% comments
F* 563 SLoC // 0.7% comments
Shell 7 SLoC

The Charabia library tokenizes text by detecting its script and language, then segmenting, normalizing, and classifying it.
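To make the segment, normalize, classify sequence concrete, here is a minimal standard-library sketch of the idea. It is not Charabia's implementation: every name in it (`ToyKind`, `ToyToken`, `toy_segment`, `toy_normalize`, `toy_classify`, `toy_tokenize`) is hypothetical, it skips script and language detection entirely, and its normalization only lowercases, so it would not turn `Thé` into `the` as the crate does.

```rust
/// Coarse token class for this sketch (hypothetical, not charabia's type).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum ToyKind {
    Word,
    Separator,
}

/// Hypothetical token type for this sketch; not charabia's `Token`.
#[derive(Debug, Clone, PartialEq, Eq)]
struct ToyToken {
    lemma: String,
    kind: ToyKind,
}

/// Segment: split the text into maximal runs of alphanumeric characters
/// and maximal runs of everything else. Assumption: a real segmenter is
/// script-aware and far more careful than this.
fn toy_segment(text: &str) -> Vec<&str> {
    let mut segments = Vec::new();
    let mut start = 0;
    let mut prev: Option<bool> = None;
    for (idx, ch) in text.char_indices() {
        let is_alnum = ch.is_alphanumeric();
        if let Some(p) = prev {
            // A change of character class closes the current segment.
            if p != is_alnum {
                segments.push(&text[start..idx]);
                start = idx;
            }
        }
        prev = Some(is_alnum);
    }
    // Push the trailing segment, if the input was not empty.
    if start < text.len() {
        segments.push(&text[start..]);
    }
    segments
}

/// Normalize: lowercase only (no diacritic removal in this sketch).
fn toy_normalize(segment: &str) -> String {
    segment.to_lowercase()
}

/// Classify: a segment containing any alphanumeric character is a word,
/// anything else is a separator.
fn toy_classify(segment: &str) -> ToyKind {
    if segment.chars().any(char::is_alphanumeric) {
        ToyKind::Word
    } else {
        ToyKind::Separator
    }
}

/// Run the three steps in order over the whole text.
fn toy_tokenize(text: &str) -> Vec<ToyToken> {
    toy_segment(text)
        .into_iter()
        .map(|segment| ToyToken {
            lemma: toy_normalize(segment),
            kind: toy_classify(segment),
        })
        .collect()
}
```

Calling `toy_tokenize("The quick fox")` yields five tokens: the words `the`, `quick`, and `fox`, with a single-space separator between each pair.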

Examples

Tokenization

use charabia::Tokenize;

let orig = "Thé quick (\"brown\") fox can't jump 32.3 feet, right? Brr, it's 29.3°F!";

// tokenize the text.
let mut tokens = orig.tokenize();

let token = tokens.next().unwrap();
// the token's lemma is normalized: `Thé` became `the`.
assert_eq!(token.lemma(), "the");
// the token is classified as a word
assert!(token.is_word());

let token = tokens.next().unwrap();
assert_eq!(token.lemma(), " ");
// the token is classified as a separator
assert!(token.is_separator());

Segmentation

use charabia::Segment;

let orig = "The quick (\"brown\") fox can't jump 32.3 feet, right? Brr, it's 29.3°F!";

let mut segments = orig.segment_str();

assert_eq!(segments.next(), Some("The"));
assert_eq!(segments.next(), Some(" "));
assert_eq!(segments.next(), Some("quick"));

Build features

Charabia comes with default features that can be deactivated at compile time. These features add support for additional languages, each of which needs to download and/or build a specialized dictionary, which increases compilation time. They are listed in charabia's Cargo.toml and can be deactivated via dependency features.
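As a sketch of what deactivating these features might look like in a dependent crate's manifest, the fragment below uses Cargo's standard `default-features` key. The version shown is the latest release listed on this page; no feature names are given here because they should be taken from charabia's own Cargo.toml.

```toml
[dependencies]
# Turn off all default features, including the optional language supports.
charabia = { version = "0.7.1", default-features = false }
```

Individual language supports can then be re-enabled selectively by adding a `features = [...]` list to the same entry, using the feature names declared by the crate.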

Dependencies

~6–23MB
~174K SLoC