7 releases (4 breaking)
Version | Released |
---|---|
0.7.1 | Feb 20, 2023 |
0.7.0 | Dec 15, 2022 |
0.6.0 | Aug 22, 2022 |
0.5.1 | Jul 5, 2022 |
0.3.0 | Apr 28, 2022 |
#249 in Text processing
8,710 downloads per month
Used in 2 crates
1MB
3.5K SLoC
The Charabia library tokenizes text by detecting its script and language, then segmenting, normalizing, and classifying it.
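To make the segment, normalize, and classify stages concrete, here is a minimal standard-library sketch of the idea. It is not Charabia's implementation: the names `ToyToken`, `Kind`, `kind_of`, `make_token`, and `toy_tokenize` are hypothetical, script and language detection is skipped entirely, and normalization is reduced to lowercasing (so, unlike Charabia, it would not turn `Thé` into `the`).

```rust
/// Coarse token class: a word or a separator.
#[derive(Debug, PartialEq, Clone, Copy)]
enum Kind {
    Word,
    Separator,
}

/// A segment of the original text together with its normalized form.
#[derive(Debug, PartialEq)]
struct ToyToken {
    original: String,
    normalized: String,
    kind: Kind,
}

/// Classify one character: alphanumerics belong to words, everything else separates.
fn kind_of(c: char) -> Kind {
    if c.is_alphanumeric() {
        Kind::Word
    } else {
        Kind::Separator
    }
}

/// Normalize a finished segment (lowercasing only) and attach its class.
fn make_token(original: String, kind: Kind) -> ToyToken {
    let normalized = original.to_lowercase();
    ToyToken { original, normalized, kind }
}

/// Segment `text` into maximal runs of same-class characters,
/// then normalize and classify each run.
fn toy_tokenize(text: &str) -> Vec<ToyToken> {
    let mut tokens = Vec::new();
    let mut current = String::new();
    let mut current_kind: Option<Kind> = None;

    for c in text.chars() {
        let k = kind_of(c);
        // A change of class closes the current segment.
        if current_kind != Some(k) && !current.is_empty() {
            tokens.push(make_token(std::mem::take(&mut current), current_kind.unwrap()));
        }
        current_kind = Some(k);
        current.push(c);
    }
    // Flush the trailing segment, if any.
    if let Some(k) = current_kind {
        if !current.is_empty() {
            tokens.push(make_token(current, k));
        }
    }
    tokens
}

fn main() {
    for t in toy_tokenize("The quick fox") {
        println!("{:?} {:?} -> {:?}", t.kind, t.original, t.normalized);
    }
}
```

A real tokenizer has to handle cases this sketch gets wrong, such as contractions (`can't`), decimal numbers (`32.3`), and scripts that do not separate words with spaces, which is why Charabia relies on script detection and specialized dictionaries.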
Examples
Tokenization
use charabia::Tokenize;
let orig = "Thé quick (\"brown\") fox can't jump 32.3 feet, right? Brr, it's 29.3°F!";
// tokenize the text.
let mut tokens = orig.tokenize();
let token = tokens.next().unwrap();
// the token's lemma is normalized: `Thé` becomes `the`.
assert_eq!(token.lemma(), "the");
// the token is classified as a word
assert!(token.is_word());
let token = tokens.next().unwrap();
assert_eq!(token.lemma(), " ");
// the token is classified as a separator
assert!(token.is_separator());
Segmentation
use charabia::Segment;
let orig = "The quick (\"brown\") fox can't jump 32.3 feet, right? Brr, it's 29.3°F!";
let mut segments = orig.segment_str();
assert_eq!(segments.next(), Some("The"));
assert_eq!(segments.next(), Some(" "));
assert_eq!(segments.next(), Some("quick"));
Build features
Charabia comes with default features that can be deactivated at compile time.
These features add support for additional languages, each of which needs to download and/or build a specialized dictionary, which increases compilation time.
They are listed in charabia's Cargo.toml
and can be deactivated via dependency features.
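As a rough illustration, a dependent crate could opt out of all default features with Cargo's standard `default-features` key. The version shown is simply the latest release listed above, and which features this actually disables depends on charabia's Cargo.toml, which is not reproduced here.

```toml
[dependencies]
# Disable every default feature, including the dictionary-backed language supports.
charabia = { version = "0.7.1", default-features = false }
```

Individual language supports can then be re-enabled by adding a `features = [...]` list to the same entry, using the feature names declared in charabia's Cargo.toml.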
Dependencies
~6–23MB
~174K SLoC