#language #word #split #text #match

bin+lib alphabet_detector

Natural language alphabet detection library

7 releases (4 breaking)

new 0.5.0 Apr 22, 2025
0.4.1 Apr 21, 2025
0.3.1 Apr 17, 2025
0.2.1 Mar 2, 2025
0.1.0 Jan 31, 2025

#622 in Text processing

384 downloads per month
Used in 2 crates

MIT/Apache

305KB
7K SLoC

Alphabet Detector

Crate API

Detects 401 alphabets of 324 languages in 170 scripts

A language can be written in more than one script; each combination is detected as a separate ScriptLanguage (language + script).

It does not use any models; it only matches alphabets. It is not recommended as a standalone detector: it works more like a word separator plus language prefilter for an actual language detector (Langram).

Splits text (a CharIndices iterator) into words and detects the ScriptLanguages (language + script) of each word from the letters (chars) it uses.
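
To make the idea concrete, here is a minimal, self-contained sketch of alphabet matching. It does not use the crate's API: `ToyScriptLanguage`, `ALPHABETS`, `split_words` and `detect` are hypothetical names invented for illustration, and the three toy alphabets are deliberately incomplete.

```rust
use std::str::CharIndices;

/// Hypothetical stand-in for a language + script pair (not the crate's type).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum ToyScriptLanguage {
    EnglishLatin,
    GermanLatin,
    RussianCyrillic,
}

/// Toy alphabets, assumed for this sketch only.
const ALPHABETS: &[(ToyScriptLanguage, &str)] = &[
    (ToyScriptLanguage::EnglishLatin, "abcdefghijklmnopqrstuvwxyz"),
    (ToyScriptLanguage::GermanLatin, "abcdefghijklmnopqrstuvwxyzäöüß"),
    (ToyScriptLanguage::RussianCyrillic, "абвгдеёжзийклмнопрстуфхцчшщъыьэюя"),
];

/// Splits text into (byte offset, lowercased word) pairs,
/// treating every non-alphabetic char as a separator.
pub fn split_words(chars: CharIndices<'_>) -> Vec<(usize, Vec<char>)> {
    let mut words = Vec::new();
    let mut current: Vec<char> = Vec::new();
    let mut start = 0;
    for (idx, ch) in chars {
        if ch.is_alphabetic() {
            if current.is_empty() {
                start = idx; // remember where this word begins
            }
            current.extend(ch.to_lowercase());
        } else if !current.is_empty() {
            words.push((start, std::mem::take(&mut current)));
        }
    }
    if !current.is_empty() {
        words.push((start, current));
    }
    words
}

/// Returns every toy language whose alphabet contains all chars of the word.
pub fn detect(word: &[char]) -> Vec<ToyScriptLanguage> {
    ALPHABETS
        .iter()
        .filter(|(_, alphabet)| word.iter().all(|c| alphabet.contains(*c)))
        .map(|(lang, _)| *lang)
        .collect()
}
```

In this sketch, a word such as "hello" matches both the English and the German toy alphabet, which illustrates why alphabet matching alone works as a prefilter and not as a full detector.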

Examples

To split text into an iterator of WordLang:

let word_iter = words::from_ch_ind::<Vec<char>>(text.char_indices());

If you don't need individual words, but just want to analyze a full text:

let (all_words, all_langs) = fulltext_filter_with_margin_sorted::<Vec<char>, 95>(text.char_indices());

This returns all Words of the text (Vec<Word<Vec<char>>>) and a Vec<(ScriptLanguage, u32)> filtered with a margin of error of less than 5%.
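
The exact filtering rule is not spelled out here. One plausible reading, offered only as an assumption, is that a language is kept when its count reaches a given percentage of the best count. The sketch below implements that reading with a const generic threshold; `filter_with_margin_sorted` is a hypothetical helper and not the crate's function.

```rust
/// Keeps entries whose count is at least PERCENT% of the highest count,
/// sorted by count in descending order.
/// The rule itself is an assumption made for this sketch.
pub fn filter_with_margin_sorted<L, const PERCENT: u32>(
    mut counts: Vec<(L, u32)>,
) -> Vec<(L, u32)> {
    let max = counts.iter().map(|&(_, n)| n).max().unwrap_or(0);
    // Compare in u64 integer arithmetic to avoid floats and overflow.
    counts.retain(|&(_, n)| u64::from(n) * 100 >= u64::from(max) * u64::from(PERCENT));
    // Stable sort, highest count first.
    counts.sort_by(|a, b| b.1.cmp(&a.1));
    counts
}
```

With a threshold of 95, counts of 100, 96 and 40 would keep the first two entries and drop the third.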

Instead of Vec<char>, you can use other types to hold words.
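
As an illustration of how a splitter can be generic over the word type, the sketch below accepts any buffer that implements `Default + Extend<char>`. That bound and the `collect_words` helper are assumptions made for this example; the trait the crate actually requires may differ.

```rust
/// Collects alphabetic runs of `text` into any buffer type `B`.
/// The bound `Default + Extend<char>` is an assumption for this sketch.
pub fn collect_words<B: Default + Extend<char>>(text: &str) -> Vec<B> {
    let mut words = Vec::new();
    let mut current: Option<B> = None;
    for (_, ch) in text.char_indices() {
        if ch.is_alphabetic() {
            // Start a new buffer lazily, then append the char.
            current
                .get_or_insert_with(B::default)
                .extend(std::iter::once(ch));
        } else if let Some(word) = current.take() {
            words.push(word);
        }
    }
    // Flush the last word, if any.
    words.extend(current);
    words
}
```

Calling `collect_words::<String>(text)` or `collect_words::<Vec<char>>(text)` then yields the same words in different containers.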

Extras

See alphabets.rs for the languages whose alphabets are already defined. Some of them still need validation.

Warning: the detector can return words containing chars from the Unicode Private Use Area (for example in Lingala, Nuer, or Yoruba), because chars are normalized (composed with Inherited-script marks) and Unicode defines no chars for those compositions.
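
If downstream code needs to notice such words, a check against the Unicode Private Use ranges is enough. The helpers below are a hypothetical sketch using only the standard library; `is_private_use` and `has_private_use` are not part of the crate.

```rust
/// True if `c` lies in one of the three Unicode Private Use Areas:
/// U+E000..=U+F8FF, U+F0000..=U+FFFFD, U+100000..=U+10FFFD.
pub fn is_private_use(c: char) -> bool {
    matches!(
        c,
        '\u{E000}'..='\u{F8FF}' | '\u{F0000}'..='\u{FFFFD}' | '\u{100000}'..='\u{10FFFD}'
    )
}

/// True if any char of the word is a Private Use char.
pub fn has_private_use(word: &[char]) -> bool {
    word.iter().copied().any(is_private_use)
}
```

Words flagged this way can be handled separately, for example before the text is displayed or stored.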

Dependencies

~3–5MB
~84K SLoC