#pure #port #language #py3langid #language-identifier

py3langid_rs

A high-performance, pure Rust port of py3langid

1 unstable release

0.1.0 Feb 3, 2025

#83 in #pure

Download history 35/week @ 2025-01-28 71/week @ 2025-02-04

106 downloads per month

MIT license

735KB
175 lines

py3langid_rs

A high-performance, pure Rust implementation of language identification, ported from the Python library py3langid.

[!NOTE] This implementation contains minimum functionalities. It lacks server, probability normalization, language subset.

Usage

Add to your project:

cargo add py3langid_rs

Example usage:

use py3langid_rs::LanguageIdentifier; 

fn main() {
    let li = LanguageIdentifier::new();
    println("{:?}", li.classify("This text is in English."));
}

Code above should print ("en", -56.77429).

Performance

AMD Ryzen 9 5950X, rustc 1.84.0 (9fc6b4312 2025-01-07), Ubuntu 22.04 in WSL 2.3.26.0, Windows 11 23H2.

Implementation Lang Slope Median Mean Std. Dev. Speed up (Slope)
py3langid_rs en 29.153 µs 29.158 µs 29.169 µs 213.92 ns 20.954x
py3langid en 610.884 µs 658.544 µs 610.884 µs 161.042 µs 1.0x
py3langid_rs zh 14.521 µs 14.476 µs 14.502 µs 56.782 ns 31.296x
py3langid zh 454.454 µs 489.616 µs 454.454 µs 75.018 µs 1.0x
py3langid_rs jp 20.472 µs 20.415 µs 20.464 µs 149.43 ns 33.969x
py3langid jp 695.421 µs 747.794 µs 695.421 µs 114.144 µs 1.0x

Using custom model

In case you need to convert your own model pickle...

The converted model is uploaded to git, thus normally you don't have to do this. Only do this when there's a model update in the upstream, or you have a customly trained model.


There's no easy way to directly load the original pickle. Thus, we must convert the pickle first.

Set up environment

I'm using uv here due to it's super fast speed, you can also use other package managers.

uv venv
uv sync

Run conversion script

uv run convert_pkl.py path/to/your/model.plzma path/to/output/folder

This would automatically create/overwrite file model.bin in the output folder. Then in rust, load like this:

use py3langid_rs::LanguageIdentifier; 

fn main() {
    let li = LanguageIdentifier::from_lzma_file("path/to/output/folder/model.bin").unwrap();
    println("{:?}", li.classify("This text is in English."));
}

Dependencies