#language #nlp

bin+lib whichlang

A blazingly fast and lightweight language detection library for Rust

1 unstable release

0.1.0 May 10, 2023

#187 in Internationalization (i18n)


1,898 downloads per month
Used in 2 crates

MIT license

745KB
4.5K SLoC

Whichlang

This is a language detection library, aiming for both precision and performance.

Features

  • No dependencies
  • Throughput above 100 MB/s for short and long strings.
  • Good accuracy: 99.5% on my validation dataset, though the result depends heavily on the length of your input.

How does it work?

It uses a multiclass logistic regression model over:

  • 2-, 3-, and 4-grams of ASCII letters
  • codepoint / 128
  • a slightly smarter projection of codepoints over a given class.

We use the hashing trick to project these features into a space of size 4_096.
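As a rough illustration of the hashing trick, here is a minimal sketch that maps letter n-grams and the `codepoint / 128` feature into 4_096 buckets. It is not whichlang's actual code: the FNV-1a hash, the `bucket` and `hashed_features` helpers, and the per-kind tag byte are all assumptions made for the example, and the third feature (the "smarter projection") is left out because its definition is not given here.

```rust
/// Size of the hashed feature space, as stated in the text.
const DIMENSION: usize = 4_096;

/// FNV-1a, used here only as a stand-in hash function (an assumption).
fn fnv1a(bytes: &[u8]) -> u32 {
    let mut hash: u32 = 0x811c_9dc5;
    for &b in bytes {
        hash ^= u32::from(b);
        hash = hash.wrapping_mul(0x0100_0193);
    }
    hash
}

/// Maps one raw feature to a bucket in `0..DIMENSION` (hypothetical helper).
/// `kind` is a tag byte so that different feature kinds hash differently.
fn bucket(kind: u8, payload: &[u8]) -> usize {
    let mut key = Vec::with_capacity(payload.len() + 1);
    key.push(kind);
    key.extend_from_slice(payload);
    fnv1a(&key) as usize % DIMENSION
}

/// Returns the hashed feature buckets for `text` (hypothetical helper).
fn hashed_features(text: &str) -> Vec<usize> {
    let mut features = Vec::new();

    // 2-, 3-, and 4-grams over each run of ASCII letters.
    // Non-letter bytes (including all non-ASCII UTF-8 bytes) end a run.
    for run in text.as_bytes().split(|b| !b.is_ascii_alphabetic()) {
        for n in 2..=4usize {
            for gram in run.windows(n) {
                features.push(bucket(n as u8, gram));
            }
        }
    }

    // Coarse codepoint feature: codepoint / 128 for each non-ASCII char.
    for c in text.chars().filter(|c| !c.is_ascii()) {
        let block = (c as u32) / 128;
        features.push(bucket(0, &block.to_le_bytes()));
    }

    features
}
```

With these assumptions, `hashed_features("abcd")` yields six buckets (three 2-grams, two 3-grams, one 4-gram). Distinct features may collide in the same bucket; the hashing trick accepts that in exchange for a fixed, small weight table.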

The logistic regression is trained in the attached Python notebook, and the trained model is used to generate weight.rs.
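To make the inference step concrete, here is a minimal sketch of multiclass logistic regression scoring over hashed feature buckets. The `Model` struct, its layout, and the toy sizes `NUM_BUCKETS` and `NUM_CLASSES` are hypothetical; the real layout of weight.rs is not described here, and whether features are counted, as in this sketch, or normalized is also an assumption.

```rust
/// Toy sizes for illustration only; the real feature space has 4_096 buckets.
const NUM_BUCKETS: usize = 4;
const NUM_CLASSES: usize = 2;

/// Hypothetical model: one weight row and one bias per class.
struct Model {
    weights: [[f32; NUM_BUCKETS]; NUM_CLASSES],
    bias: [f32; NUM_CLASSES],
}

impl Model {
    /// Computes one linear score (logit) per class by summing the weights
    /// of the active feature buckets on top of the class bias.
    fn scores(&self, features: &[usize]) -> [f32; NUM_CLASSES] {
        let mut scores = self.bias;
        for (class, score) in scores.iter_mut().enumerate() {
            for &f in features {
                *score += self.weights[class][f];
            }
        }
        scores
    }

    /// Returns the index of the highest-scoring class.
    /// Softmax is monotonic, so the argmax of the logits is enough
    /// when only the predicted class is needed.
    fn predict(&self, features: &[usize]) -> usize {
        let scores = self.scores(features);
        let mut best = 0;
        for class in 1..NUM_CLASSES {
            if scores[class] > scores[best] {
                best = class;
            }
        }
        best
    }
}
```

Prediction then costs one table lookup and one addition per feature per class, with no allocation.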
