6 releases

0.2.3 Jun 24, 2023
0.2.2 Jun 24, 2023
0.1.1 Jun 5, 2023

#1313 in Text processing

30 downloads per month

MIT/Apache

390KB
14K SLoC

wordfreq-model

This crate provides a loader for pre-compiled wordfreq models, allowing you to easily create wordfreq instances for various languages.

Documentation

https://docs.rs/wordfreq-model/

Licensing

The source code is licensed under either of

at your option.

The model files are distributed here together with the credits.


lib.rs:

wordfreq-model

This crate provides a loader for pre-compiled wordfreq models, allowing you to easily create WordFreq instances for various languages.

Instructions

The provided models are the same as those distributed in the original Python package. See the original documentation for the supported languages and their sources.

You need to specify models you want to use with features. The feature names are in the form of large-xx or small-xx, where xx is the language code. For example, if you want to use the large-English and small-Japanese models, specify large-en and small-ja as follows:

# Cargo.toml

[dependencies.wordfreq-model]
version = "0.2"
features = ["large-en", "small-ja"]

There is no default feature. Be sure to specify features you want to use.

Examples

load_wordfreq can create a WordFreq instance from a ModelKind enum value. ModelKind will have the specified feature names in CamelCase, such as LargeEn or SmallJa.

By default, only ModelKind::ExampleEn appears for tests.

use approx::assert_relative_eq;
use wordfreq_model::load_wordfreq;
use wordfreq_model::ModelKind;

let wf = load_wordfreq(ModelKind::ExampleEn).unwrap();
assert_relative_eq!(wf.word_frequency("las"), 0.25);
assert_relative_eq!(wf.word_frequency("vegas"), 0.75);
assert_relative_eq!(wf.word_frequency("Las"), 0.25); // Standardized

Standardization

As the above example shows, the model automatically standardizes words before looking them up (i.e., Las is handled as las). This is done by an instance Standardizer set up in the WordFreq instance. load_wordfreq automatically sets up an appropriate Standardizer instance for each language.

Notes

This crate downloads specified model files and embeds the models directly into the source code. Specify as many models as you need to avoid extra downloads and bloating the resulting binary.

The actual model files to be used are placed here together with the credits. If you do not desire automatic model downloads and binary embedding, you can create instances from these files directly. See the instructions in [wordfreq].

Dependencies

~5–9MB
~180K SLoC