nlprule-build

This crate provides a builder to make it easier to use the correct binaries for nlprule. It also provides:

Utility functions to download the binaries from their distribution source.
Scripts to create the nlprule build directories.

Development

If you are using a development version of nlprule, the builder can build the binaries itself (instead of just fetching them):

let nlprule_builder = nlprule_build::BinaryBuilder::new(
    &["en"],
    std::env::var("OUT_DIR").expect("OUT_DIR is set when build.rs is running"),
)
// this specifies that the binaries should be built if they are not found
.fallback_to_build_dir(true)
.build()
.validate();

In that case, you should set

[profile.dev]
build-override = { opt-level = 2 }

in your Cargo.toml. Building can be slow otherwise.

The following sections describe how to acquire the nlprule build directories and how to build and test the nlprule binaries. As a user, you will typically not need to do this.

Building and testing the nlprule binaries

Building the nlprule binaries requires the build directory for the corresponding language. The latest build directories are stored on Backblaze B2. Download them from https://f000.backblazeb2.com/file/nlprule/en.zip (adjusting the two-letter language code accordingly for other languages).
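The language-code substitution in the URL above can be sketched in Rust as follows. This is only an illustration: `build_dir_url` is a hypothetical helper, not part of nlprule-build, and it assumes every language follows the same URL pattern as the English archive.

```rust
/// Hypothetical helper (not part of nlprule-build): returns the download URL
/// of the build directory archive for a two-letter language code.
/// Assumes all languages share the URL pattern of the English archive.
fn build_dir_url(lang_code: &str) -> String {
    format!("https://f000.backblazeb2.com/file/nlprule/{}.zip", lang_code)
}

fn main() {
    // Print the archive URL for each language mentioned in this README.
    for lang_code in &["en", "de", "es"] {
        println!("{}", build_dir_url(lang_code));
    }
}
```

The sketch only builds the URL string; downloading and unzipping the archive is left to whichever tool you prefer.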

See Making the build directory for information on how to create a build directory yourself.

The binaries can then be built with the compile target, e.g.:

RUST_LOG=INFO cargo run --all-features --bin compile -- \
    --build-dir data/en \
    --tokenizer-out storage/en_tokenizer.bin \
    --rules-out storage/en_rules.bin

This is expected to warn about errors in the Rules since not all grammar rules are supported but should not report any errors in the Tokenizer.
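To run the compile step for several languages, the invocation above could be assembled programmatically. The sketch below is not part of nlprule: `compile_command` is a hypothetical helper, and the `data/<lang>` and `storage/<lang>_*.bin` paths are an assumption extrapolated from the English example. It only constructs and prints the command; it does not run it.

```rust
use std::process::Command;

/// Hypothetical helper: assembles the `cargo run` invocation shown above
/// for a given language code. The path layout (`data/<lang>`,
/// `storage/<lang>_tokenizer.bin`, `storage/<lang>_rules.bin`) is an
/// assumption based on the English example.
fn compile_command(lang_code: &str) -> Command {
    let mut cmd = Command::new("cargo");
    cmd.env("RUST_LOG", "INFO")
        .args(["run", "--all-features", "--bin", "compile", "--"])
        .arg("--build-dir")
        .arg(format!("data/{}", lang_code))
        .arg("--tokenizer-out")
        .arg(format!("storage/{}_tokenizer.bin", lang_code))
        .arg("--rules-out")
        .arg(format!("storage/{}_rules.bin", lang_code));
    cmd
}

fn main() {
    // Only print the command; call `.status()` on it to actually run it.
    let cmd = compile_command("de");
    println!("{:?}", cmd);
}
```

Calling `.status()` on the returned `Command` from the repository root would run the build for that language, provided the corresponding build directory exists.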

The binaries contain their own tests. To test the tokenizer binary, run e.g.:

RUST_LOG=WARN cargo run --all-features --bin test_disambiguation -- --tokenizer storage/en_tokenizer.bin

To test the grammar rule binary, run e.g.:

RUST_LOG=WARN cargo run --all-features --bin test -- --tokenizer storage/en_tokenizer.bin --rules storage/en_rules.bin

Making the build directory

nlprule needs build files to build the rule and tokenizer binaries. These build files contain, for example, the XML files for grammar and disambiguation rules, a dictionary of words with their associated part-of-speech tags and lemmas, and some data used for optimizations. Collectively, they form the build directory. Each language has a separate build directory.

The build directory for a language can be generated with make_build_dir.py. Run python make_build_dir.py --help (or take a look at the source code) for more information.

Below are the commands used to make the build directories for nlprule's supported languages (of course, the paths need to be adjusted depending on your setup):

English

python build/make_build_dir.py \
    --lt_dir=$LT_PATH \
    --lang_code=en \
    --tag_dict_path=$LT_PATH/org/languagetool/resource/en/english.dict \
    --tag_info_path=$LT_PATH/org/languagetool/resource/en/english.info \
    --chunker_token_model=$HOME/Downloads/nlprule/en-token.bin \
    --chunker_pos_model=$HOME/Downloads/nlprule/en-pos-maxent.bin \
    --chunker_chunk_model=$HOME/Downloads/nlprule/en-chunker.bin \
    --out_dir=data/en

Chunker binaries can be downloaded from http://opennlp.sourceforge.net/models-1.5/.

German

python build/make_build_dir.py \
    --lt_dir=$LT_PATH \
    --lang_code=de \
    --tag_dict_path=$HOME/Downloads/nlprule/german-pos-dict/src/main/resources/org/languagetool/resource/de/german.dict \
    --tag_info_path=$HOME/Downloads/nlprule/german-pos-dict/src/main/resources/org/languagetool/resource/de/german.info \
    --out_dir=data/de

The POS dict can be downloaded from https://github.com/languagetool-org/german-pos-dict.

Spanish

python build/make_build_dir.py \
    --lt_dir=$LT_PATH \
    --lang_code=es \
    --tag_dict_path=$HOME/Downloads/nlprule/spanish-pos-dict/org/languagetool/resource/es/es-ES.dict \
    --tag_info_path=$HOME/Downloads/nlprule/spanish-pos-dict/org/languagetool/resource/es/es-ES.info \
    --out_dir=data/es

Note for Spanish: disambiguation.xml is currently manually postprocessed by removing an invalid <marker> in POS_N and changing one rule (commit). grammar.xml is manually postprocessed by fixing the match reference for EN_TORNO. These issues will be fixed in the next LanguageTool release.

The POS dict can be downloaded from https://mvnrepository.com/artifact/org.softcatala/spanish-pos-dict (download the latest version and unzip the .jar).
