4 stable releases
| Version | Release date |
|---|---|
| 2.0.0 | Feb 24, 2023 |
| 1.2.3 | May 9, 2022 |
| 1.2.1 | Mar 3, 2022 |
| 1.1.1 | Mar 1, 2022 |
| 0.1.0 | |
Ungoliant
🕷️ Ungoliant is a high-performance pipeline that provides tools to build corpus generation pipelines from CommonCrawl. 🕷️
It is currently the generation pipeline for the OSCAR corpus, which is built from CommonCrawl. Ungoliant replaces goclassy.
Installation
Installing/Compiling the binary
- Via cargo: cargo install ungoliant
- Via git: cargo install --git https://github.com/oscar-corpus/ungoliant
Ungoliant has numerous dependencies, which are compiled during installation. cmake and gcc may also be required, because the project uses fasttext-rs.
KenLM feature
The KenLM feature is optional because it relies on unsafe code that can break if the supplied model files are not correct.
To enable it, install KenLM requirements:
apt install -y libboost-all-dev libeigen3-dev
and use cargo install ungoliant --features kenlm, or cargo b --features kenlm if you're building from source.
Getting the language identification file (for fastText):
Download it with:
curl https://dl.fbaipublicfiles.com/fasttext/supervised-models/lid.176.bin -o lid.176.bin
Usage
The usual way of generating corpora is:
- Fetch the wet.paths.gz file from the last CommonCrawl dump and decompress it.
- Download the files using the download command.
- Generate the corpus using the pipeline command (it may take some time).
- Head on to oscar-tools for the packaging steps.
You can find more information in each command's --help output:
ungoliant 2
corpus generation tool.
USAGE:
    ungoliant <SUBCOMMAND>
FLAGS:
    -h, --help       Prints help information
    -V, --version    Prints version information
SUBCOMMANDS:
    download    Download a CommonCrawl release
    help        Prints this message or the help of the given subcommand(s)
    pipeline    Run pipeline
    rebuild     Rebuild the corpus for a given language.
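To make the first two steps of the workflow above more concrete, here is a minimal sketch of how the entries of a decompressed wet.paths file map to downloadable URLs. It is not Ungoliant's own implementation: the function name `wet_urls`, the constant `COMMONCRAWL_BASE`, and the sample paths are hypothetical, and the sketch assumes that each line of the file is a path relative to https://data.commoncrawl.org/.

```rust
/// Assumed base URL under which CommonCrawl serves the paths listed in `wet.paths`.
const COMMONCRAWL_BASE: &str = "https://data.commoncrawl.org/";

/// Turn the contents of a decompressed `wet.paths` file into download URLs.
/// Blank lines are skipped; every other line is treated as a relative path.
/// (Hypothetical helper for illustration, not part of Ungoliant.)
fn wet_urls(paths: &str) -> Vec<String> {
    paths
        .lines()
        .map(str::trim)
        .filter(|line| !line.is_empty())
        .map(|line| format!("{}{}", COMMONCRAWL_BASE, line.trim_start_matches('/')))
        .collect()
}

fn main() {
    // Made-up two-line excerpt; a real wet.paths file lists one shard per line.
    let sample = "crawl-data/EXAMPLE/segments/0/wet/part-00000.warc.wet.gz\n\
                  crawl-data/EXAMPLE/segments/0/wet/part-00001.warc.wet.gz\n";
    for url in wet_urls(sample) {
        println!("{}", url);
    }
}
```

In practice the download command takes care of fetching these files; the sketch only shows what the path listing represents.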
Documentation
Ungoliant is not yet on docs.rs: use cargo doc --bins --open to open the documentation.
Head on to OSCAR Documentation for more info about the project.