1 stable release: 1.0.0 (Nov 18, 2024)
#4 in #wikimedia
22KB
247 lines
Wiki-Corpus
Extracts article text from a Wikipedia dump (.bz2) and writes it as JSON Lines.
Quick Start
- Install the wiki-corpus crate.
  cargo install wiki-corpus
- Prepare a Wikipedia dump.
  From https://dumps.wikimedia.org/enwiki/latest/ download enwiki-latest-pages-articles-multistream.xml.bz2
- Convert the bz2 file.
  wiki-corpus --input <PATH/TO/enwiki-latest-pages-articles-multistream.xml.bz2>
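Since the converted output is JSON Lines, a quick way to check a run is to parse each line and see which fields the records carry. The following is a minimal sketch, not part of wiki-corpus: this README does not document the output file location or the field names, so the script makes no assumption about either. The helper names iter_jsonl and summarize_keys, and the script name in the usage comment, are hypothetical.

```python
import json
import sys
from collections import Counter
from typing import Any, Iterator, TextIO


def iter_jsonl(stream: TextIO) -> Iterator[Any]:
    """Yield one parsed JSON value per non-empty line of the stream."""
    for line_number, line in enumerate(stream, start=1):
        line = line.strip()
        if not line:
            continue  # tolerate blank lines between records
        try:
            yield json.loads(line)
        except json.JSONDecodeError as exc:
            raise ValueError(f"line {line_number}: invalid JSON: {exc}") from exc


def summarize_keys(stream: TextIO) -> Counter:
    """Count how often each top-level key appears across all object records."""
    counts: Counter = Counter()
    for record in iter_jsonl(stream):
        if isinstance(record, dict):
            counts.update(record.keys())
    return counts


if __name__ == "__main__":
    # Hypothetical usage: python inspect_jsonl.py <path to the JSON Lines output>
    with open(sys.argv[1], encoding="utf-8") as f:
        for key, n in summarize_keys(f).most_common():
            print(f"{key}\t{n}")
```

Reading the file line by line keeps memory use flat, which matters because a full enwiki conversion can produce a very large output file.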
Releases
v1.0.0
- First release.
Dependencies
~12–22MB
~295K SLoC