1 stable release: 1.0.0 (Nov 18, 2024)
Wiki-Corpus
Extracts article text from a Wikipedia dump (.bz2) and writes it as JSON Lines.
Quick Start
- Install the wiki-corpus crate:
    cargo install wiki-corpus
- Prepare a Wikipedia dump. From https://dumps.wikimedia.org/enwiki/latest/, download enwiki-latest-pages-articles-multistream.xml.bz2.
- Convert the bz2 file:
    wiki-corpus --input <PATH/TO/enwiki-latest-pages-articles-multistream.xml.bz2>
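
Once the conversion has run, the result can be consumed like any other JSON Lines file: one JSON value per line. The following is a minimal Python sketch of such a reader, not part of wiki-corpus itself. This README does not state where the tool writes its output or which fields each record carries, so the file name `output.jsonl` in the usage note is hypothetical, and the sketch deliberately avoids assuming any field names.

```python
import json
from typing import Any, Iterator


def iter_json_lines(path: str) -> Iterator[Any]:
    """Yield one parsed JSON value per non-empty line of a JSON Lines file."""
    with open(path, encoding="utf-8") as handle:
        for line_number, line in enumerate(handle, start=1):
            line = line.strip()
            if not line:
                continue  # tolerate blank lines between records
            try:
                yield json.loads(line)
            except json.JSONDecodeError as err:
                # Report the offending line so a damaged record is easy to find.
                raise ValueError(f"{path}:{line_number}: invalid JSON: {err}") from err


def count_records(path: str) -> int:
    """Count records without holding the whole file in memory."""
    return sum(1 for _ in iter_json_lines(path))
```

For example, `count_records("output.jsonl")` (hypothetical path) would return the number of extracted records. Reading line by line keeps memory use flat, which matters because a full English Wikipedia extract is large.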
Releases
v1.0.0
- First release.