
app wiki_corpus

Extract text from Wikipedia dumps (.bz2) and convert it to JSON Lines format

1 stable release

1.0.0 Nov 18, 2024


MIT license

22KB
247 lines

Wiki-Corpus

Extract text from a Wikipedia dump (.bz2) as JSON Lines.

Quick Start

  1. Install the wiki-corpus crate.

    cargo install wiki-corpus
    
  2. Prepare a Wikipedia dump.

    https://dumps.wikimedia.org/enwiki/latest/

    -> download enwiki-latest-pages-articles-multistream.xml.bz2

  3. Convert the .bz2 file.

    wiki-corpus --input <PATH/TO/enwiki-latest-pages-articles-multistream.xml.bz2>
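
Steps 2 and 3 can also be scripted. The following is a minimal sketch, not part of wiki-corpus itself. It assumes that curl and bzip2 are installed and that the dump file sits directly under the directory URL from step 2. The variable names and the fetch_dump helper are hypothetical. The actual download is left commented out because the English dump is very large.

```shell
# Assumption: the file name from step 2 is served directly under the "latest" directory.
DUMP_DIR_URL="https://dumps.wikimedia.org/enwiki/latest"
DUMP_FILE="enwiki-latest-pages-articles-multistream.xml.bz2"
DUMP_URL="${DUMP_DIR_URL}/${DUMP_FILE}"

# Hypothetical helper: download the dump into the current directory.
fetch_dump() {
    # -L follows redirects, -C - resumes a partial download, -o sets the output file name.
    curl -L -C - -o "$DUMP_FILE" "$DUMP_URL"
}

# Uncomment to download, check the archive's integrity, and convert it:
# fetch_dump && bzip2 -t "$DUMP_FILE" && wiki-corpus --input "$DUMP_FILE"
```

Running bzip2 -t before the conversion catches a truncated download early, before wiki-corpus starts a long run on the file.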

Releases

v1.0.0

  • First release.

Dependencies

~12–22MB
~295K SLoC