#wikipedia #text #extract #dump #convert #bz2 #json-lines #bz2-and-convert

wiki_corpus_grammar

Extract text from Wikipedia dumps (.bz2) and convert it to JSONLines format

1 stable release

1.0.0 Nov 18, 2024

#6 in #json-lines


Used in 2 crates (via wiki_corpus_parser)

MIT license

65KB
1.5K SLoC

Wiki-Corpus

Extract text as JSON Lines from a Wikipedia dump (.bz2).

Quick Start

  1. Install the wiki-corpus crate.

    cargo install wiki-corpus
    
  2. Prepare a Wikipedia dump.

    https://dumps.wikimedia.org/enwiki/latest/

    -> download enwiki-latest-pages-articles-multistream.xml.bz2

  3. Convert the .bz2 file.

    wiki-corpus --input <PATH/TO/enwiki-latest-pages-articles-multistream.xml.bz2>
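
After the conversion finishes, a quick sanity check is to count the records that were written. The following is a minimal sketch, not part of wiki-corpus itself: it relies only on the JSON Lines convention of one JSON value per line, so it does not assume any particular field names. The function name `count_jsonl_records` is hypothetical, and the output location is not stated in this README, so the caller has to supply the reader.

```rust
use std::io::{self, BufRead};

/// Counts records in a JSON Lines stream.
/// JSON Lines stores one JSON value per line, so every non-blank line
/// is treated as one record. The line content is not parsed or validated.
/// `count_jsonl_records` is a hypothetical helper, not a wiki-corpus API.
pub fn count_jsonl_records<R: BufRead>(reader: R) -> io::Result<usize> {
    let mut count = 0;
    for line in reader.lines() {
        // Propagate I/O errors (including invalid UTF-8) to the caller.
        if !line?.trim().is_empty() {
            count += 1;
        }
    }
    Ok(count)
}
```

To run it against a real file, open the output with `std::fs::File::open` and wrap it in `std::io::BufReader::new` before passing it in. Reading line by line keeps memory use flat, which matters because a converted full English Wikipedia dump is large.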

Releases

v1.0.0

  • First release.

Dependencies

~7–12MB
~205K SLoC