6 releases

0.3.1 Sep 1, 2024
0.3.0 Jan 7, 2024
0.2.2 Sep 16, 2019
0.1.0 Sep 14, 2019

272 downloads per month
Used in tantivy-object-store

MIT license

22KB
364 lines

This crate can process MediaWiki dump (backup) files in XML format, allowing you to extract whatever data you desire.

Example

use wikidump::{config, Parser};

// Configure the parser for English Wikipedia dumps.
let parser = Parser::new().use_config(config::wikipedia::english());

// Parse a (partial) dump file, panicking with a message on failure.
let site = parser
    .parse_file("tests/enwiki-articles-partial.xml")
    .expect("Could not parse wikipedia dump file.");

assert_eq!(site.name, "Wikipedia");
assert_eq!(site.url, "https://en.wikipedia.org/wiki/Main_Page");
assert!(!site.pages.is_empty());

for page in site.pages {
    println!("\nTitle: {}", page.title);

    for revision in page.revisions {
        println!("\t{}", revision.text);
    }
}
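
In a real pipeline you may prefer not to panic on a missing or malformed dump. The sketch below assumes nothing beyond what the example shows: parse_file returns a Result, and its error type implements Debug (which the .expect call above already relies on).

use wikidump::{config, Parser};

fn main() {
    let parser = Parser::new().use_config(config::wikipedia::english());

    // Recover from a missing or malformed dump instead of panicking.
    let site = match parser.parse_file("tests/enwiki-articles-partial.xml") {
        Ok(site) => site,
        Err(e) => {
            eprintln!("Could not parse wikipedia dump file: {:?}", e);
            return;
        }
    };

    println!("Parsed {} pages from {}", site.pages.len(), site.name);
}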

wikidump

This crate processes MediaWiki XML dump files and turns them into easily consumed pieces of data for language analysis, natural language processing, and other applications.

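
Because the parser returns plain data (a site holding pages, each holding revisions of text), feeding it into a language-analysis pipeline is mostly iteration. Here is a minimal sketch, assuming only the fields shown in the example above (site.pages, page.revisions, revision.text); the flattening and empty-text filter are illustrative choices, not part of the crate's API.

use wikidump::{config, Parser};

fn main() {
    let parser = Parser::new().use_config(config::wikipedia::english());
    let site = parser
        .parse_file("tests/enwiki-articles-partial.xml")
        .expect("Could not parse wikipedia dump file.");

    // Flatten every revision's text into one corpus for downstream NLP.
    // Only the fields demonstrated above are used here.
    let corpus: Vec<String> = site
        .pages
        .into_iter()
        .flat_map(|page| page.revisions)
        .map(|revision| revision.text)
        .filter(|text| !text.is_empty())
        .collect();

    println!("Collected {} revision texts.", corpus.len());
}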

Dependencies

~4MB
~66K SLoC