6 releases

| Version | Date |
|---|---|
| 0.3.1 | Sep 1, 2024 |
| 0.3.0 | Jan 7, 2024 |
| 0.2.2 | Sep 16, 2019 |
| 0.1.0 | Sep 14, 2019 |
#8 in #wikipedia
272 downloads per month
Used in tantivy-object-store
22 KB, 364 lines
wikidump

This crate processes MediaWiki XML dump (backup) files and turns them into easily consumed pieces of data for language analysis, natural language processing, and other applications.
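To use it, add the crate to your Cargo.toml; 0.3.1 is the latest release listed above:

```toml
[dependencies]
wikidump = "0.3.1"
```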
Example
```rust
use wikidump::{config, Parser};

// Parse a partial English Wikipedia dump using the bundled config.
let parser = Parser::new().use_config(config::wikipedia::english());
let site = parser
    .parse_file("tests/enwiki-articles-partial.xml")
    .expect("Could not parse wikipedia dump file.");

assert_eq!(site.name, "Wikipedia");
assert_eq!(site.url, "https://en.wikipedia.org/wiki/Main_Page");
assert!(!site.pages.is_empty());

// Walk every page and print the text of each of its revisions.
for page in site.pages {
    println!("\nTitle: {}", page.title);
    for revision in page.revisions {
        println!("\t{}", revision.text);
    }
}
```
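The parsed structures also work with ordinary iterator pipelines. Here is a minimal sketch, assuming `site.pages` and `page.revisions` are plain `Vec`s (as the loops above suggest) and using an arbitrary "Rust" keyword, that collects the titles of pages whose revision text mentions it:

```rust
use wikidump::{config, Parser};

fn main() {
    let parser = Parser::new().use_config(config::wikipedia::english());
    let site = parser
        .parse_file("tests/enwiki-articles-partial.xml")
        .expect("Could not parse wikipedia dump file.");

    // Keep the title of every page whose revision text mentions the keyword.
    // Only fields shown in the example above (pages, revisions, title, text)
    // are used here.
    let matching: Vec<String> = site
        .pages
        .into_iter()
        .filter(|page| page.revisions.iter().any(|rev| rev.text.contains("Rust")))
        .map(|page| page.title)
        .collect();

    println!("{} matching pages", matching.len());
}
```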
Dependencies
~4 MB, ~66K SLoC