6 releases
Version | Date
---|---
0.3.1 | Sep 1, 2024
0.3.0 | Jan 7, 2024
0.2.2 | Sep 16, 2019
0.1.0 | Sep 14, 2019
#422 in Text processing
Used in tantivy-object-store
22KB
364 lines
wikidump
This crate processes MediaWiki XML dump files and turns them into easily consumed pieces of data for language analysis, natural language processing, and other applications.
Example
use wikidump::{config, Parser};

// Configure the parser for English Wikipedia dumps.
let parser = Parser::new()
    .use_config(config::wikipedia::english());
let site = parser
    .parse_file("tests/enwiki-articles-partial.xml")
    .expect("Could not parse Wikipedia dump file.");

assert_eq!(site.name, "Wikipedia");
assert_eq!(site.url, "https://en.wikipedia.org/wiki/Main_Page");
assert!(!site.pages.is_empty());

// Walk every parsed page and print the text of each revision.
for page in site.pages {
    println!("Title: {}", page.title);
    for revision in page.revisions {
        println!("\t{}", revision.text);
    }
}
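Once a dump is parsed, the revision text is plain string data that can feed any downstream pipeline. As a minimal sketch of the "language analysis" use case (the word-frequency logic here is illustrative and not part of wikidump; it assumes the same test dump path as above):

use std::collections::HashMap;
use wikidump::{config, Parser};

fn main() {
    // Parse the partial English Wikipedia dump shipped with the tests.
    let parser = Parser::new().use_config(config::wikipedia::english());
    let site = parser
        .parse_file("tests/enwiki-articles-partial.xml")
        .expect("Could not parse Wikipedia dump file.");

    // Illustrative analysis step: count how often each lowercase
    // whitespace-separated token appears across all revisions.
    let mut counts: HashMap<String, usize> = HashMap::new();
    for page in site.pages {
        for revision in page.revisions {
            for word in revision.text.split_whitespace() {
                *counts.entry(word.to_lowercase()).or_insert(0) += 1;
            }
        }
    }

    // Print the ten most frequent tokens.
    let mut sorted: Vec<_> = counts.into_iter().collect();
    sorted.sort_by(|a, b| b.1.cmp(&a.1));
    for (word, n) in sorted.into_iter().take(10) {
        println!("{}: {}", word, n);
    }
}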
lib.rs:
This crate can process MediaWiki dump (backup) files in XML format, allowing you to extract whatever data you desire.
Dependencies
~4MB
~67K SLoC