6 releases
Version | Released |
---|---|
0.2.11 | Sep 3, 2023 |
0.2.8 | Apr 22, 2023 |
0.1.0 | Mar 25, 2023 |
dom-content-extraction
A Rust implementation of the paper by Fei Sun, Dandan Song, and Lejian Liao:
Content Extraction via Text Density (CETD).
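To make the idea behind CETD concrete, here is a minimal sketch of the basic text density measure: the number of characters in a node's subtree divided by the number of tags in that subtree, with a zero tag count treated as 1. The `Node` type and the `counts`, `text_density`, and `demo` functions are hypothetical names invented for this illustration. They are not part of this crate's API, and the sketch leaves out the paper's refinements such as composite text density.

```rust
// Toy DOM node: an element with child nodes, or a text leaf.
// Hypothetical type for illustration; unrelated to the crate's `DensityTree`.
enum Node {
    Element(Vec<Node>),
    Text(String),
}

/// Returns (character count, tag count) for the subtree rooted at `node`.
/// For an element, the tag count covers its descendant elements only.
fn counts(node: &Node) -> (usize, usize) {
    match node {
        Node::Text(t) => (t.chars().count(), 0),
        Node::Element(children) => {
            let mut chars = 0;
            let mut tags = 0;
            for child in children {
                let (c, t) = counts(child);
                chars += c;
                // A child element contributes itself plus its own descendants.
                tags += t + if matches!(child, Node::Element(_)) { 1 } else { 0 };
            }
            (chars, tags)
        }
    }
}

/// Text density: characters per tag, treating a zero tag count as 1.
fn text_density(node: &Node) -> f64 {
    let (chars, tags) = counts(node);
    chars as f64 / tags.max(1) as f64
}

/// Builds an article-like node and a navigation-like node and returns
/// their densities as (article, navigation).
fn demo() -> (f64, f64) {
    // Two paragraphs holding 120 and 80 characters of text.
    let article = Node::Element(vec![
        Node::Element(vec![Node::Text("x".repeat(120))]),
        Node::Element(vec![Node::Text("x".repeat(80))]),
    ]);
    // Two list items, each wrapping a short link.
    let nav = Node::Element(vec![
        Node::Element(vec![Node::Element(vec![Node::Text("Home".to_string())])]),
        Node::Element(vec![Node::Element(vec![Node::Text("About".to_string())])]),
    ]);
    (text_density(&article), text_density(&nav))
}
```

In this toy input the article-like node has far more characters per tag than the navigation-like node, which is the signal the method uses to separate main content from surrounding markup.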
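```rust
use dom_content_extraction::{get_node_text, DensityTree};
use scraper::Html;
// Assumption: `html` is a string holding the page source.
let document = Html::parse_document(html);
let dtree = DensityTree::from_document(&document); // &scraper::Html
let sorted_nodes = dtree.sorted_nodes();
// Take the last node of the sorted list and print its text.
let node_id = sorted_nodes.last().unwrap().node_id;
println!("{}", get_node_text(node_id, &document));
```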