dom-content-extraction

A Rust implementation of the paper "Content Extraction via Text Density"

6 releases

0.2.11 Sep 3, 2023
0.2.8 Apr 22, 2023
0.1.0 Mar 25, 2023

MPL-2.0 license


A Rust implementation of the paper by Fei Sun, Dandan Song, and Lejian Liao:

Content Extraction via Text Density (CETD)

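use dom_content_extraction::{DensityTree, get_node_text};
use scraper::Html;

// `html` is assumed to be a &str holding the raw page, loaded elsewhere.
let document = Html::parse_document(html);

// Build the text-density tree over the parsed DOM.
let dtree = DensityTree::from_document(&document); // takes &scraper::Html
// The last sorted node is taken as the main content node.
let sorted_nodes = dtree.sorted_nodes();
let node_id = sorted_nodes.last().unwrap().node_id;

println!("{}", get_node_text(node_id, &document));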
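
To give a feel for the metric behind the crate, here is a minimal sketch of the basic text density measure as I read it from the CETD paper: the number of characters in a node's subtree divided by the number of tags in that subtree, with a tag count of zero treated as one. The `Node` type and the `leaf` helper are hypothetical and exist only for this illustration; they are not part of the `dom_content_extraction` API, and the crate's actual computation may differ (the paper also defines a composite density that accounts for link text, which is not shown here).

```rust
/// A toy DOM node: its own text plus child elements.
/// Hypothetical type for illustration only; not part of dom_content_extraction.
struct Node {
    text: String,
    children: Vec<Node>,
}

impl Node {
    /// Total number of text characters in this node's subtree.
    fn char_count(&self) -> usize {
        self.text.chars().count()
            + self.children.iter().map(Node::char_count).sum::<usize>()
    }

    /// Number of descendant tags (elements) below this node.
    fn tag_count(&self) -> usize {
        self.children.iter().map(|c| 1 + c.tag_count()).sum()
    }

    /// Basic text density: characters per tag.
    /// A tag count of zero is treated as one to avoid division by zero.
    fn text_density(&self) -> f64 {
        let tags = self.tag_count().max(1);
        self.char_count() as f64 / tags as f64
    }
}

/// Builds a node that holds text and has no children (hypothetical helper).
fn leaf(text: &str) -> Node {
    Node {
        text: text.to_string(),
        children: Vec::new(),
    }
}
```

Under this sketch, a navigation-like node with three short children such as `leaf("Home")`, `leaf("About")`, and `leaf("Blog")` spreads very little text over several tags and scores low, while a long paragraph held in a single leaf scores high. That contrast is the intuition behind picking a high-density node as the main content.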

Full API documentation is available on docs.rs.
