4 releases (breaking)

0.4.0 Nov 8, 2024
0.3.0 Nov 4, 2024
0.2.0 Oct 23, 2024
0.1.0 Oct 22, 2024

Custom license

23KB
665 lines

scraper_query

scraper_query is a simple tool for querying elements in HTML documents parsed with scraper. It makes it easy to perform the small HTML manipulations that are common in web crawling, web scraping, and data cleaning.

Usage

use scraper::Html;
use scraper_query::*; // brings `HTMLIndex`, `Tag`, `class`, and `id` into scope
use markup5ever::interface::tree_builder::TreeSink; // provides `remove_from_parent`

// `HTML` is assumed to be a `&str` holding the document source
let mut document = Html::parse_document(HTML);
let index = HTMLIndex::new(&document);
// find all nodes that have both class "foo" and class "bar"
let node_ids = index.query(class("foo") & class("bar"));
// find all nodes with id "foo"
let node_ids = index.query(id("foo"));
// find all nodes with tag "h1" and class "foo"
let node_ids = index.query(Tag::H1 & class("foo")); // same as `Tag::H1.and(class("foo"))`
// find all nodes with tag "h1" that do not have class "foo"
let node_ids = index.query(Tag::H1 & (!class("foo")));
// simple manipulation: detach every node matched by the last query from its parent
for id in node_ids {
    document.remove_from_parent(&id);
}
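
The `&` and `!` operators in the queries above compose smaller predicates into larger ones. As a rough illustration of how such composition can work in Rust, here is a minimal, standard-library-only sketch. It is not scraper_query's implementation: the `Node` struct, the `Query` enum, and the `select` function are hypothetical names invented for this example, and it matches against a flat list of toy nodes rather than a parsed `scraper` document.

```rust
use std::ops::{BitAnd, Not};

/// A toy element (hypothetical): a tag name, an optional id, and a class list.
#[derive(Debug)]
struct Node {
    tag: &'static str,
    id: Option<&'static str>,
    classes: Vec<&'static str>,
}

/// A hypothetical query tree; each variant is one kind of predicate.
#[derive(Debug, Clone)]
enum Query {
    Tag(&'static str),
    Class(&'static str),
    Id(&'static str),
    And(Box<Query>, Box<Query>),
    Not(Box<Query>),
}

/// `a & b` builds a query that matches only when both sides match.
impl BitAnd for Query {
    type Output = Query;
    fn bitand(self, rhs: Query) -> Query {
        Query::And(Box::new(self), Box::new(rhs))
    }
}

/// `!a` builds a query that matches exactly when `a` does not.
impl Not for Query {
    type Output = Query;
    fn not(self) -> Query {
        Query::Not(Box::new(self))
    }
}

impl Query {
    /// Evaluate the query tree against a single node.
    fn matches(&self, node: &Node) -> bool {
        match self {
            Query::Tag(t) => node.tag == *t,
            Query::Class(c) => node.classes.contains(c),
            Query::Id(i) => node.id == Some(*i),
            Query::And(a, b) => a.matches(node) && b.matches(node),
            Query::Not(q) => !q.matches(node),
        }
    }
}

/// Return the positions of all nodes that satisfy the query.
fn select(nodes: &[Node], query: &Query) -> Vec<usize> {
    nodes
        .iter()
        .enumerate()
        .filter(|(_, node)| query.matches(node))
        .map(|(position, _)| position)
        .collect()
}
```

With this sketch, `select(&nodes, &(Query::Tag("h1") & !Query::Class("foo")))` plays the role of `index.query(Tag::H1 & (!class("foo")))` in the usage example. Building a query tree first and evaluating it later keeps the operators cheap and lets the same query be reused across documents.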

License

MIT