5 releases

0.2.1 Jun 24, 2022
0.2.0 Jun 11, 2022
0.1.2 Nov 20, 2021
0.1.1 Nov 20, 2021
0.1.0 Nov 20, 2021

#77 in Parser tooling

129 downloads per month

MIT license

145KB
3K SLoC

Skyscraper - HTML scraping with XPath

Rust library to scrape HTML documents with XPath expressions.

HTML Parsing

Skyscraper ships its own HTML parser. The parser produces a document tree that you can traverse manually by following each node's parent and child relationships.

Example: Simple HTML Parsing

use skyscraper::html::{self, parse::ParseError};

fn main() -> Result<(), ParseError> {
    let html_text = r##"
<html>
    <body>
        <div>Hello world</div>
    </body>
</html>"##;

    // Parse the HTML text into a document tree
    let _document = html::parse(html_text)?;
    Ok(())
}

Example: Traversing Parent/Child Relationships

use skyscraper::html::{self, parse::ParseError, DocumentNode};

fn main() -> Result<(), ParseError> {
    // Parse the HTML text into a document
    let text = r#"<parent><child/><child/></parent>"#;
    let document = html::parse(text)?;

    // Get the children of the root node
    let parent_node: DocumentNode = document.root_node;
    let children: Vec<DocumentNode> = parent_node.children(&document).collect();
    assert_eq!(2, children.len());

    // Get the parent of both child nodes
    let parent_of_child0: DocumentNode = children[0].parent(&document).expect("parent of child 0 missing");
    let parent_of_child1: DocumentNode = children[1].parent(&document).expect("parent of child 1 missing");

    // Both children report the root node as their parent
    assert_eq!(parent_node, parent_of_child0);
    assert_eq!(parent_node, parent_of_child1);
    Ok(())
}
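
The traversal calls above pass `&document` to every method, which suggests that a node value is a lightweight handle into storage owned by the document. The sketch below shows one common way to build such a tree, an index-based arena, using only the standard library. Every name in it (`Tree`, `NodeId`, `with_root`, `add_child`, `demo`) is hypothetical; it illustrates the general pattern and is not Skyscraper's actual implementation.

```rust
// Hypothetical illustration only: none of these types come from Skyscraper.

/// A lightweight, copyable handle: just an index into the tree's node list.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
struct NodeId(usize);

struct Node {
    name: String,
    parent: Option<NodeId>,
    children: Vec<NodeId>,
}

/// Owns every node. A handle means nothing without the tree it indexes.
struct Tree {
    nodes: Vec<Node>,
}

impl Tree {
    /// Create a tree containing only a root node and return its handle.
    fn with_root(name: &str) -> (Tree, NodeId) {
        let root = Node { name: name.to_string(), parent: None, children: Vec::new() };
        (Tree { nodes: vec![root] }, NodeId(0))
    }

    /// Append a new node under `parent` and return the new node's handle.
    fn add_child(&mut self, parent: NodeId, name: &str) -> NodeId {
        let id = NodeId(self.nodes.len());
        self.nodes.push(Node { name: name.to_string(), parent: Some(parent), children: Vec::new() });
        self.nodes[parent.0].children.push(id);
        id
    }
}

impl NodeId {
    fn name(self, tree: &Tree) -> &str {
        &tree.nodes[self.0].name
    }

    /// The root has no parent, so this returns `None` for it.
    fn parent(self, tree: &Tree) -> Option<NodeId> {
        tree.nodes[self.0].parent
    }

    fn children<'a>(self, tree: &'a Tree) -> impl Iterator<Item = NodeId> + 'a {
        tree.nodes[self.0].children.iter().copied()
    }
}

/// Mirrors the <parent><child/><child/></parent> example.
fn demo() {
    let (mut tree, root) = Tree::with_root("parent");
    tree.add_child(root, "child");
    tree.add_child(root, "child");

    let children: Vec<NodeId> = root.children(&tree).collect();
    assert_eq!(2, children.len());
    assert_eq!("child", children[0].name(&tree));
    assert_eq!(Some(root), children[0].parent(&tree));
    assert_eq!(Some(root), children[1].parent(&tree));
}
```

Calling `demo()` from `main` runs the same checks as the Skyscraper example. Because handles are plain indices, they are cheap to copy and compare, which is why two lookups of the same parent can be tested with `assert_eq!`.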

XPath Expressions

Skyscraper can parse XPath expressions and apply them to a parsed HTML document.

use skyscraper::{html, xpath};
use std::error::Error;

fn main() -> Result<(), Box<dyn Error>> {
    // Parse the HTML text into a document.
    let html_text = r##"
<div>
    <div class="foo">
        <span>yes</span>
    </div>
    <div class="bar">
        <span>no</span>
    </div>
</div>
"##;
    let document = html::parse(html_text)?;

    // Parse the XPath expression and apply it to the document.
    let expr = xpath::parse("//div[@class='foo']/span")?;
    let results = expr.apply(&document)?;
    assert_eq!(1, results.len());

    // Get the text from the matching node
    let text = results[0].get_text(&document).expect("text missing");
    assert_eq!("yes", text);
    Ok(())
}
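
To make the meaning of `//div[@class='foo']/span` concrete, the sketch below evaluates the same three steps by hand over a tiny tree: visit every element in the document (`//`), keep the `div` elements whose `class` attribute equals `foo`, then take their direct `span` children. It uses only the standard library. The `Element` type and the functions `sample`, `descendants_or_self`, and `spans_in_foo_divs` are hypothetical names for illustration and say nothing about how Skyscraper evaluates expressions internally.

```rust
// Hypothetical illustration only: this is not Skyscraper's node type or evaluator.
struct Element {
    tag: &'static str,
    attrs: Vec<(&'static str, &'static str)>,
    text: &'static str,
    children: Vec<Element>,
}

/// Build the same structure as the HTML in the example above.
fn sample() -> Element {
    let span = |text| Element { tag: "span", attrs: vec![], text, children: vec![] };
    Element {
        tag: "div",
        attrs: vec![],
        text: "",
        children: vec![
            Element { tag: "div", attrs: vec![("class", "foo")], text: "", children: vec![span("yes")] },
            Element { tag: "div", attrs: vec![("class", "bar")], text: "", children: vec![span("no")] },
        ],
    }
}

/// Collect `root` and all of its descendants in document order (the `//` step).
fn descendants_or_self<'a>(root: &'a Element, out: &mut Vec<&'a Element>) {
    out.push(root);
    for child in &root.children {
        descendants_or_self(child, out);
    }
}

/// Hand-written equivalent of `//div[@class='foo']/span`.
fn spans_in_foo_divs(root: &Element) -> Vec<&Element> {
    let mut all = Vec::new();
    descendants_or_self(root, &mut all);
    all.into_iter()
        // div[@class='foo']: name test plus attribute predicate
        .filter(|e| e.tag == "div" && e.attrs.iter().any(|&(k, v)| k == "class" && v == "foo"))
        // /span: direct children only, filtered by name
        .flat_map(|div| div.children.iter())
        .filter(|child| child.tag == "span")
        .collect()
}
```

Applied to `sample()`, `spans_in_foo_divs` returns a single element whose `text` is `"yes"`, matching the result of the XPath example. The `span` under the `bar` div is skipped because its parent fails the attribute predicate.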

Dependencies

~0.7–1.3MB
~27K SLoC