#web #scraping

html-extractor

A Rust crate for extracting data from HTML

7 releases (1 stable)

1.0.0 Aug 8, 2020
0.4.0 Apr 30, 2020
0.3.0 Apr 30, 2020
0.2.1 Apr 29, 2020
0.1.1 Apr 27, 2020

#1429 in Web programming

Download history: between 24 and 65 downloads per week from 2023-10-22 to 2024-02-04

131 downloads per month
Used in 4 crates

MIT license

26KB
277 lines


Examples

Extracting a simple value from HTML

use html_extractor::{html_extractor, HtmlExtractor};
html_extractor! {
    #[derive(Debug, PartialEq)]
    Foo {
        // Parse the text of the element matching "#foo" as a usize.
        foo: usize = (text of "#foo"),
    }
}

fn main() {
    let input = r#"
        <div id="foo">1</div>
    "#;
    let foo = Foo::extract_from_str(input).unwrap();
    assert_eq!(foo, Foo { foo: 1 });
}

Extracting a collection from HTML

use html_extractor::{html_extractor, HtmlExtractor};
html_extractor! {
    #[derive(Debug, PartialEq)]
    Foo {
        // Collect the parsed text of every element matching ".foo" into a Vec.
        foo: Vec<usize> = (text of ".foo", collect),
    }
}

fn main() {
    let input = r#"
        <div class="foo">1</div>
        <div class="foo">2</div>
        <div class="foo">3</div>
        <div class="foo">4</div>
    "#;
    let foo = Foo::extract_from_str(input).unwrap();
    assert_eq!(foo, Foo { foo: vec![1, 2, 3, 4] });
}
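
To make the collect step concrete, here is a rough std-only sketch of what gathering the matched texts into a Vec<usize> amounts to. It is not the crate's implementation: collect_usize and its texts parameter are hypothetical names standing in for the text of each element matching ".foo", and failing on the first unparseable value is an assumption rather than documented behavior.

```rust
use std::num::ParseIntError;

/// Hypothetical helper (not part of html-extractor): parse the text of
/// every matched node into a `Vec<usize>`.
fn collect_usize(texts: &[&str]) -> Result<Vec<usize>, ParseIntError> {
    texts
        .iter()
        // Trim surrounding whitespace before handing the text to `FromStr`.
        .map(|t| t.trim().parse::<usize>())
        // Collecting into `Result<Vec<_>, _>` stops at the first parse error.
        .collect()
}
```

Calling collect_usize(&["1", "2", "3", "4"]) mirrors the four div elements in the example above.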

Extracting with regex

use html_extractor::{html_extractor, HtmlExtractor};
html_extractor! {
    #[derive(Debug, PartialEq)]
    Foo {
        // The regex's first capture group is parsed into `foo`.
        (foo: usize,) = (text of "#foo", capture with "^foo=(.*)$"),
    }
}

fn main() {
    let input = r#"
        <div id="foo">foo=1</div>
    "#;
    let foo = Foo::extract_from_str(input).unwrap();
    assert_eq!(foo, Foo { foo: 1 });
}
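
For readers unfamiliar with capture groups, the following std-only sketch approximates what the pattern "^foo=(.*)$" does for single-line text: it keeps whatever follows the literal prefix "foo=" and parses it. The function capture_foo is a hypothetical stand-in that swaps the regex for str::strip_prefix; the crate itself uses a real regular expression.

```rust
/// Hypothetical std-only analogue of `capture with "^foo=(.*)$"`:
/// returns the parsed capture, or `None` if the text does not match
/// or the captured part is not a valid `usize`.
fn capture_foo(text: &str) -> Option<usize> {
    // Everything after the literal "foo=" plays the role of group 1.
    let captured = text.trim().strip_prefix("foo=")?;
    captured.parse().ok()
}
```

With the input from the example above, capture_foo("foo=1") yields Some(1), while text without the "foo=" prefix yields None.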

Changelog

v0.4.0

  • Add presence of .. target specifier

v0.3.0

  • Add parser specifier
  • Add inner_html target specifier
  • Trim spaces at both ends when extracting text nodes
  • Fix error message

v0.2.1

  • Fix the internal usage of the Rust standard library

v0.2.0

  • Rename "collect specifier" to "collector specifier"
  • Add "optional" collector
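
The "optional" collector is not shown in the examples. By analogy with collect, it is presumably written as (text of "#foo", optional) on an Option field, producing None when nothing matches the selector. The std-only sketch below illustrates that assumed behavior; extract_optional and first_match are hypothetical names, and returning an error for a value that is present but unparseable is also an assumption.

```rust
use std::num::ParseIntError;

/// Hypothetical helper (not part of html-extractor): `first_match` is the
/// text of the first element matching the selector, or `None` if no
/// element matched.
fn extract_optional(first_match: Option<&str>) -> Result<Option<usize>, ParseIntError> {
    first_match
        // Parse only when an element was found.
        .map(|t| t.trim().parse::<usize>())
        // Turn `Option<Result<_, _>>` into `Result<Option<_>, _>`.
        .transpose()
}
```

Under these assumptions a missing element gives Ok(None), a matching element gives Ok(Some(value)), and malformed text surfaces as an error instead of being silently dropped.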

v0.1.1

  • Fix the links in the documentation

Dependencies

~7.5MB
~145K SLoC