#html #parser #dom #pest #json #html-xml

html_parser

A simple and general purpose html/xhtml parser

13 unstable releases

0.7.0 May 11, 2023
0.6.3 Apr 21, 2022
0.6.2 Dec 21, 2020
0.5.0 Nov 28, 2020
0.1.0 May 20, 2020

#2 in #pest

11,238 downloads per month
Used in 51 crates (22 directly)

MIT license

55KB
591 lines

Html parser

A simple, general-purpose html/xhtml parser, available as both a library and a binary, built on Pest.

Features

  • Parse html & xhtml (not xml processing instructions)
  • Parse html-documents
  • Parse html-fragments
  • Parse empty documents
  • Parse with the same api for both documents and fragments
  • Parse custom, non-standard, elements; <cat/>, <Cat/> and <C4-t/>
  • Removes comments
  • Removes dangling elements
  • Iterate over all nodes in the dom tree

What it is not

  • It's not a high-performance browser-grade parser
  • It's not suitable for html validation
  • It's not a parser that includes element selection or dom manipulation

If your requirements match any of the above, then you're most likely looking for a different crate.

Examples bin

Parse html file

html_parser index.html

Parse stdin with pretty output

curl <website> | html_parser -p

Examples lib

Parse html document

    use html_parser::Dom;

    fn main() {
        let html = r#"
            <!doctype html>
            <html lang="en">
                <head>
                    <meta charset="utf-8">
                    <title>Html parser</title>
                </head>
                <body>
                    <h1 id="a" class="b c">Hello world</h1>
                    </h1> <!-- comments & dangling elements are ignored -->
                </body>
            </html>"#;

        assert!(Dom::parse(html).is_ok());
    }

Parse html fragment

    use html_parser::Dom;

    fn main() {
        let html = "<div id=cat />";
        assert!(Dom::parse(html).is_ok());
    }

Print to json

    use html_parser::{Dom, Result};

    fn main() -> Result<()> {
        let html = "<div id=cat />";
        let json = Dom::parse(html)?.to_json_pretty()?;
        println!("{}", json);
        Ok(())
    }

Dependencies

~2.6–3.5MB
~75K SLoC