#html-parser #table #html #parse #content

table-extract

Utility for extracting data from HTML tables

5 releases

0.2.3 Apr 22, 2023
0.2.2 Nov 3, 2019
0.2.1 Jul 9, 2017
0.2.0 Jul 8, 2017
0.1.0 Jul 7, 2017

#2826 in Parser implementations

536 downloads per month
Used in uupdump

MIT license

23KB
441 lines

TableExtract

TableExtract is a Rust library for extracting data from HTML tables. It is inspired by Perl's HTML::TableExtract.

Check out the crate documentation for more information.

Usage

TableExtract is on crates.io. To use it, just add this to your Cargo.toml:

[dependencies]
table-extract = "0.2"

Contributing

Contributions are welcome! There are two things to keep in mind:

  1. This project uses the stable Rust toolchain from rustup.
  2. This project uses cargo fmt to keep the code tidy.

License

© 2019 Mitchell Kember

TableExtract is available under the MIT License; see LICENSE for details.


lib.rs:

Utility for extracting data from HTML tables.

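This library allows you to parse tables from HTML documents and iterate over their rows. There are three entry points:

  - Table::find_first finds the first table.
  - Table::find_by_id finds a table by its HTML id.
  - Table::find_by_headers finds a table that has certain headers.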

Each of these returns an Option<Table>, since there might not be any matching table in the HTML. Once you have a table, you can iterate over it and access the contents of each Row.

Examples

Here is a simple example that uses Table::find_first to print the cells in each row of a table:

let html = r#"
    <table>
        <tr><th>Name</th><th>Age</th></tr>
        <tr><td>John</td><td>20</td></tr>
    </table>
"#;
let table = table_extract::Table::find_first(html).unwrap();
for row in &table {
    println!(
        "{} is {} years old",
        row.get("Name").unwrap_or("<name missing>"),
        row.get("Age").unwrap_or("<age missing>")
    )
}

If the document has multiple tables, we can use Table::find_by_headers to identify the one we want:

let html = r#"
    <table></table>
    <table>
        <tr><th>Name</th><th>Age</th></tr>
        <tr><td>John</td><td>20</td></tr>
    </table>
"#;
let table = table_extract::Table::find_by_headers(html, &["Age"]).unwrap();
for row in &table {
    for cell in row {
        println!("Table cell: {}", cell);
    }
}
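
The row.get("Name").unwrap_or(...) calls in the first example rely on header-based lookup that returns an Option, so a missing column or a short row yields a fallback string instead of a panic. The following is a minimal standard-library sketch of that lookup pattern. SketchTable, its fields, and its get method are hypothetical names made up for illustration; they are not the crate's types and say nothing definitive about how the crate is implemented.

```rust
use std::collections::HashMap;

/// Hypothetical stand-in for a parsed table (not the crate's own type).
struct SketchTable {
    /// Maps each header text to its column index.
    headers: HashMap<String, usize>,
    /// Cell text for every data row, in document order.
    rows: Vec<Vec<String>>,
}

impl SketchTable {
    /// Builds a table from a header row and a list of data rows.
    fn new(headers: &[&str], rows: Vec<Vec<&str>>) -> Self {
        let headers: HashMap<String, usize> = headers
            .iter()
            .enumerate()
            .map(|(index, name)| (name.to_string(), index))
            .collect();
        let rows: Vec<Vec<String>> = rows
            .into_iter()
            .map(|row| row.into_iter().map(String::from).collect::<Vec<String>>())
            .collect();
        SketchTable { headers, rows }
    }

    /// Looks up a cell by row index and header name.
    /// Returns None if the header is unknown, the row index is out of
    /// range, or the row has fewer cells than the header row.
    fn get(&self, row: usize, header: &str) -> Option<&str> {
        let column = *self.headers.get(header)?;
        self.rows.get(row)?.get(column).map(String::as_str)
    }
}

fn main() {
    // The second row is deliberately missing its "Age" cell.
    let table = SketchTable::new(
        &["Name", "Age"],
        vec![vec!["John", "20"], vec!["May"]],
    );
    for index in 0..table.rows.len() {
        println!(
            "{} is {} years old",
            table.get(index, "Name").unwrap_or("<name missing>"),
            table.get(index, "Age").unwrap_or("<age missing>")
        );
    }
    // Prints:
    // John is 20 years old
    // May is <age missing> years old
}
```

Returning Option<&str> rather than panicking keeps the decision about missing data with the caller, which matters for real-world HTML where rows are often ragged.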

Dependencies

~3.5–9MB
~84K SLoC