#html #html-document #querying #manipulating #html5ever #python #top

soup

Inspired by the Python library BeautifulSoup, this is a layer on top of html5ever that adds a different API for querying and manipulating HTML.

8 releases (4 breaking)

Uses old Rust 2015

0.5.1 Mar 25, 2021
0.5.0 Feb 14, 2020
0.4.1 Apr 29, 2019
0.3.0 Nov 14, 2018
0.1.1 Nov 2, 2018

#872 in Text processing


1,422 downloads per month
Used in 20 crates (17 directly)

CC-PDDC license

50KB
863 lines

Soup

Inspired by the Python library BeautifulSoup, this is a layer on top of html5ever that adds a different API for querying and manipulating HTML.

Documentation (latest release)

Documentation (master)

Installation

To use soup, add the following to your Cargo.toml:

[dependencies]
soup = "0.5"

Usage

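// src/main.rs
extern crate reqwest;
extern crate soup;

use std::error::Error;

use soup::prelude::*;

fn main() -> Result<(), Box<dyn Error>> {
    // `reqwest::get` is blocking in reqwest 0.9; on 0.10 and later the
    // blocking equivalent is `reqwest::blocking::get` (feature `blocking`).
    let response = reqwest::get("https://google.com")?;
    // `from_reader` returns an `io::Result`, so propagate failures with `?`.
    let soup = Soup::from_reader(response)?;
    // `text()` returns a `String`, so `map` (not `and_then`) yields `Option<String>`.
    let _some_text = soup.tag("p")
        .attr("class", "hidden")
        .find()
        .map(|p| p.text());
    Ok(())
}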


lib.rs:

Inspired by the Python library BeautifulSoup, soup is a layer on top of html5ever that aims to provide a slightly different API for querying and manipulating HTML.

Examples (inspired by bs4's docs)

Here is the HTML document we will be using for the rest of the examples:

const THREE_SISTERS: &'static str = r#"
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>

<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>

<p class="story">...</p>
"#;

First let's try searching for a tag with a specific name:

use soup::prelude::*;

let soup = Soup::new(THREE_SISTERS);

let title = soup.tag("title").find().expect("Couldn't find tag 'title'");
assert_eq!(title.display(), "<title>The Dormouse's story</title>");
assert_eq!(title.name(), "title");
assert_eq!(title.text(), "The Dormouse's story".to_string());
assert_eq!(title.parent().expect("Couldn't find parent of 'title'").name(), "head");

let p = soup.tag("p").find().expect("Couldn't find tag 'p'");
assert_eq!(
    p.display(),
    r#"<p class="title"><b>The Dormouse's story</b></p>"#
);
assert_eq!(p.get("class"), Some("title".to_string()));

So we see that .find will give us the first element that matches the query, and we've seen some of the methods that we can call on the results. But what if we want to retrieve more than one element with the query? For that, we'll use .find_all:

use soup::prelude::*;

let soup = Soup::new(THREE_SISTERS);
// .find returns only the first 'a' tag
let a = soup.tag("a").find().expect("Couldn't find tag 'a'");
assert_eq!(
    a.display(),
    r#"<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>"#
);
// but .find_all will return _all_ of them:
let a_s = soup.tag("a").find_all();
assert_eq!(
    a_s.map(|a| a.display())
       .collect::<Vec<_>>()
       .join("\n"),
    r#"<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>
<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>
<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>"#
);

Since .find_all returns an iterator, you can use it with all the methods you would use with other iterators:

use soup::prelude::*;

let soup = Soup::new(THREE_SISTERS);
let expected = [
    "http://example.com/elsie",
    "http://example.com/lacie",
    "http://example.com/tillie",
];

for (i, link) in soup.tag("a").find_all().enumerate() {
    let href = link.get("href").expect("Couldn't find link with 'href' attribute");
    assert_eq!(href, expected[i].to_string());
}
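
Because the query result is an ordinary iterator, adaptors such as filter, filter_map, and collect compose the way they do anywhere else in Rust. The sketch below shows that idea using only the standard library. `Link`, `hrefs_starting_with`, and `sample_links` are hypothetical names made up for illustration; they stand in for the nodes that .find_all yields and are not part of soup.

```rust
/// Hypothetical stand-in for a matched `<a>` element; not a soup type.
struct Link {
    href: Option<&'static str>,
    text: &'static str,
}

/// Collects the `href` of every link whose text starts with `prefix`,
/// skipping links that have no `href` attribute.
fn hrefs_starting_with<I>(links: I, prefix: &str) -> Vec<String>
where
    I: Iterator<Item = Link>,
{
    links
        .filter(|link| link.text.starts_with(prefix))
        .filter_map(|link| link.href.map(|href| href.to_string()))
        .collect()
}

/// The three links from the example document, plus one link without `href`.
fn sample_links() -> Vec<Link> {
    vec![
        Link { href: Some("http://example.com/elsie"), text: "Elsie" },
        Link { href: Some("http://example.com/lacie"), text: "Lacie" },
        Link { href: Some("http://example.com/tillie"), text: "Tillie" },
        Link { href: None, text: "Lost" },
    ]
}
```

Calling `hrefs_starting_with(sample_links().into_iter(), "L")` keeps only the Lacie URL: "Lost" also starts with "L" but has no `href`, so filter_map drops it.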

The top-level structure we've been working with, soup, implements the same methods as the query results. Calling one of them on soup delegates the call to the root node:

use soup::prelude::*;

let soup = Soup::new(THREE_SISTERS);
let text = soup.text();
assert_eq!(
    text,
    r#"The Dormouse's story

The Dormouse's story

Once upon a time there were three little sisters; and their names were
Elsie,
Lacie and
Tillie;
and they lived at the bottom of a well.

...
"#
);
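
One way to picture this delegation is a wrapper type that owns the root node and forwards each call to it. The sketch below is a toy model using only the standard library. `TreeNode`, `Document`, and `sample_document` are hypothetical names, and nothing here is taken from soup's source.

```rust
/// Toy tree node: either a run of text or an element with children.
/// Illustrative only; this is not a soup type.
enum TreeNode {
    Text(String),
    Element { children: Vec<TreeNode> },
}

impl TreeNode {
    /// Concatenates every text node beneath this node, in document order.
    fn text(&self) -> String {
        match self {
            TreeNode::Text(content) => content.clone(),
            TreeNode::Element { children } => {
                children.iter().map(|child| child.text()).collect()
            }
        }
    }
}

/// Hypothetical top-level handle that owns the root of the tree.
struct Document {
    root: TreeNode,
}

impl Document {
    /// Forwards to the root node, so callers never have to fetch the root first.
    fn text(&self) -> String {
        self.root.text()
    }
}

/// Builds a tree equivalent to `<p>Hello, <b>world</b></p>`.
fn sample_document() -> Document {
    let bold = TreeNode::Element {
        children: vec![TreeNode::Text("world".to_string())],
    };
    let paragraph = TreeNode::Element {
        children: vec![TreeNode::Text("Hello, ".to_string()), bold],
    };
    Document { root: paragraph }
}
```

With this layout, `sample_document().text()` and a call to `text()` on the root node return the same string, which is the behaviour the paragraph above describes for soup.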

Queries accept more than plain strings. For example, you can search with a Regex:

use regex::Regex;

let soup = Soup::new(r#"<body><p>some text, <b>Some bold text</b></p></body>"#);
let results = soup.tag(Regex::new("^b")?)
                  .find_all()
                  .map(|tag| tag.name().to_string())
                  .collect::<Vec<_>>();
assert_eq!(results, vec!["body".to_string(), "b".to_string()]);

Passing true will match everything:


let soup = Soup::new(r#"<body><p>some text, <b>Some bold text</b></p></body>"#);
let results = soup.tag(true)
                  .find_all()
                  .map(|tag| tag.name().to_string())
                  .collect::<Vec<_>>();
assert_eq!(results, vec![
    "html".to_string(),
    "head".to_string(),
    "body".to_string(),
    "p".to_string(),
    "b".to_string(),
]);

Passing false always returns no results. If that is useful to you, please let me know.
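
A common way to let one method accept strings, booleans, and other matchers is a small trait that each of those types implements. The sketch below illustrates that pattern with the standard library only. The trait and the `select` helper are assumptions made for illustration and may differ from what soup defines; a regex implementation is left out to avoid the extra dependency.

```rust
/// Toy "pattern" abstraction: anything that can decide whether a tag name
/// matches. Illustrative only; not copied from soup.
trait Pattern {
    fn matches(&self, name: &str) -> bool;
}

/// A string matches only the tag with exactly that name.
impl<'a> Pattern for &'a str {
    fn matches(&self, name: &str) -> bool {
        *self == name
    }
}

/// `true` matches every tag and `false` matches none.
impl Pattern for bool {
    fn matches(&self, _name: &str) -> bool {
        *self
    }
}

/// Hypothetical helper: returns the names accepted by `pattern`, in order.
fn select<P: Pattern>(names: &[&str], pattern: P) -> Vec<String> {
    names
        .iter()
        .filter(|name| pattern.matches(**name))
        .map(|name| (*name).to_string())
        .collect()
}
```

Given the names html, head, body, p, and b, `select(&names, "b")` keeps only b, `select(&names, true)` keeps all five, and `select(&names, false)` keeps none.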

So what can you do once you get the result of a query? Well, for one thing, you can traverse the tree a few different ways. You can ascend the tree:


let soup = Soup::new(r#"<body><p>some text, <b>Some bold text</b></p></body>"#);
let b = soup.tag("b")
            .find()
            .expect("Couldn't find tag 'b'");
let p = b.parent()
         .expect("Couldn't find parent of 'b'");
assert_eq!(p.name(), "p".to_string());
let body = p.parent()
            .expect("Couldn't find parent of 'p'");
assert_eq!(body.name(), "body".to_string());

Or you can descend it:


let soup = Soup::new(r#"<body><ul><li>ONE</li><li>TWO</li><li>THREE</li></ul></body>"#);
let ul = soup.tag("ul")
             .find()
             .expect("Couldn't find tag 'ul'");
let mut li_tags = ul.children().filter(|child| child.is_element());
assert_eq!(li_tags.next().map(|tag| tag.text().to_string()), Some("ONE".to_string()));
assert_eq!(li_tags.next().map(|tag| tag.text().to_string()), Some("TWO".to_string()));
assert_eq!(li_tags.next().map(|tag| tag.text().to_string()), Some("THREE".to_string()));
assert!(li_tags.next().is_none());

Or ascend it with an iterator:


let soup = Soup::new(r#"<body><ul><li>ONE</li><li>TWO</li><li>THREE</li></ul></body>"#);
let li = soup.tag("li").find().expect("Couldn't find tag 'li'");
let mut parents = li.parents();
assert_eq!(parents.next().map(|tag| tag.name().to_string()), Some("ul".to_string()));
assert_eq!(parents.next().map(|tag| tag.name().to_string()), Some("body".to_string()));
assert_eq!(parents.next().map(|tag| tag.name().to_string()), Some("html".to_string()));
assert_eq!(parents.next().map(|tag| tag.name().to_string()), Some("[document]".to_string()));
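
An ancestor iterator like this can be modelled as repeatedly following a parent link until none is left. The sketch below does that with `std::iter::successors` over a flat arena of nodes. The `ArenaNode` layout and the `ancestor_names` and `sample_arena` helpers are hypothetical and say nothing about how soup stores its tree.

```rust
/// Toy node stored in a flat slice; `parent` is an index into the same slice.
/// This layout is an assumption for illustration, not soup's representation.
struct ArenaNode {
    name: &'static str,
    parent: Option<usize>,
}

/// Yields the name of every ancestor of the node at `start`, nearest first.
fn ancestor_names<'a>(
    arena: &'a [ArenaNode],
    start: usize,
) -> impl Iterator<Item = &'static str> + 'a {
    // Begin at the parent of `start`, then keep following parent links
    // until a node without a parent (the document root) has been yielded.
    std::iter::successors(arena[start].parent, move |&idx| arena[idx].parent)
        .map(move |idx| arena[idx].name)
}

/// Arena for `<body><ul><li>ONE</li></ul></body>` beneath an `html`
/// element and a document root.
fn sample_arena() -> Vec<ArenaNode> {
    vec![
        ArenaNode { name: "[document]", parent: None },
        ArenaNode { name: "html", parent: Some(0) },
        ArenaNode { name: "body", parent: Some(1) },
        ArenaNode { name: "ul", parent: Some(2) },
        ArenaNode { name: "li", parent: Some(3) },
    ]
}
```

Starting from the li node at index 4, the iterator yields ul, body, html, and [document], in the same order as the assertions above; starting from the root yields nothing.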

Dependencies

~3.5–5MB
~97K SLoC