#selector #query #class #operation

htmls

This is a library for parsing HTML

2 releases

Uses new Rust 2024

new 0.1.4 Apr 26, 2025
0.1.3 Apr 26, 2025
0.1.2 Apr 26, 2025
0.1.1 Apr 26, 2025
0.1.0 Apr 26, 2025

#428 in Text processing

22 downloads per month

Apache-2.0

120KB
3K SLoC

HTML Selector

An HTML content extraction tool similar to CSS selectors that can precisely extract and process the required content from HTML documents through a simple and intuitive query language.

Example

use query::Query;

fn main() {
    let html = r#"
    <div class="a">
        <p>text 1</p>
        <p>text 2</p>
        <p>text 3</p>
    </div>
    <div class="b">
        <p>text 4</p>
        <p>text 5</p>
        <p>text 6</p>
    </div>
    "#;
    let q = Query::new(html);
    let result = q.query(r#"(class a > tag p:1:2 | class b > tag p:0) > text @replace," ","""#).texts();
    println!("{:?}", result); // ["text2", "text3", "text4"]
}

Basic Selectors

Selector Syntax Description
Class Selector class className Select elements with the specified class name
ID Selector id IDValue Select elements with the specified ID
Tag Selector tag tagName Select elements with the specified HTML tag
Attribute Selector attr "attributeName" Select elements with the specified attribute
Attribute Value Selector attr "attributeName" "value" Select elements with the attribute matching the specified value

Text Extraction

Operation Syntax Description
Text Content text Extract the text content of elements
Link Address href Extract the href attribute value of elements
Image Address src Extract the src attribute value of elements

Pipeline Operations

The pipeline operator > is used to connect multiple selectors for layer-by-layer querying:

class main > tag p > text

This query first selects elements with the class "main", then finds p tags within them, and finally extracts the text content of these p tags.

Regular Expression Matching

Using the tilde ~ allows for regular expression matching:

class ~".*ain" > tag p > text

This query selects elements with class names matching the regular expression .*ain, for example, it can match "main", "again", etc.

Index Selection

You can add an index after the selector to select elements at specific positions:

Index Syntax Example Description
Single Index class a:1 Select the 2nd element (index starts from 0)
Range Index class a:1:3 Select elements with indices from 1 to 3
Range with Step class a:1:10:2 Select elements with indices from 1 to 10 with a step of 2
Multiple Indices class a:1,3,5 Select elements with indices 1, 3, and 5

Text Processing Functions

Using the @ symbol can invoke built-in text processing functions:

Function Syntax Description
trim text @trim Remove whitespace from both ends of the text
replace text @replace,A,B Replace A with B in the text
format text @format,"{}value" Format the text with the specified template
join text @join,"," Join multiple texts with the specified separator
lowercase text @lowercase Convert text to lowercase
uppercase text @uppercase Convert text to uppercase
contains text @contains,A Get the string containing a certain substring
starts_with text @starts_with,A Get the string whose beginning contains a certain substring
starts_with text @ends_with,A Get the string whose ending contains a certain substring

Multiple functions can be chained:

class main > text @trim @replace,A,a @lowercase

Set Operations

Operator Example Description
| expr1 | expr2 Union, merge two selection results
& expr1 & expr2 Intersection, get common elements from two results
^ expr1 ^ expr2 Difference, exclude elements of expr2 from expr1

Complex set operations can be grouped with parentheses:

(((class a ^ class c) | class b) > tag a | class main > tag a) > text @trim

Dependencies

~3.5–9MB
~84K SLoC