6 releases
Version | Released |
---|---|
0.2.1 | Nov 3, 2024 |
0.2.0 | Nov 1, 2024 |
0.1.3 | Oct 25, 2024 |
0.1.1 | Jun 27, 2023 |
htmlparser
htmlparser is a low-level, pull-based, zero-allocation HTML parser.
Example
for token in htmlparser::Tokenizer::from("<tagname name='value'/>") {
    println!("{:?}", token);
}
Why a new library?
This library is basically a low-level XML tokenizer that preserves the positions of the tokens. It is not intended to be used directly.
It is a copy of xmlparser with some adjustments to parse HTML.
Benefits
- All tokens contain `StrSpan` structs which represent the position of the substring in the original document.
- Good error processing. All error types contain the position (line:column) where it occurred.
- No heap allocations.
- No dependencies.
- Tiny. ~1400 LOC and ~30KiB in the release build according to `cargo-bloat`.
- Supports `no_std` builds. To use without the standard library, disable the default features.
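To make the position-tracking idea above concrete, here is a minimal sketch of how a byte offset into the original document can be turned into a line:column pair. The `Span` struct and the `line_col` function are hypothetical stand-ins written for this illustration, not the crate's `StrSpan` or error types. The sketch assumes 1-based lines and columns, with columns counted in characters.

```rust
/// Hypothetical stand-in for a span: byte offsets into the original document.
/// This is NOT the crate's `StrSpan`, only an illustration of the idea.
#[derive(Debug, Clone, Copy, PartialEq)]
struct Span {
    start: usize,
    end: usize,
}

/// Converts a byte offset into a 1-based (line, column) pair.
/// Columns are counted in characters, not bytes (an assumption of this sketch).
/// Returns `None` instead of panicking when the offset is out of range
/// or does not fall on a character boundary.
fn line_col(text: &str, offset: usize) -> Option<(usize, usize)> {
    let before = text.get(..offset)?;
    // Every newline before the offset starts a new line.
    let line = before.matches('\n').count() + 1;
    // The current line starts right after the last newline, or at offset 0.
    let line_start = before.rfind('\n').map_or(0, |i| i + 1);
    let col = before[line_start..].chars().count() + 1;
    Some((line, col))
}

fn main() {
    let doc = "<root>\n  <child id='1'/>\n</root>";
    let start = doc.find("child").unwrap();
    let span = Span { start, end: start + "child".len() };

    // The span can be resolved back to the substring it covers.
    assert_eq!(&doc[span.start..span.end], "child");

    match line_col(doc, span.start) {
        Some((line, col)) => println!("'child' starts at {}:{}", line, col),
        None => println!("offset is not a valid position"),
    }
}
```

Storing only offsets keeps tokens cheap to copy and avoids allocation; the line:column pair is computed on demand, typically only when an error has to be reported.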
Limitations
- Currently, only ENTITY objects are parsed from the DOCTYPE. All others are ignored.
- No tree structure validation. So an XML like `<root><child></root></child>` or a string without root element will be parsed without errors. You should check for this manually. On the other hand, `<a/><a/>` will lead to an error.
- Duplicated attributes are not an error. So XML like `<item a="v1" a="v2"/>` will be parsed without errors. You should check for this manually.
- UTF-8 only.
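The manual checks mentioned above could look roughly like the following sketch. It runs over a simplified, hypothetical `Event` enum rather than the crate's real token type, so the `Event` and `CheckError` names and the `check` function are assumptions made for illustration. It also assumes strict nesting, so HTML void elements such as `<br>` would need extra handling.

```rust
/// Hypothetical, simplified token stream; the crate's real token type differs.
#[derive(Debug, Clone, Copy)]
enum Event<'a> {
    Open(&'a str),  // "<name"
    Attr(&'a str),  // attribute name inside the current start tag
    Close(&'a str), // "</name>"
    SelfClose,      // "/>"
}

#[derive(Debug, PartialEq)]
enum CheckError {
    MismatchedClose { expected: Option<String>, found: String },
    Unclosed(String),
    DuplicateAttribute(String),
}

/// Checks element nesting with a stack and rejects repeated attribute names.
fn check(events: &[Event]) -> Result<(), CheckError> {
    let mut open: Vec<&str> = Vec::new(); // stack of currently open elements
    let mut attrs: Vec<&str> = Vec::new(); // attributes seen in the current start tag
    for ev in events {
        match *ev {
            Event::Open(name) => {
                open.push(name);
                attrs.clear();
            }
            Event::Attr(name) => {
                if attrs.contains(&name) {
                    return Err(CheckError::DuplicateAttribute(name.to_string()));
                }
                attrs.push(name);
            }
            Event::SelfClose => {
                // "/>" closes the element that was just opened.
                open.pop();
            }
            Event::Close(name) => match open.pop() {
                Some(top) if top == name => {}
                other => {
                    return Err(CheckError::MismatchedClose {
                        expected: other.map(String::from),
                        found: name.to_string(),
                    })
                }
            },
        }
    }
    match open.pop() {
        Some(name) => Err(CheckError::Unclosed(name.to_string())),
        None => Ok(()),
    }
}

fn main() {
    use Event::*;
    // <root><child></root></child>
    let crossed = [Open("root"), Open("child"), Close("root"), Close("child")];
    println!("{:?}", check(&crossed));
    // <item a="v1" a="v2"/>
    let dup = [Open("item"), Attr("a"), Attr("a"), SelfClose];
    println!("{:?}", check(&dup));
}
```

Unlike the tokenizer itself, this sketch allocates (it uses `Vec` and `String`), which is usually acceptable in a higher-level layer built on top of a tokenizer.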
Safety
- The library must not panic. Any panic is considered a critical bug and should be reported.
- The library forbids unsafe code.
License
Licensed under either of
- Apache License, Version 2.0 (LICENSE-APACHE or http://www.apache.org/licenses/LICENSE-2.0)
- MIT license (LICENSE-MIT or http://opensource.org/licenses/MIT)
at your option.
Contribution
Unless you explicitly state otherwise, any contribution intentionally submitted for inclusion in the work by you, as defined in the Apache-2.0 license, shall be dual licensed as above, without any additional terms or conditions.