1 unstable release
Uses new Rust 2024
| Version | Released |
|---|---|
| 0.1.0 | Oct 3, 2025 |
Unobtanium Text Pile
This Rust crate turns HTML into externally annotated plain text that is optimized for serialization to the postcard format.
It implements the TextPile v2 design without the "derived information" (segmenting and language detection) part.
What it does
The central piece of this crate is the TextPile: a piece of text along with lists of semantic and language annotations, each stored as a type, an offset and a length.
All offsets, lengths and sizes in this crate refer to bytes in the UTF-8 representation of the text.
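To illustrate what byte-based offsets mean in practice, here is a minimal sketch using only the standard library. `ByteSpan` and `slice_span` are hypothetical names made up for this example and are not part of this crate's API.

```rust
/// Hypothetical annotation span (not this crate's type):
/// a byte offset and a byte length into UTF-8 text.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub struct ByteSpan {
    pub offset: usize,
    pub length: usize,
}

/// Returns the text covered by `span`, or `None` if the span is out of
/// bounds or does not start and end on UTF-8 character boundaries.
pub fn slice_span(text: &str, span: ByteSpan) -> Option<&str> {
    let end = span.offset.checked_add(span.length)?;
    // `str::get` checks both the bounds and the character boundaries.
    text.get(span.offset..end)
}
```

For the text "naïve café", the word "café" starts at byte offset 7 and is 5 bytes long, even though it starts at character index 6 and has only 4 characters. A span that starts in the middle of the two-byte "ï" is rejected.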
Parsing HTML
This crate uses the scraper crate for parsing HTML and takes a document node reference. This allows working with incomplete documents, modifying documents in a DOM-like way before turning them into text piles (e.g. to remove navigation and footer elements), and extracting other data from the parsed HTML without having to run the parser again.
This feature can be enabled using the scraper feature flag.
Dehydrating and Hydrating
Data can often be represented in multiple ways: some are better for processing, others take less space.
The TextPile is the form that is optimized for being worked with, while the DehydratedTextPileMetadata holds the text pile metadata in a form that is optimized for being serialized to the postcard format. It tries to reduce both the number of integers needed to represent the metadata and the value of those integers (leading to fewer bytes per number, thanks to varint encoding).
Converting back and forth between the two forms is called dehydrating and hydrating.
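The following sketch shows why smaller numbers matter under varint encoding. It is not this crate's implementation: storing gaps instead of absolute offsets is just one assumed way of making values smaller, and all function names are hypothetical. `varint_len` models an unsigned LEB128-style varint with 7 payload bits per byte.

```rust
/// Number of bytes an unsigned LEB128-style varint needs for `value`
/// (7 payload bits per byte, so values below 128 fit in one byte).
pub fn varint_len(mut value: u64) -> usize {
    let mut len = 1;
    while value >= 0x80 {
        value >>= 7;
        len += 1;
    }
    len
}

/// Total varint-encoded size of a list of numbers.
pub fn encoded_size(values: &[u64]) -> usize {
    values.iter().map(|&v| varint_len(v)).sum()
}

/// Hypothetical "dehydrating" step: replace absolute offsets with the gap
/// to the previous offset. Assumes `offsets` is sorted in ascending order.
pub fn to_deltas(offsets: &[u64]) -> Vec<u64> {
    let mut previous = 0u64;
    offsets
        .iter()
        .map(|&offset| {
            let delta = offset - previous;
            previous = offset;
            delta
        })
        .collect()
}

/// Hypothetical "hydrating" step: rebuild absolute offsets from the gaps.
pub fn from_deltas(deltas: &[u64]) -> Vec<u64> {
    let mut current = 0u64;
    deltas
        .iter()
        .map(|&delta| {
            current += delta;
            current
        })
        .collect()
}
```

With the offsets 100, 300, 20000 and 20050, the absolute values need 9 bytes as varints, while the gaps 100, 200, 19700 and 50 need 7 bytes and still hydrate back to the same offsets.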
Markers
Semantic markers in the context of this crate are a flattened version of semantic HTML that is optimized for compact storage. They can communicate that text was inside a header, footer, main, paragraph, headline, link, … but not in which order these were nested or how many layers of them were present. Markers of different types may overlap; markers of the same type may not.
Markers may follow each other without any space in between. This is used to encode paragraphs in a way that makes it easy to recognize blocks of paragraphs.
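The overlap rule can be sketched as a small check. The `MarkerKind` and `Marker` types below are hypothetical stand-ins with only three kinds, not the types this crate exports.

```rust
/// Hypothetical marker kinds; the real crate covers more semantic elements.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum MarkerKind {
    Heading,
    Paragraph,
    Link,
}

/// Hypothetical marker: a kind plus a byte offset and byte length.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub struct Marker {
    pub kind: MarkerKind,
    pub offset: usize,
    pub length: usize,
}

impl Marker {
    /// Byte offset one past the last byte covered by the marker.
    pub fn end(&self) -> usize {
        self.offset + self.length
    }
}

/// Returns true if no two markers of the same kind overlap.
/// Markers that merely touch (one ends where the next begins) are allowed,
/// as are overlapping markers of different kinds.
pub fn same_kind_markers_are_disjoint(markers: &[Marker]) -> bool {
    for (i, a) in markers.iter().enumerate() {
        for b in &markers[i + 1..] {
            let overlap = a.offset < b.end() && b.offset < a.end();
            if a.kind == b.kind && overlap {
                return false;
            }
        }
    }
    true
}
```

Two paragraph markers covering bytes 0..10 and 10..20 pass this check, as does a link marker at 5..8 inside the first paragraph. A third paragraph marker at 15..25 would make it fail.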
Language Spans
Language spans are similar to markers, but instead of semantics they communicate which language the text was tagged with. Language spans may not overlap.
Language spans can encode a subset of BCP 47 language tags: those that start with a primary language subtag, optionally followed by an extended language, script and region subtag. Everything after that is truncated. Tags that don't start with a primary language subtag are currently ignored.
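As a rough illustration of this truncation, the sketch below keeps only the primary language, extended language, script and region subtags. It is an assumption about the behaviour described above, not this crate's parser. `truncate_language_tag` is a hypothetical name, only 2 or 3 letter primary language subtags are accepted, and no case normalization is done.

```rust
/// True if `s` consists of exactly `len` ASCII letters.
fn is_alpha(s: &str, len: usize) -> bool {
    s.len() == len && s.bytes().all(|b| b.is_ascii_alphabetic())
}

/// True if `s` consists of exactly `len` ASCII digits.
fn is_digit(s: &str, len: usize) -> bool {
    s.len() == len && s.bytes().all(|b| b.is_ascii_digit())
}

/// Hypothetical helper: keeps `language[-extlang][-script][-region]` and
/// drops everything after. Returns `None` if the tag does not start with
/// a 2 or 3 letter primary language subtag.
pub fn truncate_language_tag(tag: &str) -> Option<String> {
    let mut subtags = tag.split('-').peekable();

    let primary = subtags.next()?;
    if !(is_alpha(primary, 2) || is_alpha(primary, 3)) {
        return None;
    }
    let mut kept = vec![primary];

    // Extended language subtag: three letters.
    if let Some(s) = subtags.next_if(|s| is_alpha(*s, 3)) {
        kept.push(s);
    }
    // Script subtag: four letters.
    if let Some(s) = subtags.next_if(|s| is_alpha(*s, 4)) {
        kept.push(s);
    }
    // Region subtag: two letters or three digits.
    if let Some(s) = subtags.next_if(|s| is_alpha(*s, 2) || is_digit(*s, 3)) {
        kept.push(s);
    }
    Some(kept.join("-"))
}
```

Under these assumptions "zh-cmn-Hans-CN-x-private" is shortened to "zh-cmn-Hans-CN", "de-1996" to "de" (the variant is dropped), and "x-klingon" is ignored.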
License
The Unobtanium Text Pile is licensed under the LGPL-3.0-only license.
The project aims to be compliant with version 3.3 of the REUSE specification.