unobtanium-text-pile

Turns HTML into externally annotated plain text that is optimized for being serialized to the postcard format

1 unstable release

Uses new Rust 2024

0.1.0 Oct 3, 2025

LGPL-3.0-only

Unobtanium Text Pile

Documentation | Codeberg

This Rust crate turns HTML into externally annotated plain text that is optimized for being serialized to the postcard format.

It implements the TextPile v2 design without the "derived information" (segmenting and language detection) part.

What it does

The central piece of this crate is the TextPile: a piece of text together with lists of semantic and language annotations, each stored as a type, an offset and a length.

All offsets, lengths and sizes in this crate refer to bytes in the UTF-8 representation of the text.
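To illustrate what byte-based offsets mean in practice, here is a minimal sketch using only the standard library. The `Annotation` struct and the `annotated_text` function are hypothetical names made up for this example and are not the crate's actual API.

```rust
/// Hypothetical annotation shape (not the crate's real type):
/// a kind plus a byte range in the UTF-8 text.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub struct Annotation {
    pub kind: &'static str,
    /// Byte offset into the UTF-8 text.
    pub offset: usize,
    /// Length in bytes, not in characters.
    pub length: usize,
}

/// Returns the annotated slice of `text`.
///
/// Yields `None` if the range is out of bounds or does not fall on
/// UTF-8 character boundaries.
pub fn annotated_text<'a>(text: &'a str, annotation: &Annotation) -> Option<&'a str> {
    let end = annotation.offset.checked_add(annotation.length)?;
    text.get(annotation.offset..end)
}
```

In the text "Grüße, world" the word "Grüße" has 5 characters but occupies 7 bytes, so an annotation covering "world" starts at byte offset 9, not at character index 7.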

Parsing HTML

This crate uses the scraper crate to parse HTML and takes a reference to a document node. This allows working with incomplete documents, modifying documents in a DOM-like way before turning them into text piles (e.g. to remove navigation and footer elements), and extracting other data from the parsed HTML without having to run the parser again.

This feature can be enabled using the scraper feature flag.

Dehydrating and Hydrating

Data can often be represented in multiple ways: some representations are better for processing, others take less space.

The TextPile is the form that is optimized for being worked with. The DehydratedTextPileMetadata holds the text pile metadata in a form that is optimized for being serialized to the postcard format: it tries to reduce both the number of integers needed to represent the metadata and the magnitude of those integers (smaller values take fewer bytes thanks to varint encoding).

Converting to the dehydrated form is called dehydrating; converting back is called hydrating.
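The effect of smaller numbers on serialized size can be sketched with a little arithmetic. The example below assumes a LEB128-style varint with 7 payload bits per byte, which is how postcard encodes larger integer types. Storing offsets as deltas is only an assumed illustration of the idea; it is not a description of the actual DehydratedTextPileMetadata layout, and all function names are hypothetical.

```rust
/// Number of bytes a LEB128-style varint needs for `value`
/// (7 payload bits per byte).
pub fn varint_len(mut value: u64) -> usize {
    let mut len = 1;
    while value >= 0x80 {
        value >>= 7;
        len += 1;
    }
    len
}

/// Total varint-encoded size of a list of numbers.
pub fn encoded_size(values: &[u64]) -> usize {
    values.iter().map(|&v| varint_len(v)).sum()
}

/// "Dehydrate": replace each absolute offset with the distance to the
/// previous one. Assumes `offsets` is sorted in ascending order.
pub fn to_deltas(offsets: &[u64]) -> Vec<u64> {
    let mut previous = 0u64;
    offsets
        .iter()
        .map(|&offset| {
            let delta = offset - previous;
            previous = offset;
            delta
        })
        .collect()
}

/// "Hydrate": rebuild the absolute offsets from the deltas.
pub fn from_deltas(deltas: &[u64]) -> Vec<u64> {
    let mut total = 0u64;
    deltas
        .iter()
        .map(|&delta| {
            total += delta;
            total
        })
        .collect()
}
```

For the offsets 70000, 70040, 70100 and 70120, the absolute values need 12 bytes as varints, while the deltas 70000, 40, 60 and 20 need only 6.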

Markers

Semantic markers in the context of this crate are a flattened version of semantic HTML that is optimized for compact storage. They can communicate that text was inside a header, footer, main, paragraph, headline, link, … but not in which order these were nested or how many layers of them were present. Markers of different types may overlap; markers of the same type may not.

Markers may directly follow each other without any space in between. This is used to encode paragraphs in a way that makes blocks of paragraphs easy to recognize.
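The overlap rule and the idea of paragraph blocks can be sketched as follows. `MarkerKind`, `Marker` and both functions are hypothetical and only model the rules described above; the crate's own types may look different.

```rust
/// Hypothetical marker kinds, a small subset for illustration.
#[derive(Debug, Clone, Copy, PartialEq, Eq, PartialOrd, Ord)]
pub enum MarkerKind {
    Paragraph,
    Headline,
    Link,
}

/// Hypothetical marker: a kind plus a byte range.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub struct Marker {
    pub kind: MarkerKind,
    pub offset: usize,
    pub length: usize,
}

impl Marker {
    /// Exclusive end of the marked byte range.
    pub fn end(&self) -> usize {
        self.offset + self.length
    }
}

/// True when no two markers of the same kind overlap.
/// Markers that merely touch (end == next offset) are allowed.
pub fn same_kind_markers_are_disjoint(markers: &[Marker]) -> bool {
    let mut sorted = markers.to_vec();
    sorted.sort_by_key(|m| (m.kind, m.offset));
    sorted
        .windows(2)
        .all(|pair| pair[0].kind != pair[1].kind || pair[0].end() <= pair[1].offset)
}

/// Groups paragraph markers that touch each other into blocks,
/// returned as (start, end) byte ranges.
pub fn paragraph_blocks(markers: &[Marker]) -> Vec<(usize, usize)> {
    let mut paragraphs: Vec<&Marker> = markers
        .iter()
        .filter(|m| m.kind == MarkerKind::Paragraph)
        .collect();
    paragraphs.sort_by_key(|m| m.offset);

    let mut blocks: Vec<(usize, usize)> = Vec::new();
    for paragraph in paragraphs {
        if let Some(last) = blocks.last_mut() {
            if last.1 == paragraph.offset {
                // No gap to the previous paragraph: extend the current block.
                last.1 = paragraph.end();
                continue;
            }
        }
        blocks.push((paragraph.offset, paragraph.end()));
    }
    blocks
}
```

Two paragraphs covering bytes 0 to 10 and 10 to 25 form one block, while a paragraph starting at byte 40 starts a new block. A link covering bytes 5 to 15 may overlap both paragraphs because it is of a different kind.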

Language Spans

Language spans are similar to markers, but instead of semantics they communicate which language the text was tagged with. Language spans may not overlap.

Language spans can encode a subset of BCP 47 language tags: a tag must start with a primary language subtag and may continue with extended language, script and region subtags. Everything after those is truncated. Tags that don't start with a primary language subtag are currently ignored.
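The truncation rule can be sketched with a small standalone function. This is an assumption-laden illustration, not the crate's parser: `truncate_language_tag` is a hypothetical name, the primary language is assumed to be 2 or 3 ASCII letters, at most one extended language subtag is kept, and the input casing is left untouched.

```rust
fn is_alpha(subtag: &str) -> bool {
    subtag.bytes().all(|b| b.is_ascii_alphabetic())
}

fn is_digit(subtag: &str) -> bool {
    subtag.bytes().all(|b| b.is_ascii_digit())
}

/// Keeps the primary language plus optional extended language, script
/// and region subtags, and drops everything after them.
///
/// Returns `None` when the tag does not start with a primary language
/// subtag (assumed here to be 2 or 3 ASCII letters).
pub fn truncate_language_tag(tag: &str) -> Option<String> {
    let mut subtags = tag.split('-').peekable();

    let primary = subtags.next()?;
    if !(2..=3).contains(&primary.len()) || !is_alpha(primary) {
        return None;
    }
    let mut kept = vec![primary];

    // Extended language: 3 letters.
    if let Some(extlang) = subtags.next_if(|s| s.len() == 3 && is_alpha(s)) {
        kept.push(extlang);
    }
    // Script: 4 letters.
    if let Some(script) = subtags.next_if(|s| s.len() == 4 && is_alpha(s)) {
        kept.push(script);
    }
    // Region: 2 letters or 3 digits.
    if let Some(region) =
        subtags.next_if(|s| (s.len() == 2 && is_alpha(s)) || (s.len() == 3 && is_digit(s)))
    {
        kept.push(region);
    }

    Some(kept.join("-"))
}
```

Under these assumptions, "de-CH-1996" is truncated to "de-CH" because the variant subtag comes after the region, and "x-klingon" is ignored because it does not start with a primary language subtag.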

License

The Unobtanium Text Pile is licensed under the LGPL-3.0-only license.

The project aims to be compliant with version 3.3 of the REUSE specification.
