html-mumu — HTML helpers for the Lava/MuMu language
Version: 0.2.0-rc.1
Repository: gitlab.com/tofo/html-mumu
License: MIT OR Apache-2.0 (dual)
html-mumu is a lean HTML toolkit for the Lava/MuMu ecosystem. It focuses on practical scraping/transformation tasks without pulling in a full DOM parser. Most operations use tolerant regular expressions and simple heuristics so they’re fast, dependency-light, and easy to compose in MuMu flows.
Highlights
- Mapper functions that turn HTML (or node-like values) into strings or structured MuMu values (e.g., text extraction, attributes, metadata).
- Predicate functions for quick filtering (matches, contains_text, has_attr, is_article, domain_is).
- Streaming stages that integrate with Flow-style pipelines (select, split, links, tables, extract_text_stage, from_string).
- URL utilities & metadata: resolve relative links, read canonical URL/meta tags, extract JSON-LD, and detect next-page links.
- Table helpers to coerce simple <table> markup into MuMu arrays.
- No heavy deps: uses regex and once_cell. Compatible with native and wasm32 targets.
Design trade-off: This crate prioritizes good-enough HTML handling with predictable performance. For highly irregular markup, nested/invalid HTML, or complex CSS selectors, a full parser would be more robust.
Data model & inputs
Most functions accept either:
- a raw HTML string, or
- a node-like MuMu value (Value::KeyedArray) with common fields: outer_html, inner_html, text, tag, attrs (keyed map), base_url.
Where needed, functions try to coerce the input to HTML using:
- SingleString(s) → s
- StrArray([s]) → s
- KeyedArray → prefer outer_html / inner_html / text
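The coercion order above can be sketched in Rust as follows. This is an illustration only: the Value enum is a simplified stand-in rather than the real core-mumu type, coerce_to_html is a hypothetical name, and returning None for multi-element string arrays is an assumption.

```rust
use std::collections::BTreeMap;

// Simplified stand-in for the MuMu value type (hypothetical; the real
// core-mumu `Value` has more variants and may use different payloads).
#[derive(Debug, Clone, PartialEq)]
enum Value {
    SingleString(String),
    StrArray(Vec<String>),
    KeyedArray(BTreeMap<String, Value>),
    Placeholder, // stands in for `_`
}

// Try to view a value as HTML, following the documented preference order.
fn coerce_to_html(value: &Value) -> Option<String> {
    match value {
        Value::SingleString(s) => Some(s.clone()),
        // Assumption: only a one-element array is treated as a document.
        Value::StrArray(items) if items.len() == 1 => Some(items[0].clone()),
        // Node-like value: first string field found wins, in this order.
        Value::KeyedArray(fields) => ["outer_html", "inner_html", "text"]
            .iter()
            .find_map(|key| match fields.get(*key) {
                Some(Value::SingleString(s)) => Some(s.clone()),
                _ => None,
            }),
        _ => None,
    }
}
```

In this sketch a node carrying both inner_html and text coerces to its inner_html, because that field comes earlier in the preference order.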
Return conventions:
- Missing value → the MuMu placeholder _ (e.g., an absent attribute or canonical URL).
- Predicates → Bool.
- Collections → StrArray or MixedArray (for 2-D tables).
- Stages → a zero-argument transform function that yields one item per tick and ends with NO_MORE_DATA.
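To make the stage convention concrete, here is a minimal Rust sketch of a zero-argument transform that yields one item per tick and then signals the end of the stream. The Tick enum, the Stage alias, and map_stage are hypothetical names for illustration; the real host represents NO_MORE_DATA in its own way.

```rust
// Hypothetical signal type; not the crate's actual representation.
#[derive(Debug, PartialEq)]
enum Tick {
    Item(String),
    NoMoreData,
}

// A stage is a zero-argument function that is called once per tick.
type Stage = Box<dyn FnMut() -> Tick>;

// One-shot source in the spirit of html:from_string: yields once, then ends.
fn from_string(html: &str) -> Stage {
    let mut pending = Some(html.to_string());
    Box::new(move || match pending.take() {
        Some(item) => Tick::Item(item),
        None => Tick::NoMoreData,
    })
}

// A mapping stage: pulls one upstream item per tick and transforms it.
fn map_stage(mut source: Stage, f: fn(&str) -> String) -> Stage {
    Box::new(move || match source() {
        Tick::Item(item) => Tick::Item(f(&item)),
        Tick::NoMoreData => Tick::NoMoreData,
    })
}
```

Calling the composed stage repeatedly returns the transformed item on the first tick and NoMoreData on every tick after that.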
Selectors (minimal CSS-like)
Supported by select/split/each_attr:
- tag: matches elements by tag name (case-insensitive).
- .class: matches tokens in the class attribute.
- #id: matches exact id.
These are best-effort and regex-based; they do not implement the full CSS spec.
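The three selector forms can be illustrated with a small Rust sketch that tests a single opening tag. It uses plain string scanning instead of the crate's regex engine, handles only double-quoted attributes separated by a space, and all function names are hypothetical.

```rust
// Value of attribute `name` in an opening tag such as
// `<div id="main" class="post featured">` (double-quoted values only).
fn attr_value<'a>(open_tag: &'a str, name: &str) -> Option<&'a str> {
    let needle = format!(" {}=\"", name);
    let start = open_tag.find(&needle)? + needle.len();
    let len = open_tag[start..].find('"')?;
    Some(&open_tag[start..start + len])
}

// Tag name of an opening tag: the run of characters right after `<`.
fn tag_name(open_tag: &str) -> Option<&str> {
    let rest = open_tag.strip_prefix('<')?;
    let end = rest.find(|c: char| c.is_whitespace() || c == '>' || c == '/')?;
    Some(&rest[..end])
}

// Minimal selector check: `.class`, `#id`, or a bare tag name.
fn matches_selector(open_tag: &str, selector: &str) -> bool {
    if let Some(class) = selector.strip_prefix('.') {
        // Class selectors match whole whitespace-separated tokens.
        attr_value(open_tag, "class")
            .map_or(false, |v| v.split_whitespace().any(|tok| tok == class))
    } else if let Some(id) = selector.strip_prefix('#') {
        // Id selectors require an exact match.
        attr_value(open_tag, "id") == Some(id)
    } else {
        // Tag selectors compare names case-insensitively.
        tag_name(open_tag).map_or(false, |t| t.eq_ignore_ascii_case(selector))
    }
}
```

Note that .feat does not match class="post featured" here, since class matching works on whole tokens rather than substrings.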
Function reference
Sources & stages (streaming)
- html:from_string(value) -> transform<string>
  One-shot source that yields the given HTML once, then ends.
- html:select(selector, source) -> transform<string>
  For each upstream item, emits outer HTML of all elements matching selector.
- html:split(selector, source) -> transform<string>
  Same engine as select; intended semantically for document chunking.
- html:each_attr(selector, name, source) -> transform<string>
  For each element matching selector, emits the attribute value name (if present).
- html:links(source) -> transform<string>
  Emits all anchor hrefs found upstream. If a base URL is known, links are resolved to absolute.
- html:tables(source) -> transform<string>
  Emits outer HTML of each <table> found upstream.
- html:extract_text_stage(source) -> transform<string>
  Emits visible text for each upstream item (tags/scripts/styles removed, whitespace normalized).
Partial application & _: Stages support building up arguments iteratively. Supplying _ for a slot defers it.
Mappers & predicates
- html:extract_text(value) -> string
  Visible text (script/style/noscript stripped; tags removed; whitespace collapsed).
  Alias: html:text.
- html:inner_html(value) -> string | _
  Node inner_html if present; else heuristically strip the outer tag; else node text or raw string.
- html:outer_html(value) -> string | _
  Node outer_html if present; else inner_html or text; else returns the given string.
- html:attr(name, value) -> string | _
  Attribute from node attrs[name] or opening tag of a string element.
- html:has_attr(name, value[, expected]) -> bool
  true if the attribute exists (or equals expected if provided).
- html:matches(selector, value) -> bool
  Checks whether value matches a minimal selector (tag / .class / #id).
- html:contains_text(pattern, value) -> bool
  Case-insensitive substring test against visible text.
- html:is_article(value) -> bool
  Heuristic check: prefers <article>…</article> or sufficiently long paragraph blocks.
URLs & metadata
- html:absolute_url(href, base_like) -> string
  Resolve href against base_like, where base can be a URL string, a node with base_url, or HTML containing <base href="…">.
- html:domain_is(domain, value) -> bool
  Extracts a likely URL from value (node base_url, attrs.href, a string URL, or <base href>); compares domains (case-insensitive, strips www.).
- html:canonical_url(value) -> string | _
  Extracts <link rel="canonical" href="…">.
- html:meta(name_or_property, value) -> string | _
  Reads <meta name="…"> or <meta property="…"> and returns its content.
- html:jsonld(value) -> StrArray
  Returns raw JSON strings from <script type="application/ld+json"> blocks.
- html:next_href(value[, base_like]) -> string | _
  Finds “next page” links via common patterns (rel="next", class~="next", or suggestive link text like “next”, “older”, »). Returns an absolute URL if base is known.
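As an illustration of the domain comparison described for html:domain_is, the following Rust sketch lowercases both sides and strips a leading www. before comparing. It only accepts a URL string as input, the function names are hypothetical, and treating subdomains as non-matching is an assumption rather than documented crate behavior.

```rust
// Extract the host part of a URL-like string (scheme optional).
fn host_of(url: &str) -> Option<&str> {
    let rest = url.split_once("://").map_or(url, |(_, r)| r);
    // Authority ends at the first path, query, or fragment delimiter.
    let authority = rest.split(|c: char| c == '/' || c == '?' || c == '#').next()?;
    let host_port = authority.rsplit('@').next()?; // drop any userinfo
    let host = host_port.split(':').next()?; // drop any port
    if host.is_empty() { None } else { Some(host) }
}

// Lowercase the domain and strip a leading "www.".
fn normalize(domain: &str) -> String {
    let lower = domain.to_ascii_lowercase();
    lower.strip_prefix("www.").unwrap_or(&lower).to_string()
}

// Assumption: domains must be equal after normalization; subdomains
// such as blog.example.com do not match example.com in this sketch.
fn domain_is(domain: &str, url: &str) -> bool {
    host_of(url).map_or(false, |h| normalize(h) == normalize(domain))
}
```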
Tables
html:table_to_2d(value) -> MixedArray(StrArray[])
Converts the first table in value into a 2-D structure (rows as StrArray). Designed for simple tables.
Tag stripping
html:strip_tags(value [, allowed]) -> string
With 1 arg: remove all tags from value.
With 2 args: keep only tags listed in allowed (array or comma-separated string).
script/style/noscript are always dropped.
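The allow-list behavior can be sketched in Rust with a simple scan over < and > characters. This is a simplified illustration, not the crate's regex-based implementation: the strip_tags signature below is hypothetical, and unlike the real function it does not drop the contents of script/style/noscript elements.

```rust
// Remove tags from `html`, keeping only tags whose name is in `allowed`.
// Text between tags is copied through unchanged.
fn strip_tags(html: &str, allowed: &[&str]) -> String {
    let mut out = String::new();
    let mut rest = html;
    while let Some(lt) = rest.find('<') {
        out.push_str(&rest[..lt]);
        rest = &rest[lt..];
        let end = match rest.find('>') {
            Some(i) => i + 1,
            None => break, // unterminated "<": keep the remainder as text
        };
        let tag = &rest[..end];
        // Tag name: letters/digits after "<" or "</".
        let name = tag
            .trim_start_matches('<')
            .trim_start_matches('/')
            .split(|c: char| !c.is_ascii_alphanumeric())
            .next()
            .unwrap_or("");
        if allowed.iter().any(|a| a.eq_ignore_ascii_case(name)) {
            out.push_str(tag);
        }
        rest = &rest[end..];
    }
    out.push_str(rest);
    out
}
```

With an empty allow-list every tag is removed; passing ["b"] keeps both the opening and closing b tags while dropping everything else.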
URL resolution details
html:absolute_url and streaming html:links can form absolute URLs using:
- A node’s base_url
- A <base href="…"> in the HTML
- A provided base_like parameter
The resolver supports:
- http/https schemes
- Protocol-relative URLs (//host/path)
- Root-relative (/path) and relative paths with ./.. normalization
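The supported cases can be sketched as a small Rust resolver. This is a minimal illustration under stated assumptions, not the crate's code: resolve is a hypothetical name, the base must be an absolute http or https URL, and query-only references, fragments, and other schemes (such as mailto:) are not handled.

```rust
// Resolve `href` against an absolute http(s) `base` URL.
fn resolve(href: &str, base: &str) -> Option<String> {
    // Already absolute: return unchanged.
    if href.starts_with("http://") || href.starts_with("https://") {
        return Some(href.to_string());
    }
    let (scheme, rest) = base.split_once("://")?;
    if scheme != "http" && scheme != "https" {
        return None;
    }
    // Protocol-relative: reuse the base scheme.
    if let Some(host_path) = href.strip_prefix("//") {
        return Some(format!("{}://{}", scheme, host_path));
    }
    // Split the base into host and path; a bare host has the path "/".
    let (host, base_path) = match rest.find('/') {
        Some(i) => (&rest[..i], &rest[i..]),
        None => (rest, "/"),
    };
    let joined = if href.starts_with('/') {
        href.to_string() // root-relative
    } else {
        // Drop the last base segment, then append the relative reference.
        let dir_end = base_path.rfind('/').map_or(0, |i| i + 1);
        format!("{}{}", &base_path[..dir_end], href)
    };
    // Normalize "." and ".." segments.
    let mut segments: Vec<&str> = Vec::new();
    for seg in joined.split('/').skip(1) {
        match seg {
            "." => {}
            ".." => {
                segments.pop();
            }
            other => segments.push(other),
        }
    }
    // A trailing "." or ".." names a directory, so keep a trailing slash.
    if matches!(joined.rsplit('/').next(), Some(".") | Some("..")) {
        segments.push("");
    }
    Some(format!("{}://{}/{}", scheme, host, segments.join("/")))
}
```

For the inputs "../page/2" and "https://example.com/blog/1", this sketch produces https://example.com/page/2: the last base segment is dropped, the reference is appended, and the .. segment removes blog.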
Error handling & signals
- Arity/type errors → descriptive error strings.
- Missing data → returns the placeholder _ instead of an error (e.g., absent attribute/meta).
- Stages:
  - Yield one item per call.
  - End of stream → NO_MORE_DATA.
  - Non-blocking by design; AGAIN is reserved for future async waits (not used in this crate).
Performance notes
- All selectors and many utilities are regex based (regex = "1.11"), compiled once via once_cell and reused.
- Functions aim to be allocation-aware, but streaming HTML or extremely large documents can still be expensive; prefer narrowing with select/split first, then mapping.
Build targets & integration
- Native (non-wasm): exports a dynamic loader entrypoint named Cargo_lock. Calling extend("html") in Lava/MuMu hosts registers all functions.
- wasm32: does not export Cargo_lock. Call register_all(interp) from your host.
- The crate depends on core-mumu = 0.9.0-rc.3 and selects host/wasm features via target-specific dependency sections.
Feature flags in this crate (host, web) are markers for ecosystem parity; target-specific core-mumu features actually control host vs wasm behavior.
Quick examples
# Visible text from a snippet
html:extract_text("<p>Hello <b>world</b></p>") # → "Hello world"
# Does a node/string contain text?
html:contains_text("privacy", html_value) # → true/false
# Resolve a link against a base
html:absolute_url("../page/2", "https://example.com/blog/1")
# Stream all links (already resolved if base is known)
links = html:links(html:from_string(page_html))
links() # → "https://example.com/a"
links() # → "https://example.com/b"
# ... then NO_MORE_DATA
Directory layout (high level)
- src/share/…: small, pure helpers (selectors, URL utils, text stripping, table parsing, readability-ish, pagination).
- src/register/…: one MuMu function per file, each responsible for registering a single public symbol.
- src/lib.rs: wires up register_all and the Cargo_lock entrypoint for native builds.
Compatibility
- MuMu/Core: designed for core-mumu 0.9.0-rc.3.
- Platforms: native and wasm32 (no crossterm/libloading in wasm).
- Host loaders: dynamic loading via extend("html") on native; call register_all(interp) on wasm/static.
Contributing
Issues and merge requests are welcome at https://gitlab.com/tofo/html-mumu.
Please keep changes small and additive; the crate values predictable behavior and low dependency surface.
Acknowledgements
- Tom Fotheringham and the MuMu/Lava community for design and stewardship across the plugin ecosystem.
- Contributors to core-mumu and related plugins for patterns around dynamic registration and Flow stages.
- The Rust regex and once_cell maintainers for foundational crates used here.
License
Licensed under either of:
- MIT license
- Apache-2.0 license
at your option.
See the repository for the full text of each license.