#html-parser #markup #web #parser #html

nightly oak-html

HTML markup language parser with support for web content and document structure processing

6 releases

Uses new Rust 2024

new 0.0.5 Mar 3, 2026
0.0.4 Feb 18, 2026
0.0.1 Jan 23, 2026
0.0.0 Oct 21, 2025

#2782 in Parser implementations

MPL-2.0 license

385KB
6.5K SLoC

🚀 Oak HTML Parser


Structuring the Web with Precision — A high-performance, incremental HTML parser built on the Oak framework. Optimized for web scraping, static analysis, and modern IDE support for web development.

🎯 Project Vision

HTML is the backbone of the web, and much of its complexity comes from its flexibility and from real-world "tag soup." oak-html aims to provide a robust, high-performance parser that handles modern HTML5 with industrial-grade reliability. Using Oak's incremental parsing capabilities, it enables highly responsive web development tools, from real-time preview engines to intelligent code refactoring tools.

✨ Core Features

  • ⚡ Blazing Fast: Leverages Rust's performance to deliver sub-millisecond parsing, essential for real-time web development tools and large-scale web analysis.
  • 🔄 Incremental Parsing: Built-in support for partial updates—re-parse only the sections of the HTML that changed, significantly improving performance for complex web pages.
  • 🌳 High-Fidelity AST: Generates a detailed and easy-to-traverse Abstract Syntax Tree capturing:
    • Elements, Attributes, and nested structures
    • Comments, Doctype declarations, and Text nodes
    • Modern HTML5 features
  • 🛡️ Industrial-Grade Error Recovery: Engineered to handle malformed or "tag soup" HTML gracefully, providing precise diagnostics while maintaining a valid tree structure.
  • 🧩 Ecosystem Integration: Part of the Oak family—easily integrate with oak-lsp for full LSP support or other Oak-based web analysis utilities.
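
To make the error-recovery feature above more concrete, here is a minimal, self-contained sketch of one common way to keep a valid tree when tags are left open or closed out of order. It does not use oak-html and is not its implementation: `Node`, `Diagnostic`, `parse_tag_soup`, and the helper functions are hypothetical names chosen for illustration, and the toy grammar ignores attributes, comments, and doctypes.

```rust
/// Hypothetical tree node for this sketch (not an oak-html type).
#[derive(Debug, PartialEq)]
enum Node {
    Element { name: String, children: Vec<Node> },
    Text(String),
}

/// Hypothetical diagnostic: a byte offset plus a human-readable message.
#[derive(Debug, PartialEq)]
struct Diagnostic {
    offset: usize,
    message: String,
}

type Open = (String, Vec<Node>);

/// Pop the innermost open element and attach it to its parent.
fn pop_into_parent(stack: &mut Vec<Open>) {
    let (name, children) = stack.pop().expect("stack holds at least the root");
    let parent = stack.last_mut().expect("root stays on the stack");
    parent.1.push(Node::Element { name, children });
}

/// Handle `</name>`: close up to the nearest matching open element,
/// or report a stray closing tag and ignore it.
fn close(stack: &mut Vec<Open>, name: &str, offset: usize, diags: &mut Vec<Diagnostic>) {
    // Skip index 0, the synthetic root, so it can never be closed.
    let found = stack[1..].iter().rposition(|(open, _)| open.as_str() == name);
    match found {
        Some(i) => {
            let target = i + 1;
            // Anything above the match was never closed explicitly.
            while stack.len() - 1 > target {
                let unclosed = stack.last().unwrap().0.clone();
                diags.push(Diagnostic { offset, message: format!("implicitly closed <{unclosed}>") });
                pop_into_parent(stack);
            }
            pop_into_parent(stack);
        }
        None => diags.push(Diagnostic { offset, message: format!("stray </{name}>") }),
    }
}

/// Parse a toy subset of HTML (`<name>`, `</name>`, text) and always
/// return a tree, together with diagnostics for anything malformed.
fn parse_tag_soup(src: &str) -> (Vec<Node>, Vec<Diagnostic>) {
    let mut stack: Vec<Open> = vec![(String::new(), Vec::new())];
    let mut diags = Vec::new();
    let mut pos = 0;
    while pos < src.len() {
        let rest = &src[pos..];
        if rest.starts_with('<') {
            match rest.find('>') {
                Some(end) => {
                    let tag = &rest[1..end];
                    match tag.strip_prefix('/') {
                        Some(name) => close(&mut stack, name, pos, &mut diags),
                        None => stack.push((tag.to_string(), Vec::new())),
                    }
                    pos += end + 1;
                }
                None => {
                    // No closing '>': keep the remainder as text.
                    diags.push(Diagnostic { offset: pos, message: "unterminated tag".to_string() });
                    stack.last_mut().unwrap().1.push(Node::Text(rest.to_string()));
                    pos = src.len();
                }
            }
        } else {
            let end = rest.find('<').unwrap_or(rest.len());
            stack.last_mut().unwrap().1.push(Node::Text(rest[..end].to_string()));
            pos += end;
        }
    }
    // Close whatever is still open at end of input.
    while stack.len() > 1 {
        let unclosed = stack.last().unwrap().0.clone();
        diags.push(Diagnostic { offset: src.len(), message: format!("unclosed <{unclosed}>") });
        pop_into_parent(&mut stack);
    }
    (stack.pop().unwrap().1, diags)
}
```

Calling `parse_tag_soup("<div><p>hi</div>")` still yields a `div` element containing a `p` element, plus one diagnostic noting that `<p>` was closed implicitly. A production parser would follow the HTML5 tree-construction rules rather than this simplified stack discipline.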

🏗️ Architecture

The parser follows the Green/Red Tree architecture (inspired by Roslyn), which allows for:

  1. Efficient Immutability: Share nodes across different versions of the tree without copying.
  2. Lossless Syntax Trees: Retains all trivia (whitespace and comments), enabling faithful code formatting and refactoring.
  3. Type Safety: Strongly typed "Red" nodes provide a convenient and safe API for tree traversal and analysis.
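
The three properties above can be illustrated with a small, self-contained sketch of the general green/red idea. This is not oak-html's or Oak's actual API: `Green`, `Red`, and `replace_child` are hypothetical names, and node kinds are plain strings here to keep the example short, whereas a real implementation would use typed kinds.

```rust
use std::rc::Rc;

/// Immutable "green" node: stores its kind and text, but no parent pointer
/// and no absolute position, so it can be shared between tree versions.
#[derive(Debug)]
enum Green {
    Token { kind: &'static str, text: String },
    Node { kind: &'static str, children: Vec<Rc<Green>>, width: usize },
}

impl Green {
    fn token(kind: &'static str, text: &str) -> Rc<Green> {
        Rc::new(Green::Token { kind, text: text.to_string() })
    }

    fn node(kind: &'static str, children: Vec<Rc<Green>>) -> Rc<Green> {
        // A node's width is the total byte length of its children.
        let width = children.iter().map(|c| c.width()).sum();
        Rc::new(Green::Node { kind, children, width })
    }

    fn kind(&self) -> &'static str {
        match self {
            Green::Token { kind, .. } | Green::Node { kind, .. } => *kind,
        }
    }

    fn width(&self) -> usize {
        match self {
            Green::Token { text, .. } => text.len(),
            Green::Node { width, .. } => *width,
        }
    }

    /// Lossless: concatenating every token reproduces the source text,
    /// including whitespace and comments stored as ordinary tokens.
    fn write_text(&self, out: &mut String) {
        match self {
            Green::Token { text, .. } => out.push_str(text),
            Green::Node { children, .. } => {
                for child in children {
                    child.write_text(out);
                }
            }
        }
    }
}

/// "Red" view: a green node plus its absolute offset, computed on demand.
struct Red {
    green: Rc<Green>,
    offset: usize,
}

impl Red {
    fn root(green: Rc<Green>) -> Red {
        Red { green, offset: 0 }
    }

    fn children(&self) -> Vec<Red> {
        let mut offset = self.offset;
        let mut out = Vec::new();
        if let Green::Node { children, .. } = &*self.green {
            for child in children {
                out.push(Red { green: Rc::clone(child), offset });
                offset += child.width();
            }
        }
        out
    }
}

/// Build a new root with child `index` replaced. The untouched children are
/// shared with the old root via `Rc`, not copied. Panics if `index` is out
/// of range.
fn replace_child(root: &Rc<Green>, index: usize, new_child: Rc<Green>) -> Rc<Green> {
    match &**root {
        Green::Node { kind, children, .. } => {
            let mut next = children.clone(); // clones the Rc handles only
            next[index] = new_child;
            Green::node(*kind, next)
        }
        Green::Token { .. } => Rc::clone(root),
    }
}
```

In this sketch, an edit produces a new root while every unchanged subtree remains the same allocation in both versions, and absolute offsets are derived only when a `Red` view is created. That separation is what lets an incremental parser reuse green nodes after an edit.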

🤝 Contributing

We welcome contributions of all kinds! If you find a bug, have a feature request, or want to contribute code, please check our issues or submit a pull request.

Dependencies

~5–10MB
~105K SLoC