#web-crawler #transformation

spider_transformations

Transformation utils to use for spider

731 stable releases

new 2.37.112 Dec 16, 2025
2.37.109 Jul 8, 2025
2.36.7 Mar 31, 2025
2.23.3 Dec 31, 2024
0.0.3 Sep 21, 2024

#1970 in Command line utilities

Download history (weekly, 2025-08-25 to 2025-12-08): between 57 and 488 downloads per week, peaking in the week of 2025-12-01.

1,325 downloads per month
Used in search_for_llms

MIT license

210KB
5K SLoC

spider_transformations

A high-performance transformation library for Rust, used by Spider Cloud for AI-powered content cleaning across multiple locales.

This project depends on the spider crate.

Usage

[dependencies]
spider_transformations = "2"

use spider_transformations::transformation::content;

fn main() {
    // page comes from the spider object when streaming.
    let mut conf = content::TransformConfig::default();
    conf.return_format = content::ReturnFormat::Markdown;
    let content = content::transform_content(&page, &conf, &None, &None);
}

Transform types

  1. Markdown
  2. Commonmark
  3. Text
  4. Markdown (Text Map) or HTML2Text
  5. WIP: HTML2XML

Enhancements

  1. Readability
  2. Encoding

Chunking

Several chunking utilities are available in the transformation module.
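To give a rough idea of what chunking transformed content means, here is a minimal standalone sketch using only the standard library. It is not the crate's API: the function name chunk_by_words and its parameters are hypothetical, chosen for illustration. It splits text into pieces of at most max_chars characters without breaking words.

```rust
/// Hypothetical helper (not part of spider_transformations): split `text`
/// into chunks of at most `max_chars` characters, breaking only on whitespace.
/// A single word longer than `max_chars` is kept whole as its own chunk.
fn chunk_by_words(text: &str, max_chars: usize) -> Vec<String> {
    let mut chunks = Vec::new();
    let mut current = String::new();
    let mut current_len = 0usize; // length in chars, not bytes

    for word in text.split_whitespace() {
        let word_len = word.chars().count();
        // The +1 accounts for the joining space when the chunk is non-empty.
        let needed = if current.is_empty() {
            word_len
        } else {
            current_len + 1 + word_len
        };
        // Close the current chunk if adding this word would exceed the limit.
        if !current.is_empty() && needed > max_chars {
            chunks.push(std::mem::take(&mut current));
            current_len = 0;
        }
        if !current.is_empty() {
            current.push(' ');
            current_len += 1;
        }
        current.push_str(word);
        current_len += word_len;
    }
    if !current.is_empty() {
        chunks.push(current);
    }
    chunks
}

fn main() {
    // Stand-in for text produced by a transformation step.
    let text = "Spider crawls pages and this crate turns them into clean text.";
    for (i, chunk) in chunk_by_words(text, 24).iter().enumerate() {
        println!("chunk {i}: {chunk}");
    }
}
```

Splitting on whitespace keeps the sketch short; the crate's own chunking utilities may use different boundaries or units, so check the transformation module for the actual functions.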

This project includes rewritten forks of html2md and html2text, made for performance and bug fixes.

License

MIT

Dependencies

~24–46MB
~740K SLoC