crawly

A lightweight async web crawler in Rust, optimized for concurrent scraping while respecting robots.txt rules

9 releases

0.1.8 Sep 3, 2023
0.1.7 Sep 3, 2023
0.1.6 Aug 30, 2023

🕷️ crawly

A lightweight and efficient web crawler in Rust, optimized for concurrent scraping while respecting robots.txt rules.

🚀 Features

  • Concurrent crawling: Takes advantage of concurrency for efficient scraping across multiple cores;
  • Respects robots.txt: Automatically fetches and adheres to website scraping guidelines;
  • DFS algorithm: Uses a depth-first search algorithm to crawl web links;
  • Customizable with Builder Pattern: Tailor the depth of crawling, rate limits, and other parameters effortlessly;
  • Cloudflare detection: If the destination URL is served through Cloudflare and a mitigation is detected, the URL is skipped;
  • Built with Rust: Memory-safe by design, with the performance of native code.
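
To illustrate how a depth-first crawl interacts with the depth and page limits mentioned above, here is a minimal sketch over an in-memory link graph. It is not crawly's implementation: `LinkGraph`, `sample_graph`, and `dfs_crawl` are hypothetical names invented for this example, and no network requests are made.

```rust
use std::collections::{HashMap, HashSet};

/// Hypothetical in-memory "web": each URL maps to the links found on that page.
type LinkGraph = HashMap<&'static str, Vec<&'static str>>;

/// Small sample graph, including a link back to the start page (a cycle).
fn sample_graph() -> LinkGraph {
    HashMap::from([
        ("a", vec!["b", "c"]),
        ("b", vec!["d"]),
        ("c", vec![]),
        ("d", vec!["e", "a"]),
    ])
}

/// Depth-first crawl from `start`. Visits at most `max_pages` pages and
/// follows links at most `max_depth` hops away from `start`.
fn dfs_crawl(
    graph: &LinkGraph,
    start: &'static str,
    max_depth: usize,
    max_pages: usize,
) -> Vec<&'static str> {
    let mut visited: HashSet<&'static str> = HashSet::new();
    let mut order = Vec::new();
    // Explicit stack of (url, depth); the most recently pushed entry is explored first.
    let mut stack = vec![(start, 0usize)];

    while let Some((url, depth)) = stack.pop() {
        if order.len() >= max_pages {
            break; // page budget exhausted
        }
        if !visited.insert(url) {
            continue; // already crawled
        }
        order.push(url);
        if depth == max_depth {
            continue; // do not follow links past the depth limit
        }
        if let Some(links) = graph.get(url) {
            // Push in reverse so that links are explored in page order.
            for &link in links.iter().rev() {
                if !visited.contains(link) {
                    stack.push((link, depth + 1));
                }
            }
        }
    }
    order
}

fn main() {
    let graph = sample_graph();
    println!("{:?}", dfs_crawl(&graph, "a", 10, 100));
}
```

The visited set is what keeps the cycle from `d` back to `a` from looping forever; a real crawler would additionally fetch pages concurrently and consult robots.txt before each request.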

📦 Installation

Add crawly to your Cargo.toml:

[dependencies]
crawly = "^0.1"

🛠️ Usage

A simple usage example:

use anyhow::Result;
use crawly::Crawler;

#[tokio::main]
async fn main() -> Result<()> {
    let crawler = Crawler::new()?;
    let results = crawler.crawl_url("https://example.com").await?;

    for (url, content) in &results {
        println!("URL: {}\nContent: {}", url, content);
    }

    Ok(())
}

Using the Builder

For more refined control over the crawler's behavior, the CrawlerBuilder comes in handy:

use anyhow::Result;
use crawly::CrawlerBuilder;

#[tokio::main]
async fn main() -> Result<()> {
    let crawler = CrawlerBuilder::new()
        .with_max_depth(10)
        .with_max_pages(100)
        .with_max_concurrent_requests(50)
        .with_rate_limit_wait_seconds(2)
        .with_robots(true)
        .build()?;
    
    let results = crawler.crawl_url("https://www.example.com").await?;

    for (url, content) in &results {
        println!("URL: {}\nContent: {}", url, content);
    }

    Ok(())
}
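
For readers wondering what honoring robots.txt involves when `with_robots(true)` is set, the following is a simplified sketch of a `Disallow` check. It is an assumption-laden illustration, not the crate's code: `disallowed_prefixes` and `is_allowed` are hypothetical helpers, only `User-agent: *` groups are considered, each `User-agent` line is treated as starting a new group, and `Allow` rules and wildcards are ignored.

```rust
/// Collect the `Disallow` path prefixes that apply to all user agents (`*`).
fn disallowed_prefixes(robots_txt: &str) -> Vec<String> {
    let mut rules = Vec::new();
    let mut applies = false; // true while inside a `User-agent: *` group

    for line in robots_txt.lines() {
        // Strip trailing comments and surrounding whitespace.
        let line = line.split('#').next().unwrap_or("").trim();
        if let Some((key, value)) = line.split_once(':') {
            let value = value.trim();
            match key.trim().to_ascii_lowercase().as_str() {
                "user-agent" => applies = value == "*",
                // An empty `Disallow:` means "nothing is disallowed", so skip it.
                "disallow" if applies && !value.is_empty() => rules.push(value.to_string()),
                _ => {}
            }
        }
    }
    rules
}

/// A path is allowed when it matches none of the disallowed prefixes.
fn is_allowed(path: &str, rules: &[String]) -> bool {
    !rules.iter().any(|prefix| path.starts_with(prefix.as_str()))
}

fn main() {
    let robots = "User-agent: *\nDisallow: /private/\nDisallow: /tmp # scratch\n\nUser-agent: otherbot\nDisallow: /\n";
    let rules = disallowed_prefixes(robots);
    for path in ["/index.html", "/private/data", "/tmp/x"] {
        println!("{} allowed: {}", path, is_allowed(path, &rules));
    }
}
```

A production parser would also handle `Allow` precedence, pattern matching, and grouped user agents.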

🛡️ Cloudflare

This crate detects Cloudflare-hosted sites. If a response carries the cf-mitigated header, the URL is skipped without raising an error.
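
The skip logic could look roughly like the sketch below, which checks response headers for `cf-mitigated` and silently drops matching pages. This is only an illustration under assumptions: the `Response` struct, `sample_responses`, and `keep_unmitigated` are hypothetical, headers are modeled as plain string pairs rather than a real HTTP client type, and only the presence of the header is tested, not its value.

```rust
/// Hypothetical, simplified HTTP response used only for this example.
struct Response {
    url: &'static str,
    headers: Vec<(&'static str, &'static str)>,
    body: &'static str,
}

/// Returns true when the headers contain `cf-mitigated` (case-insensitive name match).
fn is_cloudflare_mitigated(headers: &[(&str, &str)]) -> bool {
    headers
        .iter()
        .any(|(name, _)| name.eq_ignore_ascii_case("cf-mitigated"))
}

/// Keep (url, body) pairs for responses that were not mitigated; mitigated
/// ones are dropped without producing an error.
fn keep_unmitigated(responses: Vec<Response>) -> Vec<(&'static str, &'static str)> {
    responses
        .into_iter()
        .filter(|r| !is_cloudflare_mitigated(&r.headers))
        .map(|r| (r.url, r.body))
        .collect()
}

/// Two sample responses: one normal page and one Cloudflare challenge page.
fn sample_responses() -> Vec<Response> {
    vec![
        Response {
            url: "https://example.com/",
            headers: vec![("content-type", "text/html")],
            body: "<html>home</html>",
        },
        Response {
            url: "https://example.com/protected",
            headers: vec![("content-type", "text/html"), ("CF-Mitigated", "challenge")],
            body: "<html>challenge</html>",
        },
    ]
}

fn main() {
    for (url, body) in keep_unmitigated(sample_responses()) {
        println!("URL: {}\nContent: {}", url, body);
    }
}
```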

📜 Tracing

Every function is instrumented, and the crate emits DEBUG-level messages that make the crawling flow easier to follow.

🤝 Contributing

Contributions, issues, and feature requests are welcome!

Feel free to check the issues page. You can also take a look at the contributing guide.

📝 License

This project is MIT licensed.
