10 releases
0.1.9 | Jun 13, 2024
0.1.8 | Sep 3, 2023
0.1.6 | Aug 30, 2023
🕷️ crawly
A lightweight and efficient web crawler in Rust, optimized for concurrent scraping while respecting robots.txt rules.
🚀 Features
- Concurrent crawling: Takes advantage of concurrency for efficient scraping across multiple cores;
- Respects robots.txt: Automatically fetches and adheres to website scraping guidelines;
- DFS algorithm: Uses a depth-first search algorithm to crawl web links;
- Customizable with Builder Pattern: Tailor the depth of crawling, rate limits, and other parameters effortlessly;
- Cloudflare detection: If the destination URL is hosted with Cloudflare and a mitigation is detected, the URL is skipped;
- Built with Rust: Guarantees memory safety and top-notch speed.
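To make the depth-first traversal with depth and page limits more concrete, here is a minimal, standard-library-only sketch. It is not crawly's actual implementation: the `dfs_crawl` function and the in-memory link graph are hypothetical stand-ins for real HTTP fetching and link extraction, and the traversal is sequential rather than concurrent.

```rust
use std::collections::{HashMap, HashSet};

/// Depth-first traversal over a hypothetical in-memory link graph
/// (URL -> links found on that page). Each URL is visited at most once,
/// links are followed at most `max_depth` hops away from `start`,
/// and the walk stops once `max_pages` URLs have been collected.
fn dfs_crawl<'a>(
    graph: &HashMap<&'a str, Vec<&'a str>>,
    start: &'a str,
    max_depth: usize,
    max_pages: usize,
) -> Vec<&'a str> {
    let mut visited: HashSet<&'a str> = HashSet::new();
    let mut order: Vec<&'a str> = Vec::new();
    // Explicit stack of (url, depth) pairs instead of recursion.
    let mut stack: Vec<(&'a str, usize)> = vec![(start, 0)];

    while let Some((url, depth)) = stack.pop() {
        if order.len() >= max_pages {
            break;
        }
        // `insert` returns false when the URL was already visited.
        if !visited.insert(url) {
            continue;
        }
        order.push(url);

        if depth < max_depth {
            if let Some(links) = graph.get(url) {
                // Push in reverse so the first link on the page is explored first.
                for &link in links.iter().rev() {
                    if !visited.contains(link) {
                        stack.push((link, depth + 1));
                    }
                }
            }
        }
    }
    order
}
```

With a graph where `a` links to `b` and `c`, and both `b` and `c` link to `d`, calling `dfs_crawl(&graph, "a", 10, 100)` follows `b` down to `d` before coming back to `c`, which is the depth-first order the feature list refers to.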
📦 Installation
Add crawly to your Cargo.toml:
[dependencies]
crawly = "^0.1"
🛠️ Usage
A simple usage example:
use anyhow::Result;
use crawly::Crawler;

#[tokio::main]
async fn main() -> Result<()> {
    // Build a crawler with the default settings.
    let crawler = Crawler::new()?;
    let results = crawler.start("https://example.com").await?;

    // Each entry pairs a crawled URL with the content fetched from it.
    for (url, content) in &results {
        println!("URL: {}\nContent: {}", url, content);
    }

    Ok(())
}
Using the Builder
For more refined control over the crawler's behavior, the CrawlerBuilder comes in handy:
use anyhow::Result;
use crawly::CrawlerBuilder;

#[tokio::main]
async fn main() -> Result<()> {
    let crawler = CrawlerBuilder::new()
        .with_max_depth(10)
        .with_max_pages(100)
        .with_max_concurrent_requests(50)
        .with_rate_limit_wait_seconds(2)
        .with_robots(true)
        .build()?;

    let results = crawler.start("https://www.example.com").await?;

    for (url, content) in &results {
        println!("URL: {}\nContent: {}", url, content);
    }

    Ok(())
}
🛡️ Cloudflare
This crate detects Cloudflare-hosted sites: if the cf-mitigated header is found in a response, the URL is skipped without throwing any error.
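The header check described above can be sketched with the standard library alone. This is an illustration under assumptions, not the crate's internal code: `is_cloudflare_mitigated` and `keep_unmitigated` are hypothetical helper names, and response headers are modeled as a plain `HashMap<String, String>` rather than an HTTP client's header type.

```rust
use std::collections::HashMap;

/// Returns true when the response headers contain `cf-mitigated`.
/// HTTP header names are case-insensitive, so the comparison ignores ASCII case.
fn is_cloudflare_mitigated(headers: &HashMap<String, String>) -> bool {
    headers
        .keys()
        .any(|name| name.eq_ignore_ascii_case("cf-mitigated"))
}

/// Keeps only the URLs whose responses were not mitigated.
/// Mitigated responses are dropped silently instead of producing an error.
fn keep_unmitigated(responses: Vec<(String, HashMap<String, String>)>) -> Vec<String> {
    responses
        .into_iter()
        .filter(|(_, headers)| !is_cloudflare_mitigated(headers))
        .map(|(url, _)| url)
        .collect()
}
```

Filtering rather than returning an `Err` mirrors the behavior stated above: a mitigated page simply does not appear in the results, and the rest of the crawl continues.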
📜 Tracing
Every function is instrumented, and the crate emits DEBUG messages that make the crawling flow easier to follow.
🤝 Contributing
Contributions, issues, and feature requests are welcome!
Feel free to check the issues page. You can also take a look at the contributing guide.
📝 License
This project is MIT licensed.
💌 Contact
- Author: Dario Cancelliere
- Email: dario@ai-chat.it
- Company Website: https://ai-chat.it