TurboScraper
A high-performance, concurrent web scraping framework for Rust, powered by Tokio. TurboScraper provides a robust foundation for building scalable web scrapers with built-in support for retries, storage backends, and concurrent request handling.
Features
- 🚀 High Performance: Built on Tokio for async I/O and concurrent request handling
- 🔄 Smart Retries: Configurable retry mechanisms for both HTTP requests and parsing failures
- 💾 Multiple Storage Backends: Support for MongoDB and filesystem storage
- 🎯 Type-safe: Leverages Rust's type system for reliable data extraction
- 🔧 Configurable: Extensive configuration options for crawling behavior
- 🛡️ Error Handling: Comprehensive error handling and reporting
- 📊 Statistics: Built-in request statistics and performance monitoring
Quick Start
Add TurboScraper to your Cargo.toml:
[dependencies]
turboscraper = { version = "0.1.0" }
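The examples below also rely on Tokio's #[tokio::main] macro and the #[async_trait] attribute. Whether the prelude re-exports async_trait is not stated here, so treat the following as an assumption; if those items aren't already available in your project, add the crates as direct dependencies (version numbers are illustrative):

[dependencies]
turboscraper = { version = "0.1.0" }
tokio = { version = "1", features = ["macros", "rt-multi-thread"] }
async-trait = "0.1"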
Basic Spider Example
Here's a simple spider that scrapes book information:
use turboscraper::prelude::*;

pub struct BookSpider {
    config: SpiderConfig,
    storage: Box<dyn StorageBackend>,
    storage_config: Box<dyn StorageConfig>,
}

#[async_trait]
impl Spider for BookSpider {
    fn name(&self) -> String {
        "book_spider".to_string()
    }

    fn start_urls(&self) -> Vec<Url> {
        vec![Url::parse("https://books.toscrape.com/").unwrap()]
    }

    async fn parse(
        &self,
        response: SpiderResponse,
        url: Url,
        depth: usize,
    ) -> ScraperResult<ParseResult> {
        match response.callback {
            SpiderCallback::Bootstrap => {
                // Parse book list and return new requests
                let new_requests = parse_book_list(&response.body)?;
                Ok(ParseResult::Continue(new_requests))
            }
            SpiderCallback::ParseItem => {
                // Parse and store book details
                self.parse_book_details(response).await?;
                Ok(ParseResult::Skip)
            }
            _ => Ok(ParseResult::Skip),
        }
    }
}
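The helpers parse_book_list and parse_book_details are left out of the example above. Purely as an illustration (this is not TurboScraper API), the link-extraction half of such a helper could use the scraper and url crates; how the resulting URLs are wrapped into the request objects that ParseResult::Continue expects isn't shown in this README, so that step is omitted:

use scraper::{Html, Selector};
use url::Url;

/// Hypothetical helper: collect the detail-page URLs from a listing page.
/// Converting them into TurboScraper requests is intentionally left out,
/// since the request type is not part of this example.
fn extract_book_links(body: &str, base: &Url) -> Vec<Url> {
    let document = Html::parse_document(body);
    // books.toscrape.com lists each book inside `article.product_pod`
    let selector = Selector::parse("article.product_pod h3 a").unwrap();
    document
        .select(&selector)
        .filter_map(|a| a.value().attr("href"))
        .filter_map(|href| base.join(href).ok())
        .collect()
}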
Running the Spider
use turboscraper::prelude::*;
use turboscraper::storage::factory::{create_storage, StorageType};

#[tokio::main]
async fn main() -> ScraperResult<()> {
    // Initialize storage
    let storage = create_storage(StorageType::Disk {
        path: "data/books".to_string(),
    }).await?;

    // Create and configure spider
    let spider = BookSpider::new(storage).await?;
    let config = SpiderConfig::default()
        .with_depth(2)
        .with_concurrency(10);
    let spider = spider.with_config(config);

    // Create crawler and run spider
    let scraper = Box::new(HttpScraper::new());
    let crawler = Crawler::new(scraper);
    crawler.run(spider).await?;

    Ok(())
}
Advanced Features
Retry Configuration
Retry behavior is configured per category: each category gets its own retry limit, delay bounds, trigger conditions, and backoff policy:
let mut retry_config = RetryConfig::default();
retry_config.categories.insert(
    RetryCategory::HttpError,
    CategoryConfig {
        max_retries: 3,
        initial_delay: Duration::from_secs(1),
        max_delay: Duration::from_secs(60),
        conditions: vec![
            RetryCondition::Request(RequestRetryCondition::StatusCode(429)),
        ],
        backoff_policy: BackoffPolicy::Exponential { factor: 2.0 },
    },
);
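With this configuration, a request that keeps returning HTTP 429 (Too Many Requests) is retried up to three times. The sketch below is not TurboScraper code; it only illustrates the delay schedule that an exponential backoff with factor 2.0, a 1 s initial delay, and a 60 s cap typically produces (assuming the factor is applied once per attempt in the usual multiplicative way):

use std::time::Duration;

// Illustrative only: the usual "initial * factor^attempt, capped at max" schedule.
fn backoff_delay(attempt: u32, initial: Duration, factor: f64, max: Duration) -> Duration {
    let scaled = initial.as_secs_f64() * factor.powi(attempt as i32);
    Duration::from_secs_f64(scaled).min(max)
}

fn main() {
    let (initial, max) = (Duration::from_secs(1), Duration::from_secs(60));
    for attempt in 0..3 {
        // attempt 0 -> 1s, attempt 1 -> 2s, attempt 2 -> 4s
        println!("retry {attempt}: wait {:?}", backoff_delay(attempt, initial, 2.0, max));
    }
}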
Storage Backends
TurboScraper supports multiple storage backends:
- MongoDB: For scalable document storage
- Filesystem: For local file storage
- Custom: Implement the StorageBackend trait for custom storage solutions
Error Handling
Comprehensive error handling with custom error types:
match result {
    Ok(ParseResult::Continue(requests)) => { /* handle new requests */ }
    Ok(ParseResult::RetryWithSameContent(response)) => { /* retry parsing */ }
    Err(ScraperError::StorageError(e)) => { /* handle storage errors */ }
    Err(ScraperError::HttpError(e)) => { /* handle HTTP errors */ }
    _ => {}
}
Best Practices
- Respect robots.txt: Always check and respect website crawling policies
- Rate Limiting: Use appropriate delays between requests (see the sketch after this list)
- Error Handling: Implement proper error handling and retries
- Data Validation: Validate scraped data before storage
- Resource Management: Monitor memory and connection usage
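On the rate-limiting point: whether the crawler configuration exposes a per-request delay is not covered in this README, but in any Tokio program you can space out work with tokio::time::interval. A minimal, framework-agnostic sketch (the URLs and the 500 ms spacing are placeholders):

use std::time::Duration;
use tokio::time::{interval, MissedTickBehavior};

#[tokio::main]
async fn main() {
    // Allow at most one request every 500 ms.
    let mut ticker = interval(Duration::from_millis(500));
    ticker.set_missed_tick_behavior(MissedTickBehavior::Delay);

    for url in ["https://example.com/page/1", "https://example.com/page/2"] {
        ticker.tick().await;
        // Issue the request for `url` here.
        println!("fetching {url}");
    }
}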
Contributing
Contributions are welcome! Please feel free to submit pull requests.
License
This project is licensed under the MIT License - see the LICENSE file for details.