#web-scraping #scraping #spider #crawler #web #async #http-request

bin+lib turboscraper

1 unstable release

0.1.0 Dec 29, 2024

MIT license

90KB
2.5K SLoC

TurboScraper

A high-performance, concurrent web scraping framework for Rust, powered by Tokio. TurboScraper provides a robust foundation for building scalable web scrapers with built-in support for retries, storage backends, and concurrent request handling.

Features

  • 🚀 High Performance: Built on Tokio for async I/O and concurrent request handling
  • 🔄 Smart Retries: Configurable retry mechanisms for both HTTP requests and parsing failures
  • 💾 Multiple Storage Backends: Support for MongoDB and filesystem storage
  • 🎯 Type-safe: Leverages Rust's type system for reliable data extraction
  • 🔧 Configurable: Extensive configuration options for crawling behavior
  • 🛡️ Error Handling: Comprehensive error handling and reporting
  • 📊 Statistics: Built-in request statistics and performance monitoring

Quick Start

Add TurboScraper to your Cargo.toml:

[dependencies]
turboscraper = { version = "0.1.0" }

Basic Spider Example

Here's a simple spider that scrapes book information:

use turboscraper::prelude::*;

pub struct BookSpider {
    config: SpiderConfig,
    storage: Box<dyn StorageBackend>,
    storage_config: Box<dyn StorageConfig>,
}

#[async_trait]
impl Spider for BookSpider {
    fn name(&self) -> String {
        "book_spider".to_string()
    }

    fn start_urls(&self) -> Vec<Url> {
        vec![Url::parse("https://books.toscrape.com/").unwrap()]
    }

    async fn parse(
        &self,
        response: SpiderResponse,
        url: Url,
        depth: usize,
    ) -> ScraperResult<ParseResult> {
        match response.callback {
            SpiderCallback::Bootstrap => {
                // Parse book list and return new requests
                let new_requests = parse_book_list(&response.body)?;
                Ok(ParseResult::Continue(new_requests))
            }
            SpiderCallback::ParseItem => {
                // Parse and store book details
                self.parse_book_details(response).await?;
                Ok(ParseResult::Skip)
            }
            _ => Ok(ParseResult::Skip),
        }
    }
}
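
The parse_book_list and parse_book_details helpers above are your own code, not part of the framework. Below is one possible sketch of parse_book_list using the scraper crate for HTML parsing (an extra dependency, not pulled in by TurboScraper as far as this README shows). The exact type that ParseResult::Continue expects isn't stated here, so the sketch simply collects absolute URLs and assumes the response body is available as a string; adapt the return type to whatever your spider actually emits.

use scraper::{Html, Selector};
use turboscraper::prelude::*;

// Sketch only: the selectors target the books.toscrape.com demo site, and the
// return type (Vec<Url>) is an assumption; the framework may expect its own
// request type inside ParseResult::Continue.
fn parse_book_list(body: &str) -> ScraperResult<Vec<Url>> {
    let base = Url::parse("https://books.toscrape.com/").unwrap();
    let document = Html::parse_document(body);
    let selector = Selector::parse("article.product_pod h3 a").unwrap();
    let urls = document
        .select(&selector)
        .filter_map(|a| a.value().attr("href"))
        .filter_map(|href| base.join(href).ok())
        .collect();
    Ok(urls)
}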

Running the Spider

use turboscraper::storage::factory::{create_storage, StorageType};

#[tokio::main]
async fn main() -> ScraperResult<()> {
    // Initialize storage
    let storage = create_storage(StorageType::Disk {
        path: "data/books".to_string(),
    }).await?;

    // Create and configure spider
    let spider = BookSpider::new(storage).await?;
    let config = SpiderConfig::default()
        .with_depth(2)
        .with_concurrency(10);
    let spider = spider.with_config(config);

    // Create crawler and run spider
    let scraper = Box::new(HttpScraper::new());
    let crawler = Crawler::new(scraper);
    crawler.run(spider).await?;
    Ok(())
}

Advanced Features

Retry Configuration

Retries are configured per failure category, each with its own retry limit, delay bounds, backoff policy, and trigger conditions:

let mut retry_config = RetryConfig::default();
retry_config.categories.insert(
    RetryCategory::HttpError,
    CategoryConfig {
        max_retries: 3,
        initial_delay: Duration::from_secs(1),
        max_delay: Duration::from_secs(60),
        conditions: vec![
            RetryCondition::Request(RequestRetryCondition::StatusCode(429)),
        ],
        backoff_policy: BackoffPolicy::Exponential { factor: 2.0 },
    },
);
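
With the settings above, an exponential policy with factor 2.0 roughly doubles the wait between attempts, starting from initial_delay and capping at max_delay. The arithmetic below is a generic illustration of that schedule, not TurboScraper's internal code; the library's exact formula (for example, whether it adds jitter) may differ.

use std::time::Duration;

// Generic exponential backoff arithmetic: delay(n) = min(initial * factor^n, max).
// Illustration only; not a TurboScraper API.
fn nth_retry_delay(initial: Duration, factor: f64, max: Duration, attempt: u32) -> Duration {
    initial.mul_f64(factor.powi(attempt as i32)).min(max)
}

fn main() {
    let (initial, max) = (Duration::from_secs(1), Duration::from_secs(60));
    for attempt in 0..4 {
        // With factor 2.0 this prints 1s, 2s, 4s, 8s; later attempts cap at 60s.
        println!("retry {attempt}: wait {:?}", nth_retry_delay(initial, 2.0, max, attempt));
    }
}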

Storage Backends

TurboScraper supports multiple storage backends:

  • MongoDB: For scalable document storage
  • Filesystem: For local file storage
  • Custom: Implement the StorageBackend trait for custom storage solutions

Error Handling

Comprehensive error handling with custom error types:

match result {
    Ok(ParseResult::Continue(requests)) => { /* handle new requests */ }
    Ok(ParseResult::RetryWithSameContent(response)) => { /* retry parsing */ }
    Err(ScraperError::StorageError(e)) => { /* handle storage errors */ }
    Err(ScraperError::HttpError(e)) => { /* handle HTTP errors */ }
    _ => { /* other outcomes */ }
}

Best Practices

  1. Respect Robots.txt: Always check and respect website crawling policies
  2. Rate Limiting: Use appropriate delays between requests (see the sketch after this list)
  3. Error Handling: Implement proper error handling and retries
  4. Data Validation: Validate scraped data before storage
  5. Resource Management: Monitor memory and connection usage
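
For rate limiting (item 2), this README only shows with_depth and with_concurrency on SpiderConfig; if your version has no built-in delay option, a simple stopgap is to pause inside parse before returning new requests. A minimal sketch using Tokio's timer (plain Tokio, not a TurboScraper API):

use std::time::Duration;
use tokio::time::sleep;

// Call at the top of Spider::parse (or before emitting new requests) to keep
// the request rate polite; 500 ms is an arbitrary example value.
async fn polite_delay() {
    sleep(Duration::from_millis(500)).await;
}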

Contributing

Contributions are welcome! Please feel free to submit pull requests.

License

This project is licensed under the MIT License - see the LICENSE file for details.
