#html #css #scraper #web

macro crabler_derive

Derive macro for crabler library

9 releases

0.1.8 Jan 8, 2022
0.1.7 Dec 22, 2021
0.1.6 Dec 28, 2020
0.1.5 Feb 27, 2020

#518 in Web programming


130 downloads per month
Used in crabler

MIT license

7KB
153 lines

Crabler - Web crawler for Crabs


Asynchronous web scraper engine written in Rust.
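
To pull the library into a project (the crate on this page is its derive macro, which crabler re-exports), a minimal Cargo.toml sketch, with illustrative version numbers:

[dependencies]
# version is illustrative; check crates.io for the current release
crabler = "0.1"
# the "attributes" feature provides #[async_std::main] used in the example below
async-std = { version = "1", features = ["attributes"] }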

Features:

  • fully based on async-std
  • derive-macro-based API
  • struct-based API
  • stateful scraper (structs can hold state; see the sketch after this list)
  • ability to download files
  • ability to schedule navigation jobs asynchronously
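
The example below uses an empty struct, so here is a minimal sketch of a stateful scraper. It assumes the same derive attributes and handler signature shown in the example; the struct name, field, and selector are illustrative, not part of the crabler API.

use crabler::*;
use std::sync::Arc;
use std::sync::atomic::{AtomicUsize, Ordering};

#[derive(WebScraper)]
#[on_html("a[href]", count_links)]
struct LinkCounter {
    // handlers only receive &self, so mutable state needs interior mutability;
    // Arc keeps the count shared even if the scraper is cloned per worker
    seen: Arc<AtomicUsize>,
}

impl LinkCounter {
    async fn count_links(&self, response: Response, link: Element) -> Result<()> {
        if let Some(href) = link.attr("href") {
            let n = self.seen.fetch_add(1, Ordering::Relaxed) + 1;
            println!("link #{} on {}: {}", n, response.url, href);
        }
        Ok(())
    }
}

State initialized in main (e.g. LinkCounter { seen: Arc::new(AtomicUsize::new(0)) }) is then visible to every handler invocation for the lifetime of the run.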

Example

extern crate crabler;

use crabler::*;
use std::path::Path;

#[derive(WebScraper)]
#[on_response(response_handler)]
#[on_html("a[href]", walk_handler)]
struct Scraper {}

impl Scraper {
    async fn response_handler(&self, response: Response) -> Result<()> {
        if response.url.ends_with(".jpg") && response.status == 200 {
            println!("Finished downloading {} -> {}", response.url, response.download_destination);
        }
        Ok(())
    }

    async fn walk_handler(&self, response: Response, a: Element) -> Result<()> {
        if let Some(href) = a.attr("href") {
            // attempt to download an image
            if href.ends_with(".jpg") {
                let p = Path::new("/tmp").join("image.jpg");
                let destination = p.to_string_lossy().to_string();

                if !p.exists() {
                    println!("Downloading {}", destination);
                    // schedule the crawler to download the file to the destination;
                    // the download happens in the background, awaiting here only queues the job
                    response.download_file(href, destination).await?;
                } else {
                    println!("Skipping existing file {}", destination);
                }
            } else {
                // or schedule crawler to navigate to a given url
                response.navigate(href).await?;
            }
        }

        Ok(())
    }
}

#[async_std::main]
async fn main() -> Result<()> {
    let scraper = Scraper {};

    // Run the scraper starting from the given URL, using 20 worker threads
    scraper.run(Opts::new().with_urls(vec!["https://www.rust-lang.org/"]).with_threads(20)).await
}

Sample project

Gonzih/apod-nasa-scraper-rs

Dependencies

~0.4–0.8MB
~19K SLoC