Spider

Multithreaded Web spider crawler written in Rust.

Dependencies

On Debian-based systems, install OpenSSL and its development headers:

$ apt install openssl libssl-dev

Usage

Add this dependency to your Cargo.toml file:

[dependencies]
spider = "1.2.1"

You'll then be able to use the library. Here's a simple example:

extern crate spider;

use spider::website::Website;

fn main() {
    // Start from the root URL and crawl every internal page.
    let mut website: Website = Website::new("https://choosealicense.com");
    website.crawl();

    // List every URL that was visited.
    for page in website.get_pages() {
        println!("- {}", page.get_url());
    }
}

Once the crawl finishes, get_pages() returns every page that was visited. You can use the Configuration object to configure your crawler:

// ..
let mut website: Website = Website::new("https://choosealicense.com");
// Skip every URL under this prefix.
website.configuration.blacklist_url.push("https://choosealicense.com/licenses/".to_string());
// Honor the site's robots.txt rules.
website.configuration.respect_robots_txt = true;
// Log each page as it is crawled.
website.configuration.verbose = true;
// Wait 2000 ms between requests (polite delay).
website.configuration.delay = 2000;
website.crawl();
// ..

TODO

  • multi-threaded system
  • respect robots.txt file
  • add configuration object for polite delay, etc.
  • add polite delay
  • parse command line arguments (see the sketch below)
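
Command-line parsing is still open. As a rough sketch, and not something the crate ships today, a wrapper binary could read the start URL from std::env::args and fall back to a default when no argument is given:

extern crate spider;

use std::env;
use spider::website::Website;

fn main() {
    // Hypothetical wrapper: take the start URL as the first CLI argument,
    // falling back to a default when none is given.
    let url = env::args()
        .nth(1)
        .unwrap_or_else(|| "https://choosealicense.com".to_string());

    let mut website: Website = Website::new(&url);
    website.crawl();

    for page in website.get_pages() {
        println!("- {}", page.get_url());
    }
}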

Contribute

I am open to any contribution. Just fork the repository and commit on a separate branch.
