Spider CLI

A fast command line spider or crawler.

Dependencies

On Linux

  • OpenSSL 1.0.1, 1.0.2, 1.1.0, or 1.1.1
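
On Debian or Ubuntu, for example, the OpenSSL development headers are typically provided by the libssl-dev package (package names vary by distribution):

sudo apt install libssl-dev pkg-config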

Usage

The CLI is a binary, so do not add it to your Cargo.toml file. Install it with:

cargo install spider_cli
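
After installing, verify that the spider binary is on your PATH:

spider --version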

CLI

The following commands can be run from the command line to start the crawler. If you need logging, pass in the -v flag.

spider -v --url https://choosealicense.com crawl

Crawl and output all visited links to a file.

spider --url https://choosealicense.com crawl -o > spider_choosealicense.json
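
Because -o writes the visited links to standard output, the result can also be piped to other tools. For example, assuming the links are emitted one per line, count the crawled pages with:

spider --url https://choosealicense.com crawl -o | wc -l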

Download all HTML to a local destination. Use the -t option to pass in the target destination folder.

spider --url https://choosealicense.com download -t _temp_spider_downloads
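
Once the crawl finishes, the downloaded markup can be inspected in the target folder:

ls _temp_spider_downloads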

Set a crawl budget and limit the entire crawl to one page.

spider --url https://choosealicense.com --budget "*,1" crawl -o

Set a crawl budget that allows only 10 pages matching the /blog/ path and limits all pages to 100.

spider --url https://choosealicense.com --budget "*,100,/blog/,10" crawl -o

The fastest web crawler CLI written in Rust.

Usage: spider [OPTIONS] --url <DOMAIN> [COMMAND]

Commands:
  crawl     Crawl the website extracting links
  scrape    Scrape the website extracting html and links
  download  Download html markup to destination
  help      Print this message or the help of the given subcommand(s)

Options:
  -d, --url <DOMAIN>                   Domain to crawl
  -r, --respect-robots-txt             Respect robots.txt file
  -s, --subdomains                     Allow sub-domain crawling
  -t, --tld                            Allow all tlds for domain
  -v, --verbose                        Print page visited on standard output
  -D, --delay <DELAY>                  Polite crawling delay in milliseconds
  -b, --blacklist-url <BLACKLIST_URL>  Comma separated string list of pages to not crawl or regex with feature enabled
  -u, --user-agent <USER_AGENT>        User-Agent
  -B, --budget <BUDGET>                Crawl Budget
  -h, --help                           Print help
  -V, --version                        Print version
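
For example, the blacklist flag takes a comma separated list of page paths to skip (the paths here are illustrative):

spider --url https://choosealicense.com --blacklist-url "/licenses/,/about/" crawl -o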

All features are available except the Website struct's on_link_find_callback configuration option.
