#crawler #spider #website-crawler #url-finder #site-map-generator

nightly bin+lib website_crawler

Crawl all URLs on a website, async and sync.

3 releases

0.1.2 Mar 27, 2021
0.1.1 Mar 27, 2021
0.1.0 Mar 27, 2021

#7 in #spider

30 downloads per month

MIT license

26KB
125 lines

crawler

Crawls websites to gather all possible URLs.

Getting Started

Make sure to have Rust installed.

Make sure to create a .env file and add CRAWL_URL=http://0.0.0.0:8080/api/website-crawl. Replace the CRAWL_URL value with your production endpoint that should accept the results. A valid endpoint to receive the hook is required for the crawler to work.

  1. curl --proto '=https' --tlsv1.2 https://sh.rustup.rs -sSf | sh
  2. cargo run
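The .env setup from the note above can be sketched as follows (the endpoint value is the placeholder from this README; swap in your own):

```shell
# Create the .env file the crawler reads at startup.
# CRAWL_URL is the webhook endpoint that will receive crawl results.
printf 'CRAWL_URL=http://0.0.0.0:8080/api/website-crawl\n' > .env

# Then build and start the service:
# cargo run
```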

Docker

You can start the service with Docker by running docker build -t crawler . && docker run -dp 8000:8000 crawler.

Compose

Use the Docker image jeffmendez19/crawler.
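A minimal docker-compose.yml sketch using that image; the service name, port mapping, and the CRAWL_URL value are assumptions based on the Docker and ENV sections of this README:

```yaml
version: "3"
services:
  crawler:
    image: jeffmendez19/crawler
    ports:
      - "8000:8000"
    environment:
      # endpoint that receives crawl results (see the ENV section)
      - CRAWL_URL=http://api:8080/api/website-crawl-background
```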

Crate

You can install the program as a crate from crates.io with cargo install website_crawler.

API

crawl - asynchronously determine all URLs on a website and send the results to the post hook

POST

http://localhost:8000/crawl

Body: { "url": "https://www.a11ywatch.com", "id": 0 }
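The request above can be sketched with curl against a locally running instance; the Content-Type header and fully quoted JSON body are assumptions, so adapt the url and id to your setup:

```shell
# JSON body for the crawl endpoint (keys and string values quoted,
# unlike the shorthand shown above)
BODY='{ "url": "https://www.a11ywatch.com", "id": 0 }'

# Send it to a locally running instance (assumes `cargo run` or the
# Docker container is listening on port 8000):
# curl -X POST http://localhost:8000/crawl \
#   -H 'Content-Type: application/json' \
#   -d "$BODY"
echo "$BODY"
```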

ENV

CARGO_RELEASE=false // determine if prod/dev build
ROCKET_ENV=dev // determine api env
CRAWL_URL="http://api:8080/api/website-crawl-background" // endpoint to send results

LICENSE

Check the license file in the root of the project.

Dependencies

~17MB
~360K SLoC