#url #sitemap #crawler

app wls

Easily crawl multiple sitemaps and list URLs

1 unstable release

0.1.0 Feb 4, 2024

#985 in Command line utilities

MIT license

15KB
209 lines

wls

wls (web ls) makes it easy to crawl multiple sitemaps and list URLs. It can even automatically find sitemaps for a domain using robots.txt.

Usage

wls accepts multiple domains/sitemaps as arguments, and will print all found URLs to stdout:

$ wls docs.rs > urls.txt

$ head -n 6 urls.txt 
https://docs.rs/A-1/latest/A_1/
https://docs.rs/A-1/latest/A_1/all.html
https://docs.rs/A5/latest/A5/
https://docs.rs/A5/latest/A5/all.html
https://docs.rs/AAAA/latest/AAAA/
https://docs.rs/AAAA/latest/AAAA/all.html

$ grep /all.html urls.txt | wc -l
113191
# that's a lot of crates!

If an argument does not contain a slash, it is treated as a domain, and wls will automatically attempt to find sitemaps using robots.txt. For example, docs.rs uses the Sitemap: directive in its robots.txt file, so the following commands are equivalent:

$ wls docs.rs
$ wls https://docs.rs/robots.txt
$ wls https://docs.rs/sitemap.xml

wls will print logs to stderr when -v/--verbose is enabled:

$ wls -v docs.rs
   Found 1 sitemaps
    in robotstxt with url: https://docs.rs/robots.txt

   Found 26 sitemaps
    in sitemap with url: https://docs.rs/sitemap.xml
    in robotstxt with url: https://docs.rs/robots.txt

   Found 15934 URLs
    in sitemap with url: https://docs.rs/-/sitemap/a/sitemap.xml
    in sitemap with url: https://docs.rs/sitemap.xml
    in robotstxt with url: https://docs.rs/robots.txt

   Found 11170 URLs
    in sitemap with url: https://docs.rs/-/sitemap/b/sitemap.xml
    in sitemap with url: https://docs.rs/sitemap.xml
    in robotstxt with url: https://docs.rs/robots.txt

  ...

More options are available too:

$ wls --help
Usage: wls [OPTIONS] <URLS>...

Arguments:
  <URLS>...  Domains/sitemaps to crawl

Options:
  -T, --timeout <SECONDS>  Maximum response time [default: 30]
  -w, --wait <SECONDS>     Delay between requests [default: 0]
  -v, --verbose            Enable logs
  -h, --help               Print help
  -V, --version            Print version

Dependencies

~7–20MB
~305K SLoC