6 releases

0.1.4 Aug 12, 2023
0.1.3-alpha May 15, 2023
0.1.2 May 7, 2023
0.1.1 May 7, 2023
0.1.0 May 7, 2023

MIT license

Waper

Waper is a CLI tool for scraping HTML websites. Here is a simple usage example:

waper --seed-links "https://example.com/" --whitelist "https://example.com/.*" --whitelist "https://www.iana.org/domains/example" 

This will scrape "https://example.com/" and save the HTML of each whitelisted link it finds to a SQLite database named waper_out.sqlite.

Installation

cargo install waper

CLI Usage

A CLI tool to scrape HTML websites

Usage: waper [OPTIONS]
       waper <COMMAND>

Commands:
  scrape      This is also default command, so it's optional to include in args
  completion  Print shell completion script
  help        Print this message or the help of the given subcommand(s)

Options:
  -w, --whitelist <WHITELIST>
          whitelist regexes: only these urls will be scanned other then seeds
  -b, --blacklist <BLACKLIST>
          blacklist regexes: these urls will never be scanned By default nothing will be blacklisted [default: a^]
  -s, --seed-links <SEED_LINKS>
          Links to start with
  -o, --output-file <OUTPUT_FILE>
          Sqlite output file [default: waper_out.sqlite]
  -m, --max-parallel-requests <MAX_PARALLEL_REQUESTS>
          Sqlite output file [default: 5]
  -i, --include-db-links
          Will also include unprocessed links from `links` table in db if present. Helpful when you want to continue the scraping from a previously unfinished session
  -v, --verbose
          Should verbose (debug) output
  -h, --help
          Print help
  -V, --version
          Print version
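
The interplay of --whitelist and --blacklist can be illustrated with a small Python sketch. This is not Waper's implementation: `should_scan` is a hypothetical helper, it assumes patterns are matched anywhere in the URL (unanchored), and Python's regex dialect may differ in details from the one Waper uses. Seed links, which the help text says are scanned regardless of the whitelist, are not modelled here.

```python
import re

# Default blacklist pattern taken from the CLI help; "a^" can never match.
DEFAULT_BLACKLIST = ("a^",)


def should_scan(url, whitelist, blacklist=DEFAULT_BLACKLIST):
    """Hypothetical helper: True if url matches a whitelist regex and no blacklist regex."""
    # Assumption: each pattern is searched anywhere in the URL (unanchored).
    allowed = any(re.search(pattern, url) for pattern in whitelist)
    blocked = any(re.search(pattern, url) for pattern in blacklist)
    return allowed and not blocked


whitelist = [r"https://example.com/.*", r"https://www.iana.org/domains/example"]

print(should_scan("https://example.com/about", whitelist))     # True
print(should_scan("https://www.iana.org/help", whitelist))     # False
print(should_scan("https://example.com/file.pdf", whitelist, [r"\.pdf$"]))  # False
```

With the default blacklist nothing is ever blocked, so only the whitelist decides; passing a blacklist pattern such as `\.pdf$` excludes URLs even when they are whitelisted.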

Querying data

Data is stored in a SQLite database whose schema is defined in ./sqls/INIT.sql. There are three tables:

  1. results: Stores the content of every request for which a response was received
  2. errors: Stores the error message for every request that could not be completed
  3. links: Stores the URLs of both visited and unvisited links

Results can be queried using any SQLite client. Example using the sqlite3 CLI:

$ sqlite3 waper_out.sqlite 'select url, time, length(html) from results'
https://example.com/|2023-05-07 06:47:33|1256
https://www.iana.org/domains/example|2023-05-07 06:47:39|80

For more readable output, you can change the sqlite3 display settings:

$ sqlite3 waper_out.sqlite '.headers on' '.mode column' 'select url, time, length(html) from results'
url                                   time                 length(html)
------------------------------------  -------------------  ------------
https://example.com/                  2023-05-07 06:47:33  1256
https://www.iana.org/domains/example  2023-05-07 06:47:39  80
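
The same query can also be run from a script. The following is a minimal Python sketch using the standard `sqlite3` module. `summarize_results` and `demo_connection` are hypothetical helpers, and the demo table is an in-memory stand-in that contains only the columns used in the queries above (`url`, `time`, `html`); the real schema in ./sqls/INIT.sql may define more.

```python
import sqlite3


def summarize_results(conn):
    """Return (url, time, html length) for every row in the results table."""
    query = "select url, time, length(html) from results order by time"
    return conn.execute(query).fetchall()


def demo_connection():
    """Build an in-memory stand-in database (assumed columns, not the real INIT.sql)."""
    conn = sqlite3.connect(":memory:")
    conn.execute("create table results (url text, time text, html text)")
    conn.executemany(
        "insert into results values (?, ?, ?)",
        [
            ("https://example.com/", "2023-05-07 06:47:33", "<html>example</html>"),
            ("https://www.iana.org/domains/example", "2023-05-07 06:47:39", "<html></html>"),
        ],
    )
    return conn


for url, time, size in summarize_results(demo_connection()):
    print(url, time, size)
```

To run it against real output, replace `demo_connection()` with `sqlite3.connect("waper_out.sqlite")`.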

To quickly search through all the URLs, you can use fzf:

sqlite3 waper_out.sqlite 'select url from links' | fzf
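
Because the links table holds both visited and unvisited URLs, it can be useful to list the links that have no stored result. The Python sketch below does this under the assumption that `links` and `results` each have a `url` column, as the queries above suggest; `links_without_result` is a hypothetical helper, and the in-memory tables are stand-ins for the real schema. A link that failed and was recorded in the errors table would also appear in this list.

```python
import sqlite3


def links_without_result(conn):
    """Return URLs present in links that have no matching row in results."""
    query = """
        select l.url from links as l
        where not exists (select 1 from results as r where r.url = l.url)
        order by l.url
    """
    return [row[0] for row in conn.execute(query)]


# In-memory stand-in tables (assumed columns, not the real INIT.sql).
conn = sqlite3.connect(":memory:")
conn.execute("create table links (url text)")
conn.execute("create table results (url text, time text, html text)")
conn.executemany(
    "insert into links values (?)",
    [("https://example.com/",), ("https://example.com/about",)],
)
conn.execute(
    "insert into results values (?, ?, ?)",
    ("https://example.com/", "2023-05-07 06:47:33", "<html></html>"),
)

print(links_without_result(conn))  # ['https://example.com/about']
```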

Planned improvements

  • Allow users to specify priority for urls, so some urls can be scraped before others
  • Support complex rate-limits
  • Allow continuation of previously stopped scraping
    • Should continue working on IP roaming (auto-detect and continue)
  • Explicitly handle redirects
  • Allow users to modify parts of the request (such as the user-agent)
  • Improve storage efficiency by compressing/de-duplicating the HTML
  • Provide more visibility into how many URLs are queued, the rate at which they are being processed, etc.
  • Support JS execution using ... (v8 or webkit, not many options)

Feedback

If you find any bugs or have any feature suggestions, please file an issue on GitHub.
