3 unstable releases

0.2.0	Apr 28, 2022
0.1.2	Jul 8, 2021
0.1.0	Aug 16, 2020

MIT/Apache

SuckIT

SuckIT allows you to recursively visit and download a website's content to your disk.

Features

Vacuums the entirety of a website recursively
Uses multithreading
Writes the website's content to your disk
Enables offline navigation
Offers random delays to avoid IP banning
Saves application state on CTRL-C for later pickup
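
The random-delay feature can be pictured as a fixed pause plus a random extra. The sketch below is a minimal Python illustration, not SuckIT's actual code: it assumes the extra is drawn uniformly between 0 and the --random-range value and added to --delay, as the option descriptions state. The function name download_delay is hypothetical, and whether the tool draws whole or fractional seconds is not documented here.

```python
import random


def download_delay(delay: float = 0.0, random_range: float = 0.0) -> float:
    """Return the pause before the next download, in seconds.

    Assumption: the extra delay is uniform on [0, random_range] and is
    added to the fixed base delay (mirrors --delay and --random-range).
    """
    if delay < 0 or random_range < 0:
        raise ValueError("delays must be non-negative")
    return delay + random.uniform(0.0, random_range)


if __name__ == "__main__":
    # With --delay 2 --random-range 3, each pause falls between 2 and 5 seconds.
    print(download_delay(2, 3))
```

With both values left at their default of 0, the pause is always 0, so downloads run back to back.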

Options

USAGE:
    suckit [FLAGS] [OPTIONS] <url>

FLAGS:
    -c, --continue-on-error                  Flag to enable or disable exit on error
        --dry-run                            Do everything without saving the files to the disk
    -h, --help                               Prints help information
    -V, --version                            Prints version information
    -v, --verbose                            Enable more information regarding the scraping process
        --visit-filter-is-download-filter    Use the download filter in/exclude regexes for visiting as well

OPTIONS:
    -a, --auth <auth>...
            HTTP basic authentication credentials space-separated as "username password host". Can be repeated for
            multiple credentials as "u1 p1 h1 u2 p2 h2"
        --delay <delay>
            Add a delay in seconds between downloads to reduce the likelihood of getting banned [default: 0]

    -d, --depth <depth>
            Maximum recursion depth to reach when visiting. Default is -1 (infinity) [default: -1]

    -e, --exclude-download <exclude-download>
            Regex filter to exclude saving pages that match this expression [default: $^]

        --exclude-visit <exclude-visit>
            Regex filter to exclude visiting pages that match this expression [default: $^]

        --ext-depth <ext-depth>
            Maximum recursion depth to reach when visiting external domains. Default is 0. -1 means infinity [default:
            0]
    -i, --include-download <include-download>
            Regex filter to limit to only saving pages that match this expression [default: .*]

        --include-visit <include-visit>
            Regex filter to limit to only visiting pages that match this expression [default: .*]

    -j, --jobs <jobs>                            Maximum number of threads to use concurrently [default: 1]
    -o, --output <output>                        Output directory
        --random-range <random-range>
            Generate an extra random delay between downloads, from 0 to this number. This is added to the base delay
            seconds [default: 0]
    -t, --tries <tries>                          Maximum amount of retries on download failure [default: 20]
    -u, --user-agent <user-agent>                User agent to be used for sending requests [default: suckit]

ARGS:
    <url>    Entry point of the scraping
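
The include and exclude filters are easier to reason about with a small model. The sketch below is written in Python and is only an approximation of how the options might combine: it assumes a page is kept when it matches the include regex and does not match the exclude regex, and that the regex is searched against the full URL. Neither detail is confirmed by the help text above, and passes_filter is a hypothetical name. Python's regex syntax also differs slightly from the one the tool uses.

```python
import re


def passes_filter(url: str, include: str = r".*", exclude: str = r"$^") -> bool:
    """Return True if `url` matches `include` and does not match `exclude`.

    The defaults mirror the documented ones: `.*` matches every URL,
    while `$^` cannot match any non-empty, single-line string.
    """
    included = re.search(include, url) is not None
    excluded = re.search(exclude, url) is not None
    return included and not excluded


if __name__ == "__main__":
    # Defaults keep everything.
    print(passes_filter("http://books.toscrape.com/index.html"))
    # Skip images, in the spirit of: -e '\.(jpg|png)$'
    print(passes_filter("http://books.toscrape.com/media/cover.jpg",
                        exclude=r"\.(jpg|png)$"))
```

Under this model, leaving both filters at their defaults keeps every page, which matches the behaviour the defaults suggest.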

Example

A common use case could be the following:

suckit http://books.toscrape.com -j 8 -o /path/to/downloaded/pages/
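
If you drive suckit from a script, it can help to build the argument list programmatically. The following Python sketch is one possible wrapper, not part of SuckIT: build_suckit_command is a hypothetical helper that only uses flags listed under Options. Placing -a last, so its space-separated values cannot be confused with other arguments, is a precaution assumed here rather than documented behaviour.

```python
import shlex


def build_suckit_command(url, output, jobs=1, delay=0, random_range=0, auth=()):
    """Build an argv list for suckit.

    `auth` is a sequence of (username, password, host) triples, flattened
    into the "u1 p1 h1 u2 p2 h2" form described for -a/--auth.
    """
    cmd = ["suckit", url, "-j", str(jobs), "-o", output]
    if delay:
        cmd += ["--delay", str(delay)]
    if random_range:
        cmd += ["--random-range", str(random_range)]
    if auth:
        cmd.append("-a")
        for username, password, host in auth:
            cmd += [username, password, host]
    return cmd


cmd = build_suckit_command(
    "http://books.toscrape.com", "/path/to/downloaded/pages/", jobs=8
)
# Prints the same command line as the example above.
print(shlex.join(cmd))
# To actually run it: subprocess.run(cmd, check=True)
```

Passing the list to subprocess.run avoids shell quoting problems with passwords or regex filters that contain special characters.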

Installation

As of right now, SuckIT does not work on Windows.

To install it, you need to have Rust installed. See the official Rust installation instructions if you do not have it yet.

If you just want to install the suckit executable, run:

cargo install --git https://github.com/skallwar/suckit

You can then run it from anywhere with the suckit command.

Arch Linux

suckit is available as an AUR package and can be installed with an AUR helper. For example:

yay -S suckit

Want to contribute? Feel free to open an issue or submit a PR!

License

SuckIT is primarily distributed under the terms of both the MIT license and the Apache License (Version 2.0).

See LICENSE-APACHE and LICENSE-MIT for details.
