5 releases
| Version | Date |
|---|---|
| 0.1.4 | Dec 7, 2025 |
| 0.1.3 | Nov 14, 2025 |
| 0.1.2 | Nov 5, 2025 |
| 0.1.1 | Nov 5, 2025 |
| 0.1.0 | Nov 5, 2025 |
snagger
Grab the full text of paginated articles where the next page is controlled by a ?page=N query parameter. Snagger discovers the page count, scrapes each page concurrently, and assembles the cleaned content into a single file per source URL.
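To illustrate the `?page=N` mechanics described above, the sketch below builds the URL for page N by rewriting only the `page` query parameter. It uses the `url` crate, and the `page_url` helper name is an assumption for this example, not part of snagger's code.

```rust
// Hypothetical sketch: rewrite only the `page` query parameter while keeping
// any other parameters intact. Uses the `url` crate; names are illustrative.
use url::Url;

fn page_url(base: &Url, page: u32) -> Url {
    // Keep every existing query pair except `page`.
    let keep: Vec<(String, String)> = base
        .query_pairs()
        .map(|(k, v)| (k.into_owned(), v.into_owned()))
        .filter(|(k, _)| k != "page")
        .collect();
    let mut next = base.clone();
    next.set_query(None);
    {
        // The URL is updated when this serializer is dropped.
        let mut pairs = next.query_pairs_mut();
        for (k, v) in &keep {
            pairs.append_pair(k, v);
        }
        pairs.append_pair("page", &page.to_string());
    }
    next
}

fn main() {
    let base = Url::parse("https://example.com/articles/gamma?page=1").unwrap();
    assert_eq!(
        page_url(&base, 3).as_str(),
        "https://example.com/articles/gamma?page=3"
    );
}
```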
Highlights
- Async Rust CLI with polite throttling, concurrency control, and optional proxy support
- Automatic pagination discovery via `rel="last"`, anchor inspection, or custom selector/regex pairs
- Flexible content extraction using CSS selectors with heuristics for common article layouts
- Produces wrapped plaintext output alongside a crawl log CSV for auditability
Getting Started
Prerequisites
- Rust 1.75+ with `cargo` on your PATH
- OpenSSL or platform-native TLS libraries required by `reqwest`
Installation
Install the crate:
cargo install snagger
Update existing install:
cargo install snagger --force
Or build locally:
git clone https://github.com/fibnas/snagger
cd snagger
cargo run --release
Input File Format
- Plain text file
- One URL per line
- Lines beginning with `#` and blank lines are ignored
Example (links.txt):
https://example.com/articles/alpha
# https://example.com/articles/beta (commented out)
https://example.com/articles/gamma?page=1
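As a minimal sketch of how a seed file in this format could be read (trim each line, skip blank lines and `#` comments), the `read_seeds` helper below is hypothetical and not part of snagger's API:

```rust
// Illustrative seed-file parsing: keep only non-blank, non-comment lines.
use std::fs;

fn read_seeds(path: &str) -> std::io::Result<Vec<String>> {
    let text = fs::read_to_string(path)?;
    Ok(text
        .lines()
        .map(str::trim)
        .filter(|line| !line.is_empty() && !line.starts_with('#'))
        .map(String::from)
        .collect())
}

fn main() -> std::io::Result<()> {
    for url in read_seeds("links.txt")? {
        println!("{url}");
    }
    Ok(())
}
```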
Usage
snagger links.txt --out snags --concurrency 8 --discover-pages
This command:
- Reads seeds from `links.txt`
- Writes merged plaintext files to the `snags/` directory (creating it when absent)
- Allows up to eight concurrent crawls
- Enables pagination discovery to stop once the last page is detected
Direct Downloads
- Use `--download-ext <EXT>` to download URLs whose path ends in that extension instead of scraping
- Repeat the flag or provide comma-separated values to allow multiple extensions (e.g. `--download-ext zip --download-ext pdf`)
- Files are saved using the slug with the requested extension, such as `example.com__archive.zip`
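A rough sketch of how such an extension check could be applied to the URL path only (so query strings do not affect the decision); the `matches_download_ext` helper is hypothetical and uses the `url` crate:

```rust
// Illustrative check: does the URL path end in one of the allowed extensions?
use url::Url;

fn matches_download_ext(u: &Url, exts: &[&str]) -> bool {
    let path = u.path().to_ascii_lowercase();
    exts.iter()
        .any(|e| path.ends_with(&format!(".{}", e.to_ascii_lowercase())))
}

fn main() {
    let u = Url::parse("https://example.com/archive.ZIP?dl=1").unwrap();
    assert!(matches_download_ext(&u, &["zip", "pdf"]));
}
```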
Output Layout
- `snags/<slug>.txt` – wrapped article text built from the fetched pages
- `snags/_crawl_log.csv` – crawl metadata containing: `url`, `slug`, `pages_fetched`, `last_page_url`, `chars`, `seconds`, `status`
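The column list above suggests a row shape like the sketch below; the struct, the example values, and the simple comma-joined formatting are illustrative only (a real writer would also quote fields containing commas, e.g. via the `csv` crate):

```rust
// Hypothetical crawl-log row mirroring the columns listed above.
struct CrawlLogRow {
    url: String,
    slug: String,
    pages_fetched: u32,
    last_page_url: String,
    chars: usize,
    seconds: f64,
    status: String,
}

fn csv_line(r: &CrawlLogRow) -> String {
    // Fields containing commas would need quoting in real CSV output.
    format!(
        "{},{},{},{},{},{:.1},{}",
        r.url, r.slug, r.pages_fetched, r.last_page_url, r.chars, r.seconds, r.status
    )
}

fn main() {
    let row = CrawlLogRow {
        url: "https://example.com/articles/alpha".into(),
        slug: "example.com__articles_alpha".into(),
        pages_fetched: 4,
        last_page_url: "https://example.com/articles/alpha?page=4".into(),
        chars: 18234,
        seconds: 6.2,
        status: "ok".into(),
    };
    println!("{}", csv_line(&row));
}
```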
Throttling and Reliability
- Delay between page fetches is randomized within the configured range (`--delay 0.4 1.2` by default)
- HTTP timeout defaults to 20s; adjust via `--timeout`
- Set `--stop-on-repeat` to end pagination when a repeated MD5 hash indicates duplicate content
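A minimal sketch of both ideas, assuming the `rand`, `md5`, and `tokio` crates; the function names are illustrative and this is not snagger's internal code:

```rust
// Illustrative throttling + duplicate detection: sleep a random duration in
// [lo, hi] between fetches, and hash each page body to spot repeats.
use rand::Rng;
use std::time::Duration;

async fn polite_delay(lo: f64, hi: f64) {
    let secs = rand::thread_rng().gen_range(lo..=hi);
    tokio::time::sleep(Duration::from_secs_f64(secs)).await;
}

fn page_hash(body: &str) -> md5::Digest {
    md5::compute(body.as_bytes())
}

#[tokio::main]
async fn main() {
    polite_delay(0.4, 1.2).await;
    // With --stop-on-repeat, pagination would stop when two consecutive
    // pages hash to the same digest.
    assert!(page_hash("same body") == page_hash("same body"));
}
```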
Content Extraction
- Provide a CSS selector with `--selector ".main-article"` to target a specific container
- Without a selector, Snagger attempts common article heuristics before falling back to full-page text
- The minimum characters per page defaults to 200 (`--min-chars`), stopping early if pages are mostly boilerplate
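A rough sketch of selector-based extraction with a character floor, assuming the `scraper` crate; it is meant to illustrate the behaviour described above, not to mirror snagger's implementation:

```rust
// Illustrative extraction: pull text from a CSS-selected container and
// reject pages that fall below a minimum character count.
use scraper::{Html, Selector};

fn extract_text(html: &str, css: &str, min_chars: usize) -> Option<String> {
    let doc = Html::parse_document(html);
    let sel = Selector::parse(css).ok()?;
    let text = doc
        .select(&sel)
        .flat_map(|el| el.text())
        .collect::<Vec<_>>()
        .join(" ");
    (text.chars().count() >= min_chars).then_some(text)
}

fn main() {
    let html = r#"<div class="main-article"><p>Some article text.</p></div>"#;
    match extract_text(html, ".main-article", 5) {
        Some(text) => println!("{text}"),
        None => println!("page looked like boilerplate"),
    }
}
```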
Page Count Discovery
- Enable with `--discover-pages`
- Optionally scope page metadata via `--pages-selector "nav.pagination"`
- Supply a custom regex with a capture group (e.g. `--pages-regex "(?i)page\\s*\\d+\\s*of\\s*(\\d+)"`)
- Falls back to the `--max-pages` hard cap (default 20) when discovery fails
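For the regex path specifically, a capture-group pattern like the example above can be applied as in this sketch, using the `regex` crate (in a Rust raw string the backslashes are written singly):

```rust
// Illustrative only: parse "Page X of Y" style text, capturing the total.
use regex::Regex;

fn total_pages(text: &str) -> Option<u32> {
    let re = Regex::new(r"(?i)page\s*\d+\s*of\s*(\d+)").ok()?;
    re.captures(text)?.get(1)?.as_str().parse().ok()
}

fn main() {
    assert_eq!(total_pages("Page 1 of 12"), Some(12));
    assert_eq!(total_pages("no pagination markers here"), None);
}
```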
Networking
- Concurrency is capped with `--concurrency`
- Configure an HTTPS proxy via `--proxy http://host:port`
- Custom headers include a desktop browser `User-Agent` to reduce blocks
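As a sketch of how those options might map onto a `reqwest` client (per-request timeout, optional proxy, desktop `User-Agent`); the builder wiring and the UA string below are assumptions, not snagger's exact configuration:

```rust
// Illustrative client construction: timeout, optional proxy, desktop UA.
use std::time::Duration;

fn build_client(proxy: Option<&str>, timeout_secs: u64) -> reqwest::Result<reqwest::Client> {
    let mut builder = reqwest::Client::builder()
        .user_agent("Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0 Safari/537.36")
        .timeout(Duration::from_secs(timeout_secs));
    if let Some(p) = proxy {
        builder = builder.proxy(reqwest::Proxy::all(p)?);
    }
    builder.build()
}

fn main() -> reqwest::Result<()> {
    // Pass Some("http://proxy.example:8080") to route traffic through a proxy.
    let _client = build_client(None, 20)?;
    Ok(())
}
```

Bounded concurrency on top of such a client is commonly expressed with `futures::stream::buffer_unordered` or a `tokio::sync::Semaphore`, either of which would honour the `--concurrency` cap described above.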
Command Reference
| Flag | Default | Description |
|---|---|---|
| `--out <DIR>` | `snags` | Output directory for scraped files and crawl logs |
| `--concurrency <N>` | `6` | Maximum concurrent subjects to crawl |
| `--selector <CSS>` | none | CSS selector for the main content container |
| `--min-chars <N>` | `200` | Minimum characters per page before keeping it |
| `--stop-on-repeat` | `false` | Stop when consecutive pages hash to the same value |
| `--wrap-width <N>` | `96` | Column width for output wrapping (0 disables wrapping) |
| `--timeout <SECS>` | `20` | HTTP timeout per request |
| `--delay <LO HI>` | `0.4 1.2` | Randomized per-page delay range in seconds |
| `--max-pages <N>` | `20` | Hard cap on pagination depth |
| `--discover-pages` | `false` | Attempt to detect the total page count |
| `--pages-selector <CSS>` | none | Scope for page-count detection |
| `--pages-regex <REGEX>` | none | Explicit regex for the total-page capture group |
| `--proxy <URL>` | none | HTTPS proxy endpoint |
| `--download-ext <EXT>` | none | Direct-download URLs ending in `.<EXT>` (repeat/comma-separated) |
Development
cargo check
cargo fmt
cargo clippy --all-targets -- -D warnings
Tests are not currently configured; feel free to contribute fixtures that capture additional pagination patterns.
Troubleshooting
- Ensure the target site allows crawling; respect robots and rate limits
- Increase `--timeout` or widen `--delay` if encountering rate limiting or slow responses
- Use `--selector` to avoid pulling navigation or comments into the combined text
- Inspect `_crawl_log.csv` for subjects marked `empty` or early termination causes
License
Licensed under the MIT License. See LICENSE for details.