crawn
A utility for web crawling and scraping
Features
- Blazing fast – Built with Rust & tokio for async I/O and concurrency
- Smart filtering – URL-based keyword matching (no content fetching required)
- NDJSON output – One JSON object per line for easy streaming
- BFS crawling – Breadth-first traversal with configurable depth limits
- Rate limiting – Configurable request rate (default: ~2 req/sec)
- Error recovery – Gracefully handles network errors and broken links
- Rich logging – Colored, timestamped logs with context chains
Installation
Run this command (requires cargo):
cargo install crawn
Or build from source (requires git and cargo):
git clone https://github.com/Tahaa-Dev/crawn.git
cd crawn
cargo build --release
Usage
- Basic Crawling:
crawn https://example.com
- With Logging:
crawn -l crawler.log https://example.com
- Verbose Mode (Log All Requests):
crawn -v https://example.com | bat --language json
- Custom Depth Limit:
crawn -m 3 https://example.com > output.ndjson
- Full HTML:
crawn --include-content https://example.com | jq -s '.'
- Extracted text only:
crawn --include-text https://example.com | jq -s '.' > output.json
Output Format
Results are written as NDJSON (newline-delimited JSON):
{"URL": "https://example.com", "Title": "Example Domain", "Links": 12}
{"URL": "https://example.com/about", "Title": "About Us", "Links": 9}
{"URL": "https://example.com/contact", "Title": "Contact", "Links": 48}
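Because every line is a standalone JSON object, the output can be filtered with standard tools before it is saved. A small sketch using jq (assumed installed; the sample file below simply mirrors the records above):

```shell
# Write a sample of crawn's NDJSON output to a file.
cat <<'EOF' > sample.ndjson
{"URL": "https://example.com", "Title": "Example Domain", "Links": 12}
{"URL": "https://example.com/about", "Title": "About Us", "Links": 9}
{"URL": "https://example.com/contact", "Title": "Contact", "Links": 48}
EOF

# Keep only pages with more than 10 outgoing links and print their URLs.
jq -r 'select(.Links > 10) | .URL' sample.ndjson
```

In a real run, the sample file would be replaced by piping crawn's output directly into jq.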
- With --include-text:
{"URL": "https://example.com", "Title": "Example Domain", "Links": 27, "Text": "Example Domain\nThis domain is..."}
- With --include-content:
{"URL": "https://example.com", "Title": "Example Domain", "Links": 30, "Content": "<!DOCTYPE html>\n<html>..."}
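These optional fields can be pulled out with the same jq-style post-processing; a minimal sketch (jq assumed installed, sample record copied from above):

```shell
# Write one sample record containing the optional Text field.
cat <<'EOF' > text.ndjson
{"URL": "https://example.com", "Title": "Example Domain", "Links": 27, "Text": "Example Domain\nThis domain is..."}
EOF

# Print only the extracted text; \n inside the JSON string becomes a real newline.
jq -r '.Text' text.ndjson
```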
Logging
Log Levels:
- INFO (verbose mode only): Request logs
- WARN (always): Recoverable errors (404, network timeouts)
- FATAL (always): Unrecoverable errors (invalid URL, disk full)
Log Format:
2026-01-24 02:37:40.351 [INFO]: Sent request to URL: https://example.com
2026-01-24 02:37:41.123 [WARN]: Failed to fetch URL: https://example.com/broken-link
Cause: HTTP 404 Not Found
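Because the log format is line-oriented, a log file can be inspected with ordinary text tools. One possible sketch (the sample lines simply reproduce the format shown above):

```shell
# Write a small sample log in the format shown above.
cat <<'EOF' > crawler.log
2026-01-24 02:37:40.351 [INFO]: Sent request to URL: https://example.com
2026-01-24 02:37:41.123 [WARN]: Failed to fetch URL: https://example.com/broken-link
  Cause: HTTP 404 Not Found
EOF

# Count recoverable errors by tallying WARN entries.
grep -c '\[WARN\]' crawler.log
```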
Examples
- Crawl Documentation Site:
crawn https://doc.rust-lang.org/book/ > output.ndjson
- Crawl with Logging to custom file:
crawn -l crawler.log -v https://example.com | jq -s '.'
- Limit to 2 Levels Deep:
crawn -m 2 https://example.com > shallow.ndjson
Limitations
- Same-domain only (no external links by design)
- No JavaScript rendering (static HTML only)
- No authentication (public pages only)
Notes
- crawn is licensed under the MIT license.
- For specifics about contributing to crawn, see CONTRIBUTING.md.
- For new changes to crawn, see CHANGELOG.md.