
bin+lib twars-url2md

A powerful CLI tool that fetches web pages and converts them to clean Markdown format using Monolith for content extraction and htmd for conversion

14 stable releases

new 1.4.2 Apr 7, 2025
1.3.6 Feb 4, 2025


twars-url2md


twars-url2md is a fast and robust command-line tool written in Rust that fetches web pages, cleans up their HTML content, and converts them into clean Markdown.

You can feed the tool any text that contains URLs. It finds every URL in that text and saves a Markdown version of each page in a logical folder structure. The output is not perfect, but the tool is fast and robust.


2. Features

2.1. Powerful Web Content Conversion

  • Extracts clean web content using Monolith
  • Converts web pages to Markdown efficiently
  • Handles complex URL and encoding scenarios

2.2. Smart URL Handling

  • Extracts URLs from various text formats
  • Resolves and validates URLs intelligently
  • Supports base URL and relative link processing
  • NEW: Processes local HTML files in addition to remote URLs
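To preview roughly which URLs a piece of text contains before handing it to the tool, a small shell helper can help. This is only a sketch: extract_urls is a hypothetical name, and the pattern is a deliberate simplification that is not taken from the tool's own extraction logic.

```shell
# Hypothetical helper: read text on stdin and print each unique http(s) URL.
# The pattern stops at whitespace, quotes, angle brackets and ")".
# It is a simplification, not the tool's actual extraction logic.
extract_urls() {
  grep -oE 'https?://[^[:space:]"<>)]+' | LC_ALL=C sort -u
}

# Possible usage (notes.txt is a placeholder file name):
#   extract_urls < notes.txt
```

The tool performs its own URL extraction, so this step is optional. It is mainly useful for checking an input file by eye before a large run.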

2.3. Flexible Input & Output

  • Multiple input methods (file, stdin, CLI)
  • Organized Markdown file generation
  • Cross-platform compatibility
  • NEW: Option to pack all Markdown outputs into a single combined file

2.4. Advanced Processing

  • Parallel URL processing
  • Robust error handling
  • Exponential backoff retry mechanism for network requests
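The general idea behind an exponential backoff retry can be sketched in a few lines of shell. This illustrates the technique only: retry_with_backoff is a hypothetical helper, and the attempt count and delays are assumptions rather than the values the tool uses internally.

```shell
# Hypothetical helper: run a command, doubling the wait after each failure.
# Usage: retry_with_backoff MAX_ATTEMPTS BASE_DELAY_SECONDS COMMAND [ARGS...]
# Returns 0 as soon as the command succeeds, 1 once all attempts have failed.
retry_with_backoff() {
  max_attempts=$1
  delay=$2
  shift 2
  attempt=1
  while :; do
    "$@" && return 0                 # success: stop retrying
    if [ "$attempt" -ge "$max_attempts" ]; then
      return 1                       # out of attempts
    fi
    sleep "$delay"
    delay=$((delay * 2))             # exponential growth: d, 2d, 4d, ...
    attempt=$((attempt + 1))
  done
}

# Possible usage: up to 4 attempts, waiting 1s, 2s and 4s between them.
#   retry_with_backoff 4 1 curl -fsS -o page.html "https://example.com"
```

Doubling the delay gives a struggling server progressively more time to recover, while the attempt limit keeps the total wait bounded.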

3. Installation

3.1. Download Pre-compiled Binaries

The easiest way to get started is to download the pre-compiled binary for your platform.

  1. Visit the releases page
  2. Download the appropriate file for your system:
    • macOS: twars-url2md-macos-universal.tar.gz (works on both Intel and Apple Silicon)
    • Windows: twars-url2md-windows-x86_64.exe.zip
    • Linux: twars-url2md-linux-x86_64.tar.gz
  3. Extract the archive:
    • macOS/Linux: tar -xzf twars-url2md-*.tar.gz
    • Windows: Extract the zip file using Explorer or any archive utility
  4. Make the binary executable (macOS/Linux only): chmod +x twars-url2md
  5. Move the binary to a location in your PATH:
    • macOS/Linux: sudo mv twars-url2md /usr/local/bin/ or mv twars-url2md ~/.local/bin/
    • Windows: Move to a folder in your PATH or add the folder to your PATH

3.2. Install from Crates.io

If you have Rust installed (version 1.70.0 or later), you can install directly from crates.io:

cargo install twars-url2md

3.3. Build from Source

For the latest version or to customize the build:

# Clone the repository
git clone https://github.com/twardoch/twars-url2md.git
cd twars-url2md

# Build and install
cargo build --release
sudo mv target/release/twars-url2md /usr/local/bin/  # or any location in your PATH

4. Usage

4.1. Command Line Options

Usage: twars-url2md [OPTIONS]

Options:
  -i, --input <FILE>       Input file containing URLs or local file paths (one per line)
  -o, --output <DIR>       Output directory for markdown files
      --stdin              Read URLs from standard input
      --base-url <URL>     Base URL for resolving relative links
  -p, --pack <FILE>        Output file to pack all markdown files together
  -v, --verbose            Enable verbose output
  -h, --help               Print help
  -V, --version            Print version

4.2. Input Options

The tool accepts URLs and local file paths from:

  • A file specified with --input
  • Standard input with --stdin
  • Note: Either --input or --stdin must be specified

4.3. Output Options

  • --output <DIR>: Create individual Markdown files in this directory
  • --pack <FILE>: Combine all Markdown files into a single output file
  • You can use both options together

4.4. Processing Local Files

You can now include local HTML files in your input:

  • Absolute paths: /path/to/file.html
  • File URLs: file:///path/to/file.html
  • Mix of local files and remote URLs in the same input
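Since both absolute paths and file:// URLs are accepted, a list of local files can be normalized to one form before processing. The helper below is a sketch under stated assumptions: to_file_url is a hypothetical name, relative paths are resolved against the current directory, and no percent-encoding is applied, so paths with spaces or special characters would need extra handling.

```shell
# Hypothetical helper: turn a local path into a file:// URL.
# Assumes the path needs no percent-encoding (no spaces or special characters).
to_file_url() {
  case $1 in
    /*) printf 'file://%s\n' "$1" ;;            # already absolute
    *)  printf 'file://%s/%s\n' "$PWD" "$1" ;;  # resolve against the current directory
  esac
}

# Possible usage (page.html is a placeholder file name):
#   to_file_url page.html >> mixed.txt
```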

5. Examples

5.1. Basic Usage

# Process a single URL and print to stdout
echo "https://example.com" | twars-url2md --stdin

# Process URLs from a file with specific output directory
twars-url2md --input urls.txt --output ./markdown_output

# Process piped URLs with base URL for relative links
cat urls.txt | twars-url2md --stdin --base-url "https://example.com" --output ./output

# Show verbose output
twars-url2md --input urls.txt --output ./output --verbose

5.2. Using the Pack Option

# Process URLs and create a combined Markdown file
twars-url2md --input urls.txt --pack combined.md

# Both individual files and a combined file
twars-url2md --input urls.txt --output ./output --pack combined.md

5.3. Processing Local Files

# Create a test HTML file
echo "<html><body><h1>Test</h1><p>Content</p></body></html>" > test.html

# Process a local HTML file
echo "$PWD/test.html" > local_paths.txt
twars-url2md --input local_paths.txt --output ./output

# Mix local and remote content
cat > mixed.txt << EOF
https://example.com
file://$PWD/test.html
EOF
twars-url2md --input mixed.txt --pack combined.md

5.4. Batch Processing

# Extract and process links from a webpage
curl "https://en.wikipedia.org/wiki/Rust_(programming_language)" | twars-url2md --stdin --output rust_wiki/

# Process multiple local files (an absolute start directory makes find print absolute paths)
find "$PWD/html_files" -name "*.html" > files_to_process.txt
twars-url2md --input files_to_process.txt --output ./markdown_output --pack all_content.md

6. Output Organization

The tool organizes output into a directory structure based on the URLs:

output/
├── example.com/
│   ├── index.md       # from https://example.com/
│   └── articles/
│       └── page.md    # from https://example.com/articles/page
└── another-site.com/
    └── post/
        └── article.md # from https://another-site.com/post/article

For local files, the directory structure mirrors the file path.
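The URL-to-path mapping shown in the tree above can be approximated with shell parameter expansion. This is a sketch that only reproduces the examples in the tree: url_to_md_path is a hypothetical helper, and its treatment of query strings, fragments and trailing slashes is an assumption, not the tool's documented behavior.

```shell
# Hypothetical helper: map a URL to a relative Markdown path.
# Assumptions: an empty path or a trailing slash maps to "index",
# and query strings and fragments are dropped.
url_to_md_path() {
  rest=${1#*://}          # strip the scheme
  rest=${rest%%[?#]*}     # drop query string and fragment
  case $rest in
    */*) host=${rest%%/*}; path=${rest#*/} ;;
    *)   host=$rest; path= ;;
  esac
  case $path in
    ''|*/) path=${path}index ;;
  esac
  printf '%s/%s.md\n' "$host" "$path"
}

# Possible usage:
#   url_to_md_path "https://example.com/articles/page"
```

A helper like this can be handy for predicting where a given page will land inside the output directory, for example when post-processing the generated files in a script.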

7. Development

7.1. Running Tests

# Run all tests
cargo test

# Run with specific features
cargo test --all-features

# Run specific test
cargo test test_name

7.2. Code Quality Tools

  • Formatting: cargo fmt
  • Linting: cargo clippy --all-targets --all-features

7.3. Publishing

To publish a new release of twars-url2md:

7.3.1. Prepare for Release

# Update version in Cargo.toml (e.g. from 1.3.6 to 1.3.7)
# Ensure everything works
cargo test
cargo clippy --all-targets --all-features
cargo fmt --check

7.3.2. Build Locally

# Build in release mode
cargo build --release

# Test the binary
./target/release/twars-url2md --help

7.3.3. Publish to Crates.io

# Login to crates.io (if not already logged in)
cargo login

# Verify the package
cargo package

# Publish
cargo publish

7.3.4. Create GitHub Release

# Create and push a tag matching your version
git tag -a v1.3.7 -m "Release v1.3.7"
git push origin v1.3.7
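Because the tag has to match the version in Cargo.toml, it can help to derive the tag name from the manifest instead of typing it by hand. The helper below is a minimal sketch: cargo_version is a hypothetical name, and it assumes the [package] version is the first line in the file that starts with version.

```shell
# Hypothetical helper: print the first top-level `version = "..."` value in a manifest.
# Assumes the [package] version is the first line starting with "version".
cargo_version() {
  sed -n 's/^version[[:space:]]*=[[:space:]]*"\([^"]*\)".*/\1/p' "$1" | head -n 1
}

# Possible usage before tagging:
#   tag="v$(cargo_version Cargo.toml)"
#   git tag -a "$tag" -m "Release $tag"
#   git push origin "$tag"
```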

The configured GitHub Actions workflow (.github/workflows/ci.yml) will automatically:

  • Run tests on the tag
  • Create a GitHub Release
  • Build binaries for macOS, Windows, and Linux
  • Upload the binaries to the release
  • Publish to crates.io

7.3.5. Manual Release (Alternative)

If GitHub Actions fails, you can create the release manually:

  1. Go to GitHub repository → Releases → Create a new release
  2. Select your tag
  3. Build platform-specific binaries:
# macOS universal binary
cargo build --release --target x86_64-apple-darwin
cargo build --release --target aarch64-apple-darwin
lipo "target/x86_64-apple-darwin/release/twars-url2md" "target/aarch64-apple-darwin/release/twars-url2md" -create -output "target/twars-url2md"
tar czf twars-url2md-macos-universal.tar.gz -C target twars-url2md

# Linux
cargo build --release --target x86_64-unknown-linux-gnu
tar czf twars-url2md-linux-x86_64.tar.gz -C target/x86_64-unknown-linux-gnu/release twars-url2md

# Windows
cargo build --release --target x86_64-pc-windows-msvc
cd target/x86_64-pc-windows-msvc/release
7z a ../../../twars-url2md-windows-x86_64.exe.zip twars-url2md.exe
  4. Upload these files to your GitHub release

7.3.6. Verify the Release

  • Check that the release appears on GitHub
  • Verify that binary files are attached to the release
  • Confirm the new version appears on crates.io
  • Try installing the new version: cargo install twars-url2md

8. License

MIT License - see LICENSE for details.

9. Author

Adam Twardoch (@twardoch)


For bug reports, feature requests, or general questions, please open an issue on the GitHub repository.
