# edgequake-pdf2md

Convert PDF documents to clean Markdown using Vision Language Models.
edgequake-pdf2md is a Rust CLI and library that converts PDF files (local or URL) into well-structured Markdown using vision-capable LLMs. It rasterises each page with pdfium, sends the image to a VLM (GPT-4.1, Claude, Gemini, etc.), and post-processes the result into clean Markdown.
Inspired by pyzerox, rebuilt in Rust for speed and reliability.
## Features
- Multi-provider — AWS Bedrock (default), OpenAI, Anthropic, Google Gemini, Mistral AI, Azure, Ollama, or any OpenAI-compatible endpoint
- Fast — concurrent page processing with configurable parallelism
- Accurate — 10-rule post-processing pipeline fixes tables, removes hallucinations, normalises output
- Flexible — page selection, fidelity tiers, custom system prompts, streaming API
- Self-contained — pdfium (~5 MB) embedded in the binary by default; no runtime downloads, no env vars
- Cross-platform — macOS (arm64/x64), Linux (x64/aarch64), Windows (x64/arm64)
- Library + CLI — use as a Rust crate or standalone command-line tool
## Quick Start

Self-contained binary — zero runtime setup. Starting from v0.4.0, the PDFium engine (~5 MB) is embedded inside the binary at compile time (the `bundled` feature is now the default). No download is required at runtime, no `DYLD_LIBRARY_PATH`, no environment variables needed.

Build-time auto-download: if you don't set `PDFIUM_BUNDLE_LIB`, the correct pdfium library is downloaded automatically during `cargo build` and cached in `~/.cargo/pdfium-bundle/`. In download mode (without the `bundled` feature), use `PDFIUM_LIB_PATH` to point to an existing copy at runtime.
### 1. Set credentials

```bash
# AWS Bedrock (recommended — cheapest, default provider)
export AWS_ACCESS_KEY_ID="AKIA..."
export AWS_SECRET_ACCESS_KEY="..."
export AWS_REGION="eu-west-1"          # optional, default: us-east-1

# or
export OPENAI_API_KEY="sk-..."         # OpenAI
# or
export ANTHROPIC_API_KEY="sk-ant-..."  # Anthropic
# or
export GEMINI_API_KEY="AI..."          # Google Gemini
# or
export MISTRAL_API_KEY="..."           # Mistral AI (uses pixtral-12b-2409)
```
### 2. Build & run

```bash
cargo build --release

# Convert a PDF
./target/release/pdf2md document.pdf -o output.md

# Convert from a URL
./target/release/pdf2md https://arxiv.org/pdf/1706.03762 -o paper.md

# Inspect metadata (no API key needed)
./target/release/pdf2md --inspect-only document.pdf
```

Or install globally:

```bash
cargo install edgequake-pdf2md
pdf2md document.pdf -o output.md
```
## How It Works

```text
PDF ──▶ pdfium ──▶ PNG images ──▶ base64 ──▶  VLM API   ──▶ post-process ──▶ Markdown
        render     per page       encode     (concurrent)    10 rules        assembled
```
- Input — resolve local file or download from URL
- Render — rasterise pages to images via pdfium-render
- Encode — base64-encode each page image
- VLM — send images to a vision LLM with a structured system prompt
- Post-process — strip fences, fix tables, remove hallucinated images, normalise whitespace
- Assemble — join pages with optional separators and YAML front-matter
See docs/how-it-works.md for the full pipeline walkthrough with diagrams.
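As a rough illustration of the last two stages, fence-stripping and page assembly might look like the sketch below. The function names and the exact rule logic are assumptions for illustration, not the crate's actual API:

```rust
/// Illustrative sketch (not the crate's real post-processor): remove a
/// wrapping Markdown code fence that VLMs often emit around their output.
fn strip_outer_fence(page: &str) -> String {
    let trimmed = page.trim();
    if trimmed.starts_with("```") {
        let lines: Vec<&str> = trimmed.lines().collect();
        if lines.len() >= 2 && lines.last().map_or(false, |l| l.trim() == "```") {
            return lines[1..lines.len() - 1].join("\n");
        }
    }
    trimmed.to_string()
}

/// Join per-page Markdown with an optional separator (e.g. "---" for `hr`).
fn assemble(pages: &[String], separator: Option<&str>) -> String {
    let sep = match separator {
        Some(s) => format!("\n\n{s}\n\n"),
        None => "\n\n".to_string(),
    };
    pages
        .iter()
        .map(|p| strip_outer_fence(p))
        .collect::<Vec<_>>()
        .join(&sep)
}

fn main() {
    let pages = vec![
        "```markdown\n# Page 1\n```".to_string(),
        "# Page 2".to_string(),
    ];
    println!("{}", assemble(&pages, Some("---")));
}
```

The real pipeline applies ten such rules per page; this sketch shows only the general shape of a pure text-to-text pass.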
## Usage

```bash
# Basic conversion (uses AWS Bedrock by default)
pdf2md document.pdf -o output.md

# Specific pages
pdf2md --pages 1-5 document.pdf -o first_five.md

# High fidelity with a better model
pdf2md --fidelity tier3 --model gpt-4.1 --provider openai --dpi 200 paper.pdf -o paper.md

# Consistent formatting across pages (sequential mode)
pdf2md --maintain-format --separator hr book.pdf -o book.md

# JSON output with metadata
pdf2md --json --metadata document.pdf > output.json

# Use a different Bedrock model
pdf2md --provider bedrock --model amazon.nova-pro-v1:0 document.pdf

# Use Anthropic
pdf2md --provider anthropic --model claude-sonnet-4-20250514 document.pdf

# Use Mistral (pixtral-12b-2409 auto-selected as vision model)
export MISTRAL_API_KEY=your-key
pdf2md document.pdf
# or explicitly:
pdf2md --provider mistral --model pixtral-12b-2409 document.pdf

# Use local Ollama
pdf2md --provider ollama --model llava document.pdf

# Resumable conversion with checkpoints (v0.7)
pdf2md --checkpoint-dir ./checkpoints big-doc.pdf -o out.md

# Resume after interruption (re-run the same command)
pdf2md --checkpoint-dir ./checkpoints big-doc.pdf -o out.md

# Force a fresh conversion, clearing existing checkpoints
pdf2md --checkpoint-dir ./checkpoints --no-resume big-doc.pdf -o out.md
```

Run `pdf2md --help` for the full reference, including supported models and cost estimates.

See docs/examples.md for more usage patterns.
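To make the `--pages` semantics concrete, a minimal parser for specs like `all`, `7`, or `1-5` could be sketched as follows. This is an assumption-laden sketch: the real CLI may accept a richer grammar, and the `PageSelection` enum here merely mirrors the library type of the same name:

```rust
/// Sketch of a `--pages` parser: accepts "all", a single page ("7"),
/// or an inclusive range ("1-5"). Page numbers are 1-based, as in the CLI.
#[derive(Debug, PartialEq)]
enum PageSelection {
    All,
    Single(usize),
    Range(usize, usize),
}

fn parse_pages(spec: &str) -> Result<PageSelection, String> {
    let spec = spec.trim();
    if spec.eq_ignore_ascii_case("all") {
        return Ok(PageSelection::All);
    }
    if let Some((a, b)) = spec.split_once('-') {
        let start: usize = a.trim().parse().map_err(|_| format!("bad start: {a}"))?;
        let end: usize = b.trim().parse().map_err(|_| format!("bad end: {b}"))?;
        if start == 0 || end < start {
            return Err(format!("invalid range: {spec}"));
        }
        return Ok(PageSelection::Range(start, end));
    }
    spec.parse()
        .map(PageSelection::Single)
        .map_err(|_| format!("invalid page spec: {spec}"))
}

fn main() {
    println!("{:?}", parse_pages("1-5"));
}
```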
## Supported Providers & Models
| Provider | Model | Input $/1M | Output $/1M | Vision |
|---|---|---|---|---|
| Bedrock | amazon.nova-lite-v1:0 (default) | $0.06 | $0.24 | ✓ |
| Bedrock | amazon.nova-pro-v1:0 | $0.80 | $3.20 | ✓ |
| OpenAI | gpt-4.1-nano | $0.10 | $0.40 | ✓ |
| OpenAI | gpt-4.1-mini | $0.40 | $1.60 | ✓ |
| OpenAI | gpt-4.1 | $2.00 | $8.00 | ✓ |
| Anthropic | claude-sonnet-4-20250514 | $3.00 | $15.00 | ✓ |
| Anthropic | claude-haiku-4-20250514 | $0.80 | $4.00 | ✓ |
| Gemini | gemini-2.0-flash | $0.10 | $0.40 | ✓ |
| Gemini | gemini-2.5-pro | $1.25 | $10.00 | ✓ |
| Mistral | pixtral-12b-2409 | $0.15 | $0.15 | ✓ |
| Ollama | llava, llama3.2-vision | free | free | ✓ |
Cost estimate: A 50-page document costs ~$0.01 with amazon.nova-lite-v1:0, ~$0.02 with gpt-4.1-nano.
See docs/providers.md for detailed comparisons, cost calculators, and selection guide.
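The per-million-token prices in the table make back-of-the-envelope estimates straightforward. The per-page token counts below are rough assumptions (actual counts vary with DPI and page density), so treat the result as an order-of-magnitude figure:

```rust
/// Rough per-document cost estimate from $/1M-token prices.
/// The per-page token counts passed in are assumptions, not measurements.
fn estimate_cost(
    pages: u32,
    in_tokens_per_page: u64,
    out_tokens_per_page: u64,
    in_price_per_m: f64,
    out_price_per_m: f64,
) -> f64 {
    let input = pages as f64 * in_tokens_per_page as f64 * in_price_per_m / 1e6;
    let output = pages as f64 * out_tokens_per_page as f64 * out_price_per_m / 1e6;
    input + output
}

fn main() {
    // 50 pages, assuming ~2000 image tokens in and ~600 Markdown tokens out
    // per page, at amazon.nova-lite-v1:0 prices ($0.06 / $0.24 per 1M tokens).
    let cost = estimate_cost(50, 2000, 600, 0.06, 0.24);
    println!("~${cost:.3}"); // lands in the ~$0.01 ballpark quoted above
}
```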
## Library Usage

Add to your `Cargo.toml`:

```toml
[dependencies]
edgequake-pdf2md = "0.7"
tokio = { version = "1", features = ["full"] }
```
### Basic conversion

```rust
use edgequake_pdf2md::{convert, ConversionConfig};

#[tokio::main]
async fn main() -> Result<(), Box<dyn std::error::Error>> {
    let config = ConversionConfig::builder()
        .model("amazon.nova-lite-v1:0")
        .provider_name("bedrock")
        .pages(edgequake_pdf2md::PageSelection::Range(1, 5))
        .build()?;

    let output = convert("document.pdf", &config).await?;
    println!("{}", output.markdown);
    println!("Processed {}/{} pages", output.stats.processed_pages, output.stats.total_pages);
    Ok(())
}
```
### Convert PDF bytes in memory (v0.2)

No temp-file management needed — pass raw bytes directly:

```rust
use edgequake_pdf2md::{convert_from_bytes, ConversionConfig};

#[tokio::main]
async fn main() -> Result<(), Box<dyn std::error::Error>> {
    let bytes = std::fs::read("document.pdf")?; // or from DB / network
    let config = ConversionConfig::default();

    let output = convert_from_bytes(&bytes, &config).await?;
    println!("{}", output.markdown);
    Ok(())
}
```
### Per-page progress callbacks (v0.2)

```rust
use edgequake_pdf2md::{convert, ConversionConfig, ConversionProgressCallback};
use std::sync::Arc;

struct MyProgress;

impl ConversionProgressCallback for MyProgress {
    fn on_conversion_start(&self, total: usize) {
        eprintln!("Starting conversion of {total} pages");
    }
    fn on_page_complete(&self, page: usize, total: usize, chars: usize) {
        eprintln!("  ✓ Page {page}/{total} — {chars} chars");
    }
    fn on_page_error(&self, page: usize, total: usize, error: String) {
        eprintln!("  ✗ Page {page}/{total} failed: {error}");
    }
    fn on_conversion_complete(&self, total: usize, success: usize) {
        eprintln!("Done: {success}/{total} pages converted");
    }
}

#[tokio::main]
async fn main() -> Result<(), Box<dyn std::error::Error>> {
    let config = ConversionConfig::builder()
        .progress_callback(Arc::new(MyProgress) as Arc<dyn ConversionProgressCallback>)
        .build()?;

    let output = convert("document.pdf", &config).await?;
    println!("{}", output.markdown);
    Ok(())
}
```
### Strict error on partial failure (v0.2)

By default, page failures are non-fatal. Use `into_result()` to promote them to errors:

```rust
use edgequake_pdf2md::{convert, ConversionConfig};

#[tokio::main]
async fn main() -> Result<(), Box<dyn std::error::Error>> {
    let config = ConversionConfig::default();

    // into_result() returns Err(PartialFailure) if any pages failed
    let output = convert("document.pdf", &config).await?.into_result()?;
    println!("{}", output.markdown);
    Ok(())
}
```
### Resumable conversions with checkpoints (v0.7)

Enable page-level checkpointing for large documents (500–2000+ pages). If a conversion is interrupted, re-running with the same settings resumes from where it left off — already-completed pages are loaded instantly from the checkpoint store, skipping the render and VLM calls entirely.

```rust
use edgequake_pdf2md::{convert, ConversionConfig, FileCheckpointStore};
use std::sync::Arc;

#[tokio::main]
async fn main() -> Result<(), Box<dyn std::error::Error>> {
    let store = Arc::new(FileCheckpointStore::new("./checkpoints"));
    let config = ConversionConfig::builder()
        .checkpoint_store(store)
        .build()?;

    let output = convert("big-document.pdf", &config).await?;
    println!(
        "Processed {} pages ({} resumed from checkpoint)",
        output.stats.processed_pages,
        output.stats.resumed_pages,
    );
    Ok(())
}
```
Checkpoints are keyed by a deterministic conversion ID derived from the PDF content, provider, model, fidelity, and DPI. Changing any setting creates a separate checkpoint set. Checkpoints are automatically cleared when all pages succeed.
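A deterministic key of that shape can be sketched with the standard library's hasher. This is illustrative only: the crate presumably uses a stronger, stable content hash, whereas `DefaultHasher` is not guaranteed stable across Rust releases. The point is that every setting affecting the output feeds the key, so any change yields a separate checkpoint set:

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

/// Illustrative sketch of a conversion ID: hash the PDF bytes together
/// with every setting that affects the output.
fn conversion_id(pdf_bytes: &[u8], provider: &str, model: &str,
                 fidelity: &str, dpi: u32) -> u64 {
    let mut h = DefaultHasher::new();
    pdf_bytes.hash(&mut h);
    provider.hash(&mut h);
    model.hash(&mut h);
    fidelity.hash(&mut h);
    dpi.hash(&mut h);
    h.finish()
}

fn main() {
    let pdf = b"%PDF-1.7 ...";
    let a = conversion_id(pdf, "bedrock", "amazon.nova-lite-v1:0", "tier2", 150);
    let b = conversion_id(pdf, "bedrock", "amazon.nova-lite-v1:0", "tier2", 200);
    assert_ne!(a, b); // changing the DPI yields a different checkpoint set
    println!("{a:x}");
}
```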
### Provider injection (v0.2)

Pass a pre-built `Arc<dyn LLMProvider>` directly — useful for sharing providers across multiple conversions and for testing with mocks:

```rust
use edgequake_pdf2md::{convert, ConversionConfig};
use edgequake_llm::ProviderFactory;
use std::sync::Arc;

#[tokio::main]
async fn main() -> Result<(), Box<dyn std::error::Error>> {
    let (provider, _) = ProviderFactory::from_env()?;
    let config = ConversionConfig::builder()
        .provider(Arc::clone(&provider)) // injected; highest priority
        .build()?;

    let output = convert("document.pdf", &config).await?;
    println!("{}", output.markdown);
    Ok(())
}
```
Provider resolution order (highest to lowest priority):

1. `config.provider` — explicit `Arc<dyn LLMProvider>` injection
2. `config.provider_name` + `config.model` — named provider
3. `EDGEQUAKE_LLM_PROVIDER` + `EDGEQUAKE_MODEL` environment variables
4. Auto-detect from credentials (`AWS_ACCESS_KEY_ID`, `OPENAI_API_KEY`, `ANTHROPIC_API_KEY`, `MISTRAL_API_KEY`, …)
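That precedence can be sketched as a simple cascade. The function below is a hypothetical illustration (the real resolver lives in `edgequake-llm`); the environment is passed in as a map so the logic is easy to test:

```rust
use std::collections::HashMap;

/// Sketch of the provider-resolution cascade. `env` stands in for the
/// process environment variables.
fn resolve_provider(
    injected: Option<&str>, // config.provider (an already-built provider)
    named: Option<&str>,    // config.provider_name
    env: &HashMap<String, String>,
) -> Option<String> {
    if let Some(p) = injected {
        return Some(p.to_string()); // 1. explicit injection
    }
    if let Some(p) = named {
        return Some(p.to_string()); // 2. named provider
    }
    if let Some(p) = env.get("EDGEQUAKE_LLM_PROVIDER") {
        return Some(p.clone());     // 3. env-var override
    }
    // 4. auto-detect from whichever credentials are present
    for (var, provider) in [
        ("AWS_ACCESS_KEY_ID", "bedrock"),
        ("OPENAI_API_KEY", "openai"),
        ("ANTHROPIC_API_KEY", "anthropic"),
        ("MISTRAL_API_KEY", "mistral"),
    ] {
        if env.contains_key(var) {
            return Some(provider.to_string());
        }
    }
    None
}

fn main() {
    let mut env = HashMap::new();
    env.insert("OPENAI_API_KEY".to_string(), "sk-...".to_string());
    println!("{:?}", resolve_provider(None, None, &env)); // Some("openai")
}
```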
Also available: a streaming API (`convert_stream`, `convert_stream_from_bytes`), a sync wrapper (`convert_sync`), and metadata inspection (`inspect`).
See API docs on docs.rs for the full API reference.
## Configuration

All options can be set via CLI flags, environment variables, or the builder API:

| Flag | Env variable | Default | Description |
|---|---|---|---|
| `--model` | `EDGEQUAKE_MODEL` | `amazon.nova-lite-v1:0` | VLM model |
| `--provider` | `EDGEQUAKE_PROVIDER` | auto-detect | LLM provider |
| `--dpi` | `PDF2MD_DPI` | `150` | Rendering resolution (72–400) |
| `--pages` | `PDF2MD_PAGES` | all | Page selection |
| `--fidelity` | `PDF2MD_FIDELITY` | `tier2` | Quality tier (tier1/tier2/tier3) |
| `-c, --concurrency` | `PDF2MD_CONCURRENCY` | `10` | Parallel API calls |
| `--maintain-format` | `PDF2MD_MAINTAIN_FORMAT` | `false` | Sequential mode |
| `--separator` | `PDF2MD_SEPARATOR` | none | Page separator |
| `--temperature` | `PDF2MD_TEMPERATURE` | `0.1` | LLM temperature |
| `--checkpoint-dir` | `PDF2MD_CHECKPOINT_DIR` | — | Checkpoint directory |
| `--no-resume` | `PDF2MD_NO_RESUME` | `false` | Clear checkpoints, fresh run |
See docs/configuration.md for the complete reference.
## Development

```bash
# Setup
make setup        # Check pdfium + API key

# Build
make build        # Release binary
make build-dev    # Debug binary

# Test
make test         # Unit tests (no API key needed)
make test-e2e     # Integration tests (needs API key)
make test-all     # All tests

# Quality
make lint         # Clippy
make fmt          # Format code
make ci           # format + lint + unit tests

# Try it
make demo         # Convert sample page
make inspect-all  # Inspect test PDFs
```
## Documentation
| Document | Description |
|---|---|
| docs/how-it-works.md | Pipeline architecture with ASCII diagrams |
| docs/installation.md | Setup guide for all platforms |
| docs/providers.md | Supported models, pricing, selection guide |
| docs/configuration.md | All CLI flags and environment variables |
| docs/examples.md | Real-world usage examples |
## Dependencies
| Crate | Purpose |
|---|---|
| pdfium-render | PDF rasterisation via Google's pdfium C++ library |
| edgequake-llm | Multi-provider LLM abstraction (AWS Bedrock, OpenAI, Anthropic, Gemini, Azure, Ollama, etc.) — v0.3.0+ |
| tokio | Async runtime |
| image | Image encoding (PNG/JPEG) |
| clap | CLI argument parsing |
Python users: `edgequake-litellm` v0.1.3 (PyPI) is a drop-in LiteLLM replacement backed by `edgequake-llm`. It supports Azure OpenAI via `model="azure/<deployment>"`.
## External References
- pdfium — Google's open-source PDF rendering engine
- pdfium-binaries — Pre-built pdfium binaries for all platforms
- pyzerox — The Python project that inspired this tool
- OpenAI Vision API — Image understanding with GPT-4.1
- Anthropic Vision — Image understanding with Claude
- Google Gemini — Vision capabilities
## License

Copyright 2026 Raphaël MANSUY

Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with the License. You may obtain a copy of the License at

    http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License.
See LICENSE for the full text.