49 releases (25 stable)
Uses new Rust 2024
| new 4.2.15 | Feb 8, 2026 |
|---|---|
| 4.2.14 | Feb 7, 2026 |
| 4.2.6 | Jan 31, 2026 |
| 4.0.0-rc.23 | Dec 30, 2025 |
#170 in Parser implementations
967 downloads per month
Used in kreuzberg-cli
3MB
58K
SLoC
Kreuzberg
High-performance document intelligence library for Rust. Extract text, metadata, and structured information from PDFs, Office documents, images, and 56 formats.
This is the core Rust library that powers the Python, TypeScript, and Ruby bindings.
🚀 Version 4.2.15 Release This is a pre-release version. We invite you to test the library and report any issues you encounter.
Note: The Rust crate is not currently published to crates.io for this RC. Use git dependencies or language bindings (Python, TypeScript, Ruby) instead.
Installation
[dependencies]
kreuzberg = "4.0"
tokio = { version = "1", features = ["rt", "macros"] }
PDFium Linking Options
Kreuzberg offers flexible PDFium linking strategies for different deployment scenarios. Note: Language bindings (Python, TypeScript, Ruby, Java, Go, C#, PHP, Elixir) automatically bundle PDFium—no configuration needed. This section applies only to the Rust crate.
| Strategy | Feature Flag | Description | Use Case |
|---|---|---|---|
| Default (Dynamic) | None | Links to system PDFium at runtime | Development, system package users |
| Static | pdf-static |
Statically links PDFium into binary | Single binary distribution, no runtime dependencies |
| Bundled | pdf-bundled |
Downloads and embeds PDFium in binary | CI/CD, hermetic builds, largest binary size |
| System | pdf-system |
Uses system PDFium via pkg-config | Linux distributions with PDFium package |
Example Cargo.toml configurations:
# Default (dynamic linking)
[dependencies]
kreuzberg = "4.0"
# Static linking
[dependencies]
kreuzberg = { version = "4.0", features = ["pdf-static"] }
# Bundled in binary
[dependencies]
kreuzberg = { version = "4.0", features = ["pdf-bundled"] }
# System library (requires PDFium installed)
[dependencies]
kreuzberg = { version = "4.0", features = ["pdf-system"] }
For more details on feature flags and configuration options, see the Features documentation.
System Requirements
ONNX Runtime (for embeddings)
If using embeddings functionality, ONNX Runtime must be installed:
# macOS
brew install onnxruntime
# Ubuntu/Debian
sudo apt install libonnxruntime libonnxruntime-dev
# Windows (MSVC)
scoop install onnxruntime
# OR download from https://github.com/microsoft/onnxruntime/releases
Without ONNX Runtime, embeddings will raise MissingDependencyError with installation instructions.
Quick Start
use kreuzberg::{extract_file_sync, ExtractionConfig};
fn main() -> kreuzberg::Result<()> {
let config = ExtractionConfig::default();
let result = extract_file_sync("document.pdf", None, &config)?;
println!("{}", result.content);
Ok(())
}
Async Extraction
use kreuzberg::{extract_file, ExtractionConfig};
#[tokio::main]
async fn main() -> kreuzberg::Result<()> {
let config = ExtractionConfig::default();
let result = extract_file("document.pdf", None, &config).await?;
println!("{}", result.content);
Ok(())
}
Batch Processing
use kreuzberg::{batch_extract_file, ExtractionConfig};
#[tokio::main]
async fn main() -> kreuzberg::Result<()> {
let config = ExtractionConfig::default();
let files = vec!["doc1.pdf", "doc2.pdf", "doc3.pdf"];
let results = batch_extract_file(&files, None, &config).await?;
for result in results {
println!("{}", result.content);
}
Ok(())
}
OCR with Table Extraction
use kreuzberg::{extract_file_sync, ExtractionConfig, OcrConfig, TesseractConfig};
fn main() -> kreuzberg::Result<()> {
let config = ExtractionConfig {
ocr: Some(OcrConfig {
backend: "tesseract".to_string(),
language: "eng".to_string(),
tesseract_config: Some(TesseractConfig {
enable_table_detection: true,
..Default::default()
}),
}),
..Default::default()
};
let result = extract_file_sync("invoice.pdf", None, &config)?;
for table in &result.tables {
println!("{}", table.markdown);
}
Ok(())
}
Password-Protected PDFs
use kreuzberg::{extract_file_sync, ExtractionConfig, PdfConfig};
fn main() -> kreuzberg::Result<()> {
let config = ExtractionConfig {
pdf_options: Some(PdfConfig {
passwords: Some(vec!["password1".to_string(), "password2".to_string()]),
..Default::default()
}),
..Default::default()
};
let result = extract_file_sync("protected.pdf", None, &config)?;
Ok(())
}
Extract from Bytes
use kreuzberg::{extract_bytes_sync, ExtractionConfig};
use std::fs;
fn main() -> kreuzberg::Result<()> {
let data = fs::read("document.pdf")?;
let config = ExtractionConfig::default();
let result = extract_bytes_sync(&data, "application/pdf", &config)?;
println!("{}", result.content);
Ok(())
}
Features
The crate uses feature flags for optional functionality:
[dependencies]
kreuzberg = { version = "4.0", features = ["pdf", "excel", "ocr"] }
Available Features
| Feature | Description | Binary Size |
|---|---|---|
pdf |
PDF extraction via pdfium | +25MB |
excel |
Excel/spreadsheet parsing | +3MB |
office |
DOCX, PPTX extraction | +1MB |
email |
EML, MSG extraction | +500KB |
html |
HTML to markdown | +1MB |
xml |
XML streaming parser | +500KB |
archives |
ZIP, TAR, 7Z extraction | +2MB |
ocr |
OCR with Tesseract | +5MB |
language-detection |
Language detection | +100KB |
chunking |
Text chunking | +200KB |
quality |
Text quality processing | +500KB |
Feature Bundles
kreuzberg = { version = "4.0", features = ["full"] }
kreuzberg = { version = "4.0", features = ["server"] }
kreuzberg = { version = "4.0", features = ["cli"] }
PDF Support and Linking Options
Kreuzberg supports three PDFium linking strategies. Default is bundled-pdfium (best developer experience).
| Strategy | Feature | Use Case | Binary Size | Runtime Deps |
|---|---|---|---|---|
| Bundled (default) | bundled-pdfium |
Development, production | +8-15MB | None |
| Static | static-pdfium |
Docker, musl, standalone binaries | +200MB | None |
| System | system-pdfium |
Package managers, distros | +2MB | libpdfium.so |
Quick Start
# Default - bundled PDFium (recommended)
[dependencies]
kreuzberg = "4.0"
# Static linking (Docker, musl)
[dependencies]
kreuzberg = { version = "4.0", features = ["static-pdfium"] }
# System PDFium (package managers)
[dependencies]
kreuzberg = { version = "4.0", features = ["system-pdfium"] }
For detailed information, see the PDFium Linking Guide.
Note: Language bindings (Python, TypeScript, Ruby, Java, Go) automatically bundle PDFium. No configuration needed.
Documentation
API Documentation – Complete API reference with examples
https://docs.kreuzberg.dev – User guide and tutorials
License
MIT License - see LICENSE for details.
Dependencies
~16–67MB
~1M SLoC