3 releases
| Version | Released |
|---|---|
| 0.1.3 (new) | Mar 19, 2025 |
| 0.1.1 | Mar 17, 2025 |
| 0.1.0 | Mar 17, 2025 |
Parser Core
The core engine of the parser project, providing functionality for extracting text from various file formats.
Features
- Parse a wide variety of document formats:
  - PDF files (.pdf)
  - Office documents (.docx, .xlsx, .pptx)
  - Plain text files (.txt, .csv, .json)
  - Images with OCR (.png, .jpg, .webp)
- Automatic format detection
- Parallel processing support via Rayon
Dependencies
This package requires the following system libraries:
- Tesseract OCR - Used for image text extraction
- Leptonica - Image processing library used by Tesseract
- Clang - Required for some build dependencies
Installation on Debian/Ubuntu
sudo apt install libtesseract-dev libleptonica-dev libclang-dev
Installation on macOS
brew install tesseract
Installation on Windows
Follow the installation instructions in the Tesseract GitHub repository.
Usage
Add as a dependency in your Cargo.toml:
cargo add parser-core
Basic usage:
use parser_core::parse;

fn main() -> Result<(), Box<dyn std::error::Error>> {
    // Read a file
    let data = std::fs::read("document.pdf")?;
    // Parse the document
    let text = parse(&data)?;
    println!("Extracted text: {}", text);
    Ok(())
}
Architecture
The crate is organized around a central parse function that:
- Detects the MIME type of the provided data
- Routes to the appropriate parser module
- Returns the extracted text
Each parser is implemented in its own module:
- docx.rs - Microsoft Word documents
- pdf.rs - PDF documents
- xlsx.rs - Microsoft Excel spreadsheets
- pptx.rs - Microsoft PowerPoint presentations
- text.rs - Plain text formats, including CSV and JSON
- image.rs - Image formats using OCR
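The detect-then-route flow described above can be illustrated with a small standard-library sketch. This is not the crate's implementation: the Format enum, detect_format, and parse_sketch are hypothetical names, and the detection here checks only a few well-known leading byte signatures rather than producing a full MIME type.

```rust
/// Formats named in this README. The enum and the functions below are
/// hypothetical and are not taken from the crate's source.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Format {
    Pdf,
    /// .docx, .xlsx and .pptx are all ZIP containers, so the leading
    /// bytes alone cannot tell them apart.
    OfficeZip,
    Png,
    Jpeg,
    Webp,
    Text,
    Unknown,
}

/// Guess the format from well-known leading "magic" bytes.
pub fn detect_format(data: &[u8]) -> Format {
    if data.starts_with(b"%PDF-") {
        Format::Pdf
    } else if data.starts_with(b"PK\x03\x04") {
        Format::OfficeZip
    } else if data.starts_with(b"\x89PNG\r\n\x1a\n") {
        Format::Png
    } else if data.starts_with(b"\xFF\xD8\xFF") {
        Format::Jpeg
    } else if data.len() >= 12 && data.starts_with(b"RIFF") && data[8..].starts_with(b"WEBP") {
        Format::Webp
    } else if std::str::from_utf8(data).is_ok() {
        // Anything that decodes as UTF-8 is treated as plain text.
        Format::Text
    } else {
        Format::Unknown
    }
}

/// Route to a per-format handler and return the extracted text.
/// Only the plain-text branch is filled in for this sketch.
pub fn parse_sketch(data: &[u8]) -> Result<String, String> {
    match detect_format(data) {
        Format::Text => Ok(String::from_utf8_lossy(data).into_owned()),
        Format::Unknown => Err("unsupported format".to_string()),
        other => Err(format!("{:?} parsing is not implemented in this sketch", other)),
    }
}
```

Calling parse_sketch on the bytes of a CSV file returns its contents unchanged, while PDF or image bytes are recognized but reported as unimplemented. In a real dispatcher each match arm would call into the corresponding module from the list above.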
Development
Testing
Run tests with:
cargo test
Benchmarking
Benchmark sequential vs. parallel parsing:
cargo bench
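To show what a sequential versus parallel comparison involves, here is a minimal sketch that runs without extra dependencies. It uses std::thread::scope as a stand-in for Rayon, and fake_parse, run_sequential, and run_parallel are hypothetical names; the crate's actual benchmarks are not shown in this README.

```rust
use std::thread;

/// Stand-in for a per-document parser: counts whitespace-separated words.
/// Hypothetical; the real crate's `parse` extracts text instead.
fn fake_parse(data: &[u8]) -> usize {
    String::from_utf8_lossy(data).split_whitespace().count()
}

/// Process every document on the calling thread.
pub fn run_sequential(docs: &[Vec<u8>]) -> Vec<usize> {
    docs.iter().map(|d| fake_parse(d)).collect()
}

/// Process documents on scoped threads, one chunk per worker.
/// Results come back in input order.
pub fn run_parallel(docs: &[Vec<u8>], workers: usize) -> Vec<usize> {
    let workers = workers.max(1);
    // Ceiling division, clamped to at least 1 because `chunks(0)` panics.
    let chunk_size = ((docs.len() + workers - 1) / workers).max(1);
    thread::scope(|s| {
        // Spawn all workers first so they run concurrently.
        let handles: Vec<_> = docs
            .chunks(chunk_size)
            .map(|chunk| s.spawn(move || chunk.iter().map(|d| fake_parse(d)).collect::<Vec<_>>()))
            .collect();
        // Join in spawn order to preserve the input ordering.
        handles
            .into_iter()
            .flat_map(|h| h.join().expect("worker panicked"))
            .collect()
    })
}
```

Both functions return the same results for the same input, so a benchmark only needs to time each call. With Rayon, the chunking and joining would typically collapse into a single par_iter() call over the documents.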