parser-core

A library for extracting text from PDF, DOCX, XLSX, and PPTX documents, plain-text files, and images (via OCR).

3 releases

0.1.3 Mar 19, 2025
0.1.1 Mar 17, 2025
0.1.0 Mar 17, 2025

237 downloads per month
Used in 2 crates

MIT license

5.5MB
492 lines

Parser Core

The core engine of the parser project, providing functionality for extracting text from various file formats.

Features

  • Parse a wide variety of document formats:
    • PDF files (.pdf)
    • Office documents (.docx, .xlsx, .pptx)
    • Plain text files (.txt, .csv, .json)
    • Images with OCR (.png, .jpg, .webp)
  • Automatic format detection
  • Parallel processing support via Rayon

Dependencies

This crate requires the following system dependencies to be installed:

  • Tesseract OCR - Used for image text extraction
  • Leptonica - Image processing library used by Tesseract
  • Clang - Required for some build dependencies

Installation on Debian/Ubuntu

sudo apt install libtesseract-dev libleptonica-dev libclang-dev

Installation on macOS

brew install tesseract

Installation on Windows

Follow the installation instructions in the Tesseract GitHub repository.

Usage

Add the crate as a dependency by running the following in your project directory (this updates Cargo.toml for you):

cargo add parser-core

Basic usage:

use parser_core::parse;

fn main() -> Result<(), Box<dyn std::error::Error>> {
    // Read a file
    let data = std::fs::read("document.pdf")?;
    
    // Parse the document
    let text = parse(&data)?;
    
    println!("Extracted text: {}", text);
    
    Ok(())
}
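
To process several documents at once, the same call can be applied to each input concurrently. The crate lists Rayon for this; the sketch below is a std-only stand-in that uses scoped threads so it compiles without extra dependencies. The helper name parse_all and the parse_fn parameter are hypothetical and not part of parser-core; in real code you would pass a closure that wraps parser_core::parse.

```rust
use std::thread;

/// Hypothetical helper: run `parse_fn` on every input concurrently and
/// return the results in the same order as the inputs.
///
/// Assumption: this spawns one scoped thread per input, which is simpler
/// than (and different from) Rayon's work-stealing thread pool.
pub fn parse_all<T, E, F>(inputs: &[Vec<u8>], parse_fn: F) -> Vec<Result<T, E>>
where
    T: Send,
    E: Send,
    F: Fn(&[u8]) -> Result<T, E> + Sync,
{
    // Share the parser by reference so every thread can call it.
    let parse_fn = &parse_fn;

    thread::scope(|scope| {
        // Start one worker per input buffer.
        let handles: Vec<_> = inputs
            .iter()
            .map(|data| scope.spawn(move || parse_fn(data.as_slice())))
            .collect();

        // Join in spawn order, which preserves the input order.
        handles
            .into_iter()
            .map(|handle| handle.join().expect("parser thread panicked"))
            .collect()
    })
}
```

Each result is kept separate, so one unreadable document does not abort the whole batch. For large batches a bounded pool such as Rayon's is usually the better fit, since one thread per file does not scale.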

Architecture

The crate is organized around a central parse function that:

  1. Detects the MIME type of the provided data
  2. Routes to the appropriate parser module
  3. Returns the extracted text

Each parser is implemented in its own module:

  • docx.rs - Microsoft Word documents
  • pdf.rs - PDF documents
  • xlsx.rs - Microsoft Excel spreadsheets
  • pptx.rs - Microsoft PowerPoint presentations
  • text.rs - Plain text formats, including CSV and JSON
  • image.rs - Image formats using OCR
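
The detect-then-route flow described above can be sketched with a few well-known file signatures. This is an illustration only, not the crate's implementation: the Format enum and the detect_format and route functions are hypothetical names, and the real crate may rely on a dedicated MIME-detection library instead of hand-written checks.

```rust
/// Hypothetical set of formats, mirroring the module list above.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Format {
    Pdf,
    /// DOCX, XLSX and PPTX are all ZIP containers, so the signature
    /// alone cannot tell them apart.
    OfficeZip,
    Png,
    Jpeg,
    WebP,
    Text,
}

/// Guess the format from the leading bytes of the data.
pub fn detect_format(data: &[u8]) -> Option<Format> {
    if data.is_empty() {
        None
    } else if data.starts_with(b"%PDF-") {
        Some(Format::Pdf)
    } else if data.starts_with(b"PK\x03\x04") {
        Some(Format::OfficeZip)
    } else if data.starts_with(b"\x89PNG\r\n\x1a\n") {
        Some(Format::Png)
    } else if data.starts_with(&[0xFF, 0xD8, 0xFF]) {
        Some(Format::Jpeg)
    } else if data.len() >= 12 && data[..4] == *b"RIFF" && data[8..12] == *b"WEBP" {
        Some(Format::WebP)
    } else if std::str::from_utf8(data).is_ok() {
        // No binary signature matched: treat valid UTF-8 as plain text.
        Some(Format::Text)
    } else {
        None
    }
}

/// Name the module that would handle a detected format.
pub fn route(format: Format) -> &'static str {
    match format {
        Format::Pdf => "pdf.rs",
        Format::OfficeZip => "docx.rs, xlsx.rs or pptx.rs (decided by the ZIP contents)",
        Format::Png | Format::Jpeg | Format::WebP => "image.rs",
        Format::Text => "text.rs",
    }
}
```

Checking binary signatures before falling back to a UTF-8 test keeps the text branch from claiming files that merely begin with printable bytes, such as a PDF header.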

Development

Testing

Run tests with:

cargo test

Benchmarking

Benchmark sequential vs. parallel parsing:

cargo bench
