app rsrpp-cli

A Rust project for research paper pdf

22 stable releases

1.0.25 Feb 7, 2026
1.0.21 Jan 5, 2026
1.0.18 Sep 14, 2025
1.0.16 Jun 23, 2025
1.0.3 Nov 30, 2024

#2325 in Parser implementations

MIT license

250KB
4.5K SLoC

RuSt Research Paper Parser (rsrpp)

The rsrpp library provides a set of tools for parsing research papers.

Quick Start

Prerequisites

  • Poppler: sudo apt install poppler-utils
  • OpenCV: sudo apt install libopencv-dev clang libclang-dev

Installation

To start using the rsrpp command-line tool, install it with cargo and check that it runs:

cargo install rsrpp-cli
rsrpp --help
A Rust project for research paper pdf.

Usage: rsrpp [OPTIONS] --pdf <PDF>

Options:
  -p, --pdf <PDF>  
  -o, --out <OUT>  
  -h, --help       Print help
  -V, --version    Print version

Releases

1.0.25
  • LLM-enhanced processing is now enabled by default
    • If OPENAI_API_KEY is not set, LLM is automatically disabled at runtime
    • Use --no-llm to explicitly disable
  • Fixed LLM section validation discarding sections from pages the LLM hadn't examined
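
The default-on behavior described for 1.0.25 can be pictured as a small decision function. This is a minimal sketch rather than the crate's actual code: the function name llm_enabled is hypothetical, and treating an empty OPENAI_API_KEY the same as an unset one is an assumption.

```rust
/// Hypothetical helper (not from rsrpp): decides whether LLM processing runs.
/// `no_llm_flag` mirrors the `--no-llm` option; `api_key` is the value of
/// `OPENAI_API_KEY`, if any.
fn llm_enabled(no_llm_flag: bool, api_key: Option<&str>) -> bool {
    if no_llm_flag {
        // An explicit opt-out always wins.
        return false;
    }
    // Assumption: a blank key counts as "not set".
    matches!(api_key, Some(key) if !key.trim().is_empty())
}
```

A caller could pass std::env::var("OPENAI_API_KEY").ok().as_deref() as the second argument.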
1.0.24
  • Fixed body text loss in Nature-format and non-standard papers:
    • Added section detection fallback for papers without "Abstract" heading
    • Capped table detection regions at 50% of page area to reject false positives
  • Improved math extraction accuracy:
    • Fixed LLM math extraction alignment bug
    • Reduced false positives and false negatives in heuristic math detection
    • Unified math output to LaTeX format inside <math> tags
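
The table-region cap from 1.0.24 amounts to an area comparison. The sketch below shows the idea under stated assumptions: Rect, MAX_TABLE_PAGE_FRACTION, and is_plausible_table are hypothetical names, and a region covering exactly 50% of the page is assumed to be accepted.

```rust
/// Hypothetical rectangle type, in page units (for example, PDF points).
#[derive(Debug, Clone, Copy)]
struct Rect {
    width: f64,
    height: f64,
}

impl Rect {
    fn area(&self) -> f64 {
        self.width * self.height
    }
}

/// The 50% cap mentioned in the 1.0.24 notes.
const MAX_TABLE_PAGE_FRACTION: f64 = 0.5;

/// Returns false for a candidate table region that covers more than half
/// of the page, since such a region is most likely a false positive.
fn is_plausible_table(region: Rect, page: Rect) -> bool {
    let page_area = page.area();
    page_area > 0.0 && region.area() <= MAX_TABLE_PAGE_FRACTION * page_area
}
```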
1.0.21
  • Fixed panic-causing unwrap() calls with proper error handling.
1.0.20
  • Fixed Poppler 25.12.0 compatibility on macOS.
1.0.19
  • Refactored fix_suffix_hyphens to support 31 compound word suffixes:
    • -based, -driven, -oriented, -aware, -agnostic, -independent, -dependent, -first, -native, -centric, -intensive, -bound, -safe, -free, -proof, -efficient, -optimized, -enabled, -powered, -ready, -capable, -compatible, -compliant, -level, -scale, -wide, -specific, -friendly, -facing, -like, -style
  • Added unit tests for suffix hyphenation functionality.
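
To make the suffix handling in 1.0.19 concrete, here is a rough sketch of one way such a fix could work. It is not the real fix_suffix_hyphens: the helper name rejoin_suffix_hyphens is hypothetical, the suffix list is shortened, and it assumes the problem being repaired is a stray space after the hyphen (as in "data- driven"), which the release note does not state.

```rust
/// A few of the suffixes listed in the 1.0.19 notes (the full list has 31).
const SUFFIXES: &[&str] = &["based", "driven", "oriented", "aware", "free", "level"];

/// Hypothetical helper: rewrites "data- driven" as "data-driven" when the
/// word after the dangling hyphen is a known compound suffix.
fn rejoin_suffix_hyphens(text: &str) -> String {
    let mut out = String::with_capacity(text.len());
    let mut rest = text;
    while let Some(pos) = rest.find("- ") {
        let (head, tail) = rest.split_at(pos);
        let after = &tail[2..]; // text following "- "
        // First alphabetic run after the gap, e.g. "driven".
        let next_word = after.split(|c: char| !c.is_alphabetic()).next().unwrap_or("");
        // Only join when the hyphen is attached to a preceding word.
        let attached = head.chars().last().map_or(false, |c| c.is_alphabetic());
        out.push_str(head);
        out.push('-');
        if !(attached && SUFFIXES.contains(&next_word)) {
            out.push(' '); // not a known compound: keep the original spacing
        }
        rest = after;
    }
    out.push_str(rest);
    out
}
```

Restricting the join to a fixed suffix list keeps ordinary dashes and list markers, such as "a - b", untouched.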
1.0.18
  • Updated how section titles are extracted from PDFs.
1.0.17
  • Restructured rsrpp.parser.
  • Updated how section titles are extracted from PDFs.
  • Updated tests.
1.0.16
  • Removed init_logger from rsrpp.
1.0.15
  • Fixed a typo.
  • Introduced the tracing logger.
1.0.14
  • Updated rsrpp version for rsrpp-cli.
1.0.11
  • Fixed a bug so that the XML loop terminates when the end of the file is reached.
1.0.10
  • Added verbose mode.
  • Fixed a bug in page number extraction.
1.0.9
  • Added new errors to handle invalid URLs.
1.0.8
  • Increased the maximum retry time for saving PDF files.
1.0.7
  • Fixed a bug: after converting to PDF, the program now waits until processing is complete.
1.0.4
  • Fixed bugs in get_pdf_info.
  • Made minor improvements.
1.0.3
1.0.2
  • Updated the Section module. content: String was replaced by content: Vec<TextBlock>.

Dependencies

~50–83MB
~1M SLoC