22 stable releases
| Version | Release date |
|---|---|
| 1.0.25 | Feb 7, 2026 |
| 1.0.21 | Jan 5, 2026 |
| 1.0.18 | Sep 14, 2025 |
| 1.0.16 | Jun 23, 2025 |
| 1.0.3 | Nov 30, 2024 |
RuSt Research Paper Parser (rsrpp)
The rsrpp library provides a set of tools for parsing research papers.
Quick Start
Pre-requirements
- Poppler: `sudo apt install poppler-utils`
- OpenCV: `sudo apt install libopencv-dev clang libclang-dev`
Installation
To start using rsrpp from the command line, install the rsrpp-cli crate with Cargo, then check the available options:
cargo install rsrpp-cli
rsrpp --help
A Rust project for research paper pdf.
Usage: rsrpp [OPTIONS] --pdf <PDF>
Options:
-p, --pdf <PDF>
-o, --out <OUT>
-h, --help Print help
-V, --version Print version
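The two documented options can also be driven from another Rust program. The following is only a sketch, assuming the `rsrpp` binary is already installed and on `PATH`; the helper name `rsrpp_command` is hypothetical, and it uses nothing beyond the `--pdf` and `--out` options shown in the help text.

```rust
use std::process::Command;

/// Build (but do not run) an `rsrpp` invocation.
/// Only the documented `--pdf` and `--out` options are used; the caller
/// supplies both values.
fn rsrpp_command(pdf: &str, out: &str) -> Command {
    let mut cmd = Command::new("rsrpp");
    cmd.arg("--pdf").arg(pdf).arg("--out").arg(out);
    cmd
}
```

Calling `.status()` on the returned `Command` would run the tool and report its exit status; building the command separately keeps the argument handling easy to check.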
Releases
1.0.25
- LLM-enhanced processing is now enabled by default
  - If `OPENAI_API_KEY` is not set, LLM is automatically disabled at runtime
  - Use `--no-llm` to explicitly disable
- Fixed LLM section validation discarding sections from pages the LLM hadn't examined
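The default-on behaviour described for 1.0.25 can be pictured as a small decision function. This is a sketch of the stated rules, not the crate's actual code: the function names are hypothetical, and treating an empty key as unset is an assumption.

```rust
/// Decide whether LLM-enhanced processing should run.
/// `no_llm_flag` models the `--no-llm` option; `api_key` models the value
/// of `OPENAI_API_KEY`, if present.
fn llm_enabled(no_llm_flag: bool, api_key: Option<&str>) -> bool {
    if no_llm_flag {
        // An explicit opt-out always wins.
        return false;
    }
    // Assumption of this sketch: an empty or blank key counts as "not set".
    matches!(api_key, Some(key) if !key.trim().is_empty())
}

/// Same decision, reading the key from the process environment.
fn llm_enabled_from_env(no_llm_flag: bool) -> bool {
    let key = std::env::var("OPENAI_API_KEY").ok();
    llm_enabled(no_llm_flag, key.as_deref())
}
```

Keeping the environment lookup separate from the decision makes the rule testable without touching process-wide state.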
1.0.24
- Fixed body text loss in Nature-format and non-standard papers:
  - Added section detection fallback for papers without "Abstract" heading
  - Capped table detection regions at 50% of page area to reject false positives
- Improved math extraction accuracy:
  - Fixed LLM math extraction alignment bug
  - Reduced false positives and false negatives in heuristic math detection
  - Unified math output to LaTeX format inside `<math>` tags
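The 50% cap on table detection regions mentioned in 1.0.24 amounts to an area-ratio check. The sketch below illustrates the idea only: `Rect`, `is_plausible_table`, and the constant name are hypothetical, and whether a region of exactly 50% is accepted is an assumption.

```rust
/// Size of an axis-aligned box in page units (hypothetical type for this sketch).
#[derive(Debug, Clone, Copy)]
struct Rect {
    width: f64,
    height: f64,
}

impl Rect {
    fn area(&self) -> f64 {
        self.width * self.height
    }
}

/// Largest share of the page a detected table region may cover.
const MAX_TABLE_AREA_RATIO: f64 = 0.5;

/// Returns false for regions covering more than half the page, which are
/// more likely to be body text misdetected as a table.
fn is_plausible_table(region: Rect, page: Rect) -> bool {
    let page_area = page.area();
    if page_area <= 0.0 {
        // A degenerate page cannot contain a meaningful table.
        return false;
    }
    region.area() / page_area <= MAX_TABLE_AREA_RATIO
}
```

A candidate that fails the check would then be handled as ordinary text, which is how a cap like this can prevent the body text loss the release note describes.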
1.0.21
- Fixed panic-causing unwrap() calls with proper error handling.
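As a generic illustration of this kind of fix (not code taken from rsrpp), a call such as `raw.parse::<u32>().unwrap()` panics on malformed input, whereas returning a `Result` lets the caller decide what to do. The function names below are hypothetical.

```rust
use std::num::ParseIntError;

/// Parse a page number without panicking; malformed input becomes an `Err`.
fn parse_page_number(raw: &str) -> Result<u32, ParseIntError> {
    raw.trim().parse::<u32>()
}

/// Parse a (first, last) page pair, propagating the first error with `?`.
fn parse_page_range(first: &str, last: &str) -> Result<(u32, u32), ParseIntError> {
    Ok((parse_page_number(first)?, parse_page_number(last)?))
}
```

With this shape, a bad value surfaces as an error the caller can log or recover from, rather than aborting the whole parse.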
1.0.20
- Fixed Poppler 25.12.0 compatibility on macOS.
1.0.19
- Refactored `fix_suffix_hyphens` to support 31 compound word suffixes: `-based`, `-driven`, `-oriented`, `-aware`, `-agnostic`, `-independent`, `-dependent`, `-first`, `-native`, `-centric`, `-intensive`, `-bound`, `-safe`, `-free`, `-proof`, `-efficient`, `-optimized`, `-enabled`, `-powered`, `-ready`, `-capable`, `-compatible`, `-compliant`, `-level`, `-scale`, `-wide`, `-specific`, `-friendly`, `-facing`, `-like`, `-style`
- Added unit tests for suffix hyphenation functionality.
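The release notes do not describe what `fix_suffix_hyphens` does internally. Purely as an illustration, the sketch below assumes one plausible behaviour: when PDF extraction leaves a gap after a hyphen (for example `data- driven`), the two halves are rejoined only if the second half is one of the 31 listed suffixes. The function name `fix_suffix_hyphens_sketch` is hypothetical.

```rust
/// The 31 suffixes listed in the 1.0.19 release note, without the leading hyphen.
const SUFFIXES: [&str; 31] = [
    "based", "driven", "oriented", "aware", "agnostic", "independent",
    "dependent", "first", "native", "centric", "intensive", "bound", "safe",
    "free", "proof", "efficient", "optimized", "enabled", "powered", "ready",
    "capable", "compatible", "compliant", "level", "scale", "wide",
    "specific", "friendly", "facing", "like", "style",
];

/// Rejoin "word- suffix" pairs when the second word is a known suffix.
/// Note: this sketch also collapses every run of whitespace to one space.
fn fix_suffix_hyphens_sketch(text: &str) -> String {
    let words: Vec<&str> = text.split_whitespace().collect();
    let mut out = String::new();
    for (i, word) in words.iter().enumerate() {
        out.push_str(word);
        if i + 1 == words.len() {
            break;
        }
        // Ignore trailing punctuation on the next word when matching suffixes.
        let core = words[i + 1].trim_end_matches(|c: char| !c.is_alphanumeric());
        let join = word.len() > 1
            && word.ends_with('-')
            && SUFFIXES.contains(&core.to_lowercase().as_str());
        if !join {
            out.push(' ');
        }
    }
    out
}
```

Restricting the join to a fixed suffix list keeps the repair conservative: a dangling hyphen before any other word is left untouched.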
1.0.18
- Updated how section titles are extracted from PDFs.
1.0.17
- Restructured `rsrpp.parser`.
- Updated how section titles are extracted from PDFs.
- Updated tests.
1.0.16
- Removed `init_logger` from `rsrpp`.
1.0.15
- Fixed a typo.
- Introduced the `tracing` logger.
1.0.14
- Updated the `rsrpp` version for `rsrpp-cli`.
- Fixed a bug in the XML loop so that it finishes when the end of the file is reached.
1.0.10
- Added verbose mode.
- Fixed a bug in page number extraction.
1.0.9
- Updated: implemented new errors to handle invalid URLs.
1.0.8
- Update: The max retry time for saving PDF files has been increased.
1.0.7
- Fix bugs: After converting to PDF, the program now waits until processing is complete.
1.0.4
- Fixed bugs in `get_pdf_info`.
- Made minor improvements.
1.0.3
- Added cli -> rsrpp-cli.
1.0.2
- Updated the `Section` module. `content: String` was replaced by `content: Vec<TextBlock>`.
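To picture the change from `content: String` to `content: Vec<TextBlock>`, here is a minimal sketch. Only the `content: Vec<TextBlock>` field comes from the release note; the `title` and `text` fields and the `plain_text` helper are hypothetical, and the real types in rsrpp may carry more data.

```rust
/// One extracted run of text (the `text` field is an assumption of this sketch).
#[derive(Debug, Clone, PartialEq)]
struct TextBlock {
    text: String,
}

/// A paper section whose body is a list of blocks rather than one string.
#[derive(Debug, Clone, PartialEq)]
struct Section {
    title: String, // hypothetical field
    content: Vec<TextBlock>,
}

impl Section {
    /// Recover a single string, one block per line, for callers that
    /// relied on the old `content: String` shape.
    fn plain_text(&self) -> String {
        self.content
            .iter()
            .map(|block| block.text.as_str())
            .collect::<Vec<_>>()
            .join("\n")
    }
}
```

Keeping the blocks separate preserves boundaries that a flat string would lose, while a helper like `plain_text` offers a simple path for code written against the older layout.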