#parser #text-format #format #rich #text #rtf

rtf-parser

A Rust RTF parser & lexer library designed for speed and memory efficiency

14 releases

0.2.2 Apr 13, 2024
0.2.1 Apr 7, 2024
0.2.0 Mar 12, 2024
0.1.10 Mar 11, 2024
0.1.1 Oct 29, 2023

#308 in Parser implementations

368 downloads per month

MIT license

68KB
1.5K SLoC

A safe Rust RTF parser & lexer library designed for speed and memory efficiency, with no external dependencies. The official documentation is available at docs.rs/rtf-parser.

Installation

Install the library with cargo from the command line:

 cargo add rtf-parser

Or add rtf-parser = "<last-version>" under [dependencies] in your Cargo.toml.

Design

The library is split into 2 main components:

  1. The lexer
  2. The parser

The lexer scans the document and returns a Vec<Token> that represents the RTF file in a form that code can work with. These tokens can then be passed to the parser, which turns them into an actual document: an RtfDocument.

use std::error::Error;

use rtf_parser::lexer::Lexer;
use rtf_parser::tokens::Token;
use rtf_parser::parser::Parser;
use rtf_parser::document::RtfDocument;

fn main() -> Result<(), Box<dyn Error>> {
    // Lexing: turn the raw RTF source into tokens.
    let tokens: Vec<Token> = Lexer::scan("<rtf>")?;
    // Parsing: build the document from the tokens.
    let doc: RtfDocument = Parser::new(tokens).parse()?;
    Ok(())
}

Or, more concisely:

use std::error::Error;

use rtf_parser::document::RtfDocument;

fn main() -> Result<(), Box<dyn Error>> {
    // TryFrom runs both the lexer and the parser.
    let doc: RtfDocument = RtfDocument::try_from("<rtf>")?;
    Ok(())
}

The RtfDocument struct implements the TryFrom trait for:

  • &str
  • String
  • &mut std::fs::File

It also provides a from_filepath constructor that handles the I/O internally.
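
To make the lexer/parser split more concrete, here is a minimal, self-contained sketch of the lexing idea. ToyToken and toy_lex are hypothetical names invented for this illustration and are not the crate's API; the real Token type is richer, and this toy version only understands groups, control words and plain text.

```rust
/// Hypothetical, simplified token type for illustration only;
/// it is not the crate's `Token` type.
#[derive(Debug, PartialEq)]
enum ToyToken<'a> {
    OpenGroup,                         // `{`
    CloseGroup,                        // `}`
    ControlWord(&'a str, Option<i32>), // e.g. `\rtf1` -> ("rtf", Some(1))
    Text(&'a str),                     // plain text run
}

/// Scan a tiny subset of RTF (groups, control words, plain text).
/// Escapes such as `\{` or `\'hh` are deliberately not handled.
fn toy_lex(src: &str) -> Vec<ToyToken<'_>> {
    let bytes = src.as_bytes();
    let mut tokens = Vec::new();
    let mut i = 0;
    while i < bytes.len() {
        match bytes[i] {
            b'{' => {
                tokens.push(ToyToken::OpenGroup);
                i += 1;
            }
            b'}' => {
                tokens.push(ToyToken::CloseGroup);
                i += 1;
            }
            b'\\' => {
                // Control word: ASCII letters, then an optional signed integer.
                let name_start = i + 1;
                let mut j = name_start;
                while j < bytes.len() && bytes[j].is_ascii_alphabetic() {
                    j += 1;
                }
                let name_end = j;
                if j + 1 < bytes.len() && bytes[j] == b'-' && bytes[j + 1].is_ascii_digit() {
                    j += 1;
                }
                while j < bytes.len() && bytes[j].is_ascii_digit() {
                    j += 1;
                }
                let param = src[name_end..j].parse::<i32>().ok();
                tokens.push(ToyToken::ControlWord(&src[name_start..name_end], param));
                // A single space after a control word is a delimiter, not text.
                if j < bytes.len() && bytes[j] == b' ' {
                    j += 1;
                }
                i = j;
            }
            _ => {
                // Plain text runs until the next `{`, `}` or `\`.
                let start = i;
                while i < bytes.len() && !matches!(bytes[i], b'{' | b'}' | b'\\') {
                    i += 1;
                }
                tokens.push(ToyToken::Text(&src[start..i]));
            }
        }
    }
    tokens
}
```

Calling toy_lex on a small RTF snippet yields a flat token list, which is the kind of input a parser would then walk to track groups and build styled blocks. The tokens borrow from the source string, so no text is copied during lexing.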

The error returned can be a LexerError or a ParserError, depending on which phase failed.
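
The sketch below shows how two distinct error types can flow through `?` into a Box<dyn Error>, and how a caller can still tell which phase failed. ToyLexerError, ToyParserError, toy_scan, toy_parse, toy_load and failed_phase are hypothetical stand-ins written for this example; the crate's LexerError and ParserError have their own definitions, which are not reproduced here.

```rust
use std::error::Error;
use std::fmt;

// Hypothetical stand-ins for the crate's LexerError and ParserError.
#[derive(Debug)]
struct ToyLexerError(String);
#[derive(Debug)]
struct ToyParserError(String);

impl fmt::Display for ToyLexerError {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        write!(f, "lexer error: {}", self.0)
    }
}
impl fmt::Display for ToyParserError {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        write!(f, "parser error: {}", self.0)
    }
}
impl Error for ToyLexerError {}
impl Error for ToyParserError {}

// Stand-in lexing phase: fails on empty input.
fn toy_scan(src: &str) -> Result<Vec<&str>, ToyLexerError> {
    if src.is_empty() {
        return Err(ToyLexerError("empty input".into()));
    }
    Ok(src.split_whitespace().collect())
}

// Stand-in parsing phase: fails if the first token does not start with `{\rtf`.
fn toy_parse(tokens: Vec<&str>) -> Result<usize, ToyParserError> {
    match tokens.first() {
        Some(t) if t.starts_with("{\\rtf") => Ok(tokens.len()),
        _ => Err(ToyParserError("missing \\rtf header".into())),
    }
}

// Both error types convert into Box<dyn Error> through `?`.
fn toy_load(src: &str) -> Result<usize, Box<dyn Error>> {
    let tokens = toy_scan(src)?;
    Ok(toy_parse(tokens)?)
}

// The caller can still tell the phases apart by checking the concrete type.
fn failed_phase(err: &(dyn Error + 'static)) -> &'static str {
    if err.is::<ToyLexerError>() {
        "lexer"
    } else if err.is::<ToyParserError>() {
        "parser"
    } else {
        "unknown"
    }
}
```

Given an error `e` of type Box<dyn Error>, `failed_phase(&*e)` reports the phase. Returning Box<dyn Error> keeps the signature short; matching on the concrete error type is only needed when the two phases must be handled differently.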

An RtfDocument is composed of:

  • the header, which contains, among other things, the font table, the color table and the encoding.
  • the body, which is a Vec<StyledBlock>

A StyledBlock contains all the information about the formatting of a specific block of text.
It contains a Painter for the text style, a Paragraph for the layout, and the text (String). The Painter is defined below; how it is rendered is left to the user.

pub struct Painter {
    pub font_ref: FontRef,
    pub font_size: u16,
    pub bold: bool,
    pub italic: bool,
    pub underline: bool,
    pub superscript: bool,
    pub subscript: bool,
    pub smallcaps: bool,
    pub strike: bool,
}

The layout information is exposed in the paragraph property:

pub struct Paragraph {
    pub alignment: Alignment,
    pub spacing: Spacing,
    pub indent: Indentation,
    pub tab_width: i32,
}

It defines how a block is aligned, what spacing it uses, and so on.
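
Since rendering is left to the user, here is a minimal sketch of one possible renderer that maps the Painter flags to basic HTML tags. The struct below is a local mirror of the Painter fields listed above so that the example compiles on its own; `FontRef = u16` is an assumption made for this sketch, and to_html is a hypothetical helper, not part of the crate.

```rust
// Local mirror of the Painter fields listed above, so the sketch compiles
// without the crate. `FontRef = u16` is an assumption made for this example.
type FontRef = u16;

#[allow(dead_code)]
#[derive(Default)]
struct Painter {
    font_ref: FontRef,
    font_size: u16,
    bold: bool,
    italic: bool,
    underline: bool,
    superscript: bool,
    subscript: bool,
    smallcaps: bool,
    strike: bool,
}

/// Hypothetical helper: wrap `text` in basic HTML tags according to the
/// style flags. Font, size and small caps are ignored, and `text` is not
/// HTML-escaped.
fn to_html(painter: &Painter, text: &str) -> String {
    let tags = [
        (painter.bold, "b"),
        (painter.italic, "i"),
        (painter.underline, "u"),
        (painter.strike, "s"),
        (painter.superscript, "sup"),
        (painter.subscript, "sub"),
    ];
    let mut html = text.to_string();
    // Wrap from the innermost tag (last in the list) outwards,
    // so the first enabled tag ends up outermost.
    for &(enabled, tag) in tags.iter().rev() {
        if enabled {
            html = format!("<{0}>{1}</{0}>", tag, html);
        }
    }
    html
}
```

A block whose painter has bold and italic set would render as nested b and i tags around its text, while a default painter leaves the text unchanged. A real renderer would also need to resolve font_ref against the font table in the header and apply the Paragraph layout.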

You can also extract the text without any formatting information, using the to_text() method of the RtfDocument struct.

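use std::error::Error;

use rtf_parser::lexer::Lexer;
use rtf_parser::parser::Parser;

fn main() -> Result<(), Box<dyn Error>> {
    let rtf = r#"{\rtf1\ansi{\fonttbl\f0\fswiss Helvetica;}\f0\pard Voici du texte en {\b gras}.\par}"#;
    let tokens = Lexer::scan(rtf)?;
    // Parser::new only builds the parser; parse() produces the document.
    let document = Parser::new(tokens).parse()?;
    let text = document.to_text();
    assert_eq!(text, "Voici du texte en gras.");
    Ok(())
}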
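
Conceptually, plain-text extraction amounts to joining the text of every block in the body while ignoring the style information. The sketch below illustrates that idea with a hypothetical ToyBlock type and plain_text function; it is an assumption about the general approach, consistent with the expected output above, and not the crate's actual implementation.

```rust
// Hypothetical stand-in for a styled block; only the text matters here.
struct ToyBlock {
    text: String,
}

/// Join the text of every block in order, dropping all formatting.
fn plain_text(body: &[ToyBlock]) -> String {
    body.iter().map(|block| block.text.as_str()).collect()
}
```

Applied to the three blocks shown in the Examples section ("Voici du texte en ", "gras" and "."), this produces the same sentence as the assertion above.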

Examples

A complete example of RTF parsing is presented below:

use std::error::Error;

use rtf_parser::lexer::Lexer;
use rtf_parser::parser::Parser;

fn main() -> Result<(), Box<dyn Error>> {
    let rtf_text = r#"{ \rtf1\ansi{\fonttbl\f0\fswiss Helvetica;}\f0\pard Voici du texte en {\b gras}.\par }"#;
    let tokens = Lexer::scan(rtf_text)?;
    let doc = Parser::new(tokens).parse()?;
    assert_eq!(
        doc.header,
        RtfHeader {
            character_set: Ansi,
            color_table: ColorTable::Default(),
            font_table: FontTable::from([
                (0, Font { name: "Helvetica", character_set: 0, font_family: Swiss })
            ])
        }
    );
    assert_eq!(
        doc.body,
        [
            StyleBlock {
                painter: Painter { font_ref: 0, font_size: 0, bold: false, italic: false, underline: false },
                paragraph: Paragraph {
                    alignment: LeftAligned,
                    spacing: Spacing { before: 0, after: 0, between_line: Auto, line_multiplier: 0, },
                    indent: Indentation { left: 0, right: 0, first_line: 0, },
                    tab_width: 0,
                },
                text: "Voici du texte en ",
            },
            StyleBlock {
                painter: Painter { font_ref: 0, font_size: 0, bold: true, italic: false, underline: false },
                paragraph: Paragraph {
                    alignment: LeftAligned,
                    spacing: Spacing { before: 0, after: 0, between_line: Auto, line_multiplier: 0, },
                    indent: Indentation { left: 0, right: 0, first_line: 0, },
                    tab_width: 0,
                },
                text: "gras",
            },
            StyleBlock {
                painter: Painter { font_ref: 0, font_size: 0, bold: false, italic: false, underline: false },
                paragraph: Paragraph {
                    alignment: LeftAligned,
                    spacing: Spacing { before: 0, after: 0, between_line: Auto, line_multiplier: 0, },
                    indent: Indentation { left: 0, right: 0, first_line: 0, },
                    tab_width: 0,
                },
                text: ".",
            },
        ]
    );
    Ok(())
}

Benchmark

For now, there is no directly comparable crate to rtf-parser.
However, the rtf-grimoire crate provides a similar lexer. Here is a quick benchmark of lexing and parsing a 500 kB RTF document:

Crate                        Version   Duration
rtf-parser                   v0.2.2    30 ms
rtf-grimoire (only lexing)   v0.2.1    123 ms

This benchmark was run on an Intel MacBook Pro.
For rtf-parser, most of the compute time (65%) is spent in the lexing phase. There is still a lot of room for improvement.
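
A per-phase breakdown like the one above can be measured with std::time::Instant. The sketch below times two phases separately and computes the share of the first one; time_phases and lex_share are hypothetical helpers, and the closures passed to time_phases are placeholders for calls such as Lexer::scan and Parser::parse. It does not reproduce the author's actual benchmark setup.

```rust
use std::time::{Duration, Instant};

/// Hypothetical helper: run two phases in sequence and time each one.
/// `lex` produces an intermediate value that `parse` consumes.
fn time_phases<T, U>(
    lex: impl FnOnce() -> T,
    parse: impl FnOnce(T) -> U,
) -> (U, Duration, Duration) {
    let start = Instant::now();
    let tokens = lex();
    let lex_time = start.elapsed();

    let start = Instant::now();
    let output = parse(tokens);
    let parse_time = start.elapsed();

    (output, lex_time, parse_time)
}

/// Share of the total time spent in the first phase, as a percentage.
/// Returns 0.0 when both durations are zero.
fn lex_share(lex_time: Duration, parse_time: Duration) -> f64 {
    let total = lex_time + parse_time;
    if total.is_zero() {
        return 0.0;
    }
    100.0 * lex_time.as_secs_f64() / total.as_secs_f64()
}
```

A single run is noisy, so in practice the measurement should be repeated many times on the same input and averaged, ideally in a release build.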

Dependencies

~1–1.4MB
~35K SLoC