14 releases

0.2.2	Apr 13, 2024
0.2.1	Apr 7, 2024
0.2.0	Mar 12, 2024
0.1.10	Mar 11, 2024
0.1.1	Oct 29, 2023

#308 in Parser implementations

368 downloads per month

MIT license

68KB
1.5K SLoC

rtf-parser

A safe Rust RTF parser & lexer library designed for speed and memory efficiency, with no external dependencies. The official documentation is available at docs.rs/rtf-parser.

Installation

This library can be installed with cargo from the command line:

 cargo add rtf-parser

Or by adding rtf-parser = "<last-version>" under [dependencies] in your Cargo.toml.

Design

The library is split into 2 main components:

The lexer
The parser

The lexer scans the document and returns a Vec<Token> that represents the RTF file in a form that code can work with. These tokens can then be passed to the parser, which turns them into an actual document: RtfDocument.

use std::error::Error;

use rtf_parser::lexer::Lexer;
use rtf_parser::tokens::Token;
use rtf_parser::parser::Parser;
use rtf_parser::document::RtfDocument;

fn main() -> Result<(), Box<dyn Error>> {
    // Lexing: turn the raw RTF source into tokens.
    let tokens: Vec<Token> = Lexer::scan("<rtf>")?;
    // Parsing: turn the tokens into a document.
    let parser = Parser::new(tokens);
    let doc: RtfDocument = parser.parse()?;
    Ok(())
}

Or, more concisely:

use std::error::Error;

use rtf_parser::document::RtfDocument;

fn main() -> Result<(), Box<dyn Error>> {
    let doc: RtfDocument = RtfDocument::try_from("<rtf>")?;
    Ok(())
}
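
To make the lexer step more concrete, here is a minimal sketch of what lexing a tiny RTF subset could look like. The Token enum and the scan function below are hypothetical stand-ins written for illustration; they are not the rtf-parser types, and the sketch ignores control symbols, escapes and encodings that a real lexer has to handle.

```rust
/// Hypothetical token type for illustration (not rtf_parser::tokens::Token).
#[derive(Debug, PartialEq)]
pub enum Token {
    OpenGroup,                        // `{`
    CloseGroup,                       // `}`
    ControlWord(String, Option<i32>), // e.g. `\b`, `\f0`, `\fs24`
    Text(String),                     // a run of plain text
}

/// Splits a small RTF subset into tokens (hypothetical helper).
pub fn scan(src: &str) -> Vec<Token> {
    let chars: Vec<char> = src.chars().collect();
    let mut tokens = Vec::new();
    let mut i = 0;
    while i < chars.len() {
        match chars[i] {
            '{' => {
                tokens.push(Token::OpenGroup);
                i += 1;
            }
            '}' => {
                tokens.push(Token::CloseGroup);
                i += 1;
            }
            '\\' => {
                i += 1;
                // Control word name: a run of ASCII letters.
                let name_start = i;
                while i < chars.len() && chars[i].is_ascii_alphabetic() {
                    i += 1;
                }
                let name: String = chars[name_start..i].iter().collect();
                // Optional numeric parameter, possibly negative.
                let digits_start = i;
                if i < chars.len() && chars[i] == '-' {
                    i += 1;
                }
                while i < chars.len() && chars[i].is_ascii_digit() {
                    i += 1;
                }
                let digits: String = chars[digits_start..i].iter().collect();
                let param: Option<i32> = digits.parse().ok();
                if param.is_none() {
                    // No valid number: give the characters back to the text.
                    i = digits_start;
                }
                // A single space after a control word is a delimiter, not text.
                if i < chars.len() && chars[i] == ' ' {
                    i += 1;
                }
                tokens.push(Token::ControlWord(name, param));
            }
            _ => {
                // Plain text runs until the next brace or backslash.
                let start = i;
                while i < chars.len() && !matches!(chars[i], '{' | '}' | '\\') {
                    i += 1;
                }
                tokens.push(Token::Text(chars[start..i].iter().collect()));
            }
        }
    }
    tokens
}
```

A parser would then walk such a token list, tracking group nesting and the current formatting state, to build the document.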

The RtfDocument struct implements the TryFrom trait for:

&str
String
&mut std::fs::File

It also provides a from_filepath constructor that handles the I/O internally.

The error returned can be a LexerError or a ParserError, depending on which phase failed.
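
The pattern of one document type built from several input types, with an error that records which phase failed, can be sketched with the standard library alone. Doc, DocError and describe below are hypothetical names used only for this illustration, and the "lexing" and "parsing" checks are deliberately trivial; the crate's real error types are LexerError and ParserError, whose exact shape is not shown here.

```rust
use std::convert::TryFrom;
use std::fmt;

/// Hypothetical error with one variant per phase.
#[derive(Debug, PartialEq)]
pub enum DocError {
    Lexer(String),
    Parser(String),
}

impl fmt::Display for DocError {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        match self {
            DocError::Lexer(msg) => write!(f, "lexing failed: {}", msg),
            DocError::Parser(msg) => write!(f, "parsing failed: {}", msg),
        }
    }
}

impl std::error::Error for DocError {}

/// Hypothetical document type holding only the group content.
#[derive(Debug, PartialEq)]
pub struct Doc {
    pub text: String,
}

impl TryFrom<&str> for Doc {
    type Error = DocError;

    fn try_from(src: &str) -> Result<Self, Self::Error> {
        // Stand-in for the lexing phase: the source must open a group.
        if !src.starts_with('{') {
            return Err(DocError::Lexer("expected '{'".to_string()));
        }
        // Stand-in for the parsing phase: the group must be closed.
        if !src.ends_with('}') {
            return Err(DocError::Parser("unclosed group".to_string()));
        }
        Ok(Doc { text: src[1..src.len() - 1].to_string() })
    }
}

impl TryFrom<String> for Doc {
    type Error = DocError;

    fn try_from(src: String) -> Result<Self, Self::Error> {
        // Owned strings reuse the &str implementation.
        Doc::try_from(src.as_str())
    }
}

/// Shows how a caller can tell the two failure phases apart.
pub fn describe(src: &str) -> String {
    match Doc::try_from(src) {
        Ok(doc) => format!("parsed: {}", doc.text),
        Err(err) => err.to_string(),
    }
}
```

Because the error type implements std::error::Error, it also works with the ? operator in a function returning Result<(), Box<dyn Error>>, as in the examples above.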

An RtfDocument is composed of:

the header, containing among others the font table, the color table and the encoding.
the body, which is a Vec<StyleBlock>

A StyleBlock contains all the formatting information for a specific block of text.
It contains a Painter for the text style, a Paragraph for the layout, and the text itself (String). The Painter is defined below; how it is rendered is left to the user.

pub struct Painter {
    pub font_ref: FontRef,
    pub font_size: u16,
    pub bold: bool,
    pub italic: bool,
    pub underline: bool,
    pub superscript: bool,
    pub subscript: bool,
    pub smallcaps: bool,
    pub strike: bool,
}
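
Since rendering is left to the user, here is one possible sketch of a renderer that turns a Painter and its text into HTML. The Painter below is a local copy of the fields shown above so that the example compiles on its own; FontRef is assumed to be a small integer key, and to_html is a hypothetical helper, not part of the crate. Font, size and small caps are ignored, and the text is not HTML-escaped.

```rust
/// Assumption: a font reference is an integer key into the font table.
pub type FontRef = u16;

/// Local copy of the Painter fields, for illustration only.
#[derive(Debug, Clone, Default)]
pub struct Painter {
    pub font_ref: FontRef,
    pub font_size: u16,
    pub bold: bool,
    pub italic: bool,
    pub underline: bool,
    pub superscript: bool,
    pub subscript: bool,
    pub smallcaps: bool,
    pub strike: bool,
}

/// Wraps `text` in one HTML tag per enabled style flag (hypothetical helper).
pub fn to_html(painter: &Painter, text: &str) -> String {
    let flags = [
        (painter.bold, "b"),
        (painter.italic, "i"),
        (painter.underline, "u"),
        (painter.strike, "s"),
        (painter.superscript, "sup"),
        (painter.subscript, "sub"),
    ];
    let mut out = text.to_string();
    // Earlier flags end up innermost, later flags wrap around them.
    for &(enabled, tag) in flags.iter() {
        if enabled {
            out = format!("<{0}>{1}</{0}>", tag, out);
        }
    }
    out
}
```

A terminal or GUI renderer would follow the same shape: read the flags, then map each one to whatever styling primitive the target offers.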

The layout information is exposed in the paragraph property:

pub struct Paragraph {
    pub alignment: Alignment,
    pub spacing: Spacing,
    pub indent: Indentation,
    pub tab_width: i32,
}

It defines how a block is aligned, what spacing it uses, and so on.
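
As an illustration of how a consumer might use this layout data, the sketch below maps an alignment and an indentation to a CSS declaration string. Everything here is an assumption made for the example: the types are local copies, the enum variants other than LeftAligned are guesses, and the indentation values are assumed to be raw RTF twips (1/20 of a point). Check the crate documentation before relying on any of this.

```rust
/// Local stand-in; only LeftAligned is confirmed by this README.
#[derive(Debug, Clone, Copy, PartialEq)]
pub enum Alignment {
    LeftAligned,
    RightAligned,
    Center,
    Justify,
}

/// Local stand-in with the three fields shown in this README.
#[derive(Debug, Clone, Copy)]
pub struct Indentation {
    pub left: i32,
    pub right: i32,
    pub first_line: i32,
}

/// RTF measures lengths in twips: 20 twips make one point.
const TWIPS_PER_POINT: f32 = 20.0;

/// Converts a length in twips to points (assumes the input is in twips).
pub fn twips_to_points(twips: i32) -> f32 {
    twips as f32 / TWIPS_PER_POINT
}

/// Builds a CSS declaration string for a paragraph (hypothetical helper).
pub fn paragraph_css(alignment: Alignment, indent: Indentation) -> String {
    let align = match alignment {
        Alignment::LeftAligned => "left",
        Alignment::RightAligned => "right",
        Alignment::Center => "center",
        Alignment::Justify => "justify",
    };
    format!(
        "text-align: {}; margin-left: {}pt; margin-right: {}pt; text-indent: {}pt;",
        align,
        twips_to_points(indent.left),
        twips_to_points(indent.right),
        twips_to_points(indent.first_line)
    )
}
```

For instance, a left indent of 720 twips corresponds to 36 points, which is half an inch.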

You can also extract the text without any formatting information, using the to_text() method of the RtfDocument struct.

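use std::error::Error;
use rtf_parser::lexer::Lexer;
use rtf_parser::parser::Parser;

fn main() -> Result<(), Box<dyn Error>> {
    let rtf = r#"{\rtf1\ansi{\fonttbl\f0\fswiss Helvetica;}\f0\pard Voici du texte en {\b gras}.\par}"#;
    let tokens = Lexer::scan(rtf)?;
    // Parser::new only builds the parser; parse() produces the document.
    let document = Parser::new(tokens).parse()?;
    let text = document.to_text();
    assert_eq!(text, "Voici du texte en gras.");
    Ok(())
}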
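
Conceptually, plain-text extraction amounts to concatenating the text of every block while dropping the style data. The sketch below shows that idea with a hypothetical Block type and a free function; it is not the crate's implementation, which may treat paragraph breaks or special characters differently.

```rust
/// Hypothetical block: only the text matters for plain-text extraction.
pub struct Block {
    pub text: String,
}

/// Joins the text of all blocks in order, ignoring any formatting.
pub fn to_text(body: &[Block]) -> String {
    body.iter().map(|block| block.text.as_str()).collect()
}
```

Applied to three blocks holding "Voici du texte en ", "gras" and ".", this returns the full sentence as a single string.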

Examples

A complete example of RTF parsing is presented below:

use std::error::Error;

use rtf_parser::lexer::Lexer;
use rtf_parser::parser::Parser;

fn main() -> Result<(), Box<dyn Error>> {
    let rtf_text = r#"{ \rtf1\ansi{\fonttbl\f0\fswiss Helvetica;}\f0\pard Voici du texte en {\b gras}.\par }"#;
    let tokens = Lexer::scan(rtf_text)?;
    let doc = Parser::new(tokens).parse()?;
    assert_eq!(
        doc.header,
        RtfHeader {
            character_set: Ansi,
            color_table: ColorTable::Default(),
            font_table: FontTable::from([
                (0, Font { name: "Helvetica", character_set: 0, font_family: Swiss })
            ])
        }
    );
    assert_eq!(
        doc.body,
        [
            StyleBlock {
                painter: Painter { font_ref: 0, font_size: 0, bold: false, italic: false, underline: false },
                paragraph: Paragraph {
                    alignment: LeftAligned,
                    spacing: Spacing { before: 0, after: 0, between_line: Auto, line_multiplier: 0, },
                    indent: Indentation { left: 0, right: 0, first_line: 0, },
                    tab_width: 0,
                },
                text: "Voici du texte en ",
            },
            StyleBlock {
                painter: Painter { font_ref: 0, font_size: 0, bold: true, italic: false, underline: false },
                paragraph: Paragraph {
                    alignment: LeftAligned,
                    spacing: Spacing { before: 0, after: 0, between_line: Auto, line_multiplier: 0, },
                    indent: Indentation { left: 0, right: 0, first_line: 0, },
                    tab_width: 0,
                },
                text: "gras",
            },
            StyleBlock {
                painter: Painter { font_ref: 0, font_size: 0, bold: false, italic: false, underline: false },
                paragraph: Paragraph {
                    alignment: LeftAligned,
                    spacing: Spacing { before: 0, after: 0, between_line: Auto, line_multiplier: 0, },
                    indent: Indentation { left: 0, right: 0, first_line: 0, },
                    tab_width: 0,
                },
                text: ".",
            },
        ]
    );
    Ok(())
}

Benchmark

For now, there are no comparable crates to rtf-parser.
However, the rtf-grimoire crate provides a similar lexer. Here is a quick benchmark of the lexing and parsing of a 500 kB RTF document.

Crate	Version	Duration
rtf-parser (lexing and parsing)	v0.2.2	30 ms
rtf-grimoire (lexing only)	v0.2.1	123 ms

This benchmark was run on an Intel MacBook Pro.
For rtf-parser, most of the compute time (65%) is spent in the lexing process. There is still a lot of room for improvement.
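
For readers who want to reproduce this kind of measurement, the following is a minimal timing sketch using only the standard library. The time_it and lexing_share helpers are hypothetical names; the README does not say how the figures above were obtained. A single wall-clock measurement like this is noisy, so a real benchmark would repeat the run many times.

```rust
use std::time::{Duration, Instant};

/// Runs `f` once and returns its result with the elapsed wall-clock time.
pub fn time_it<T, F: FnOnce() -> T>(f: F) -> (T, Duration) {
    let start = Instant::now();
    let value = f();
    (value, start.elapsed())
}

/// Percentage of the total time spent lexing.
/// The result is not meaningful when `total` is zero.
pub fn lexing_share(lexing: Duration, total: Duration) -> f64 {
    lexing.as_secs_f64() / total.as_secs_f64() * 100.0
}
```

In practice, the closure passed to time_it would wrap the lexing call and then the parsing call, and the two durations would be fed to lexing_share.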

Dependencies

~1–1.4MB
~35K SLoC