#compiler #parser-tools #parser #text-parser #macro-derive

no-std peggen

Generate recursive-descent & precedence climbing parsers

4 releases

0.2.6 Sep 22, 2024
0.2.2 Aug 31, 2024
0.2.1 Aug 21, 2024
0.2.0 Aug 16, 2024

#894 in Rust patterns

MIT/Apache

31KB
529 lines

Peggen

A parser generator for parsing expression grammar (PEG) that use inline macros to specify PEG operations.

How is it different from (...)?

/ Conceptual User Experience Performance Error Handling
PEST PEST only annotates text.
Peggen generates AST directly from your text.
In most cases, you still want rust enums for your AST, which is directly provided by Peggen, but you have to manually create enums from PEST rules. PEST use an optimizer to memorize your grammar rules, and use memorization for better performance; Peggen doesn't use memorization, arguably this gives better performance over memorization for most grammars. /
Chumsky Chumsky provides parser combinators. Peggen is a parser generator. Both Chumsky and Peggen provides ast directly. However, Peggen supports arena allocation. Chumsky deallocates successful sub-rules when a rule fails; Peggen uses a internal representation to eliminate deallocation. /
LALRPOP Peggen is PEG-based; LALRPOP uses LR(1) grammar. Peggen is more intuitive to use than LALRPOP; LR(1) grammar is hard to extend and debug. LALRPOP has better performance over Peggen. LR(1) grammar can report errors far away from normally percepted cause; Peggen allows you to capture errors from customary cause.

Performance

I roughly tested the peggen on a sample json file against chumsky.

CPU Model: Intel(R) Core(TM) i7-14700HX

Suprisingly, Peggen is faster than Chumsky.

Here are some numbers:

  • Peggen : 867913 ns/iter
  • Chumsky: 1555256 ns/iter

Example: Json Parser

You can write a json parser in the following several lines:

#[derive(Debug, ParseImpl, Space, Num, EnumAstImpl)]
pub enum Json {
    #[rule(r"null")]
    Null,
    #[rule(r"{0:`false|true`}")]
    Bool(bool),
    #[rule(r"{0:`-?(0|[1-9][0-9]*)\.([0-9]+)`}")]
    Flt(f32),
    #[rule("{0:`0|-?[1-9][0-9]*`}")]
    Num(i32),
    #[rule(r#""{0:`[^"]*`}""#)]
    Str(String),
    #[rule(r#"\{ [*0: "{0:`[^"]*`}" : {1} , ][?0: "{0:`[^"]*`}" : {1} ] \}"#)]
    Obj(RVec<(String, Json)>),
    #[rule(r"\[ [*0: {0} , ][?0: {0} ] \]")]
    Arr(RVec<Json>)
}

Roadmap

  • Optimizations:
    • Rule dispatch: filter rules by the first symbol, instead of trying each of them.
    • Thinner tag: currently each tag in internal representation is 3-pointers wide, I want to make them thinner.
  • Error Handling:
    • Custom error handlers when error handlers fail.

Dependencies

~4–12MB
~135K SLoC