#parser-generator #grammar #expression #recursion #language #error #white-space

app langlang

langlang is a parser generator based on Parsing Expression Grammars (tools)

3 releases

0.1.2 Oct 2, 2023
0.1.1 Oct 2, 2023
0.1.0 Dec 29, 2022

#114 in Parser tooling

GPL-3.0-or-later

155KB
3K SLoC

Table of Contents

  1. Introduction
    1. Project Status
    2. Currently supported output languages
      1. Notes
    3. Basic Usage
  2. Input Language
    1. Productions and Expressions
    2. Terminals
    3. Non-Terminals
    4. Expression Composition
      1. Ordered Choice
      2. Predicates (Not/And)
      3. Repetition ({Zero,One} Or More)
      4. Lexification
      5. Error reporting with Labels
      6. Import system
  3. Generator Options
    1. Go
  4. Roadmap

Introduction

Bring your own grammar and get a feature rich parser generated for different languages. The are reasons why you might want to use this:

  • Concise input grammar format and intuitive algorithm: generates recursive top-down parsers based on Parsing Expression Grammars
  • Automatic handling of white spaces, making grammars less cluttered
  • Error reporting with custom messages via failure labels
  • Partial support for declaring error recovery rules, which allow incremental parsing that returns an output tree even upon multiple parsing errors.

Project Status

  • We're not 1.0 yet, so the API is not stable, which means that data structure shapes might change, and/or behavihor might change, drastically, and without much notice.
  • Don't submit pull requests without opening an issue and discussing your idea. We will take the slow approach aiming at great design first, then being stable, then being featureful.

Currently supported output languages

  • Rust¹
  • Go Lang²
  • Python
  • Java Script
  • Write your own code generator

Notes

  1. Rust support is based on a virtual machine as its runtime, and not on a generated parser. That is unlikely to change because our tooling will be built in Rust, so we need more flexibility on the "host implementation". We may give an option to also generate a parser in Rust code that doesn't depend on any libraries, if it provides any value. But such work isn't planned as of right now.

  2. We're in the middle of dropping the Go implementation in favor of a generating a Go parser from the code written in Rust. Prototyping a few features, like the import system and automatic space handling, was very useful, but once the refactoring of the Rust implementation is in a good place, the Rust version will be better as it will make it easier to generate parsers for other languages than Rust and Go.

Basic Usage

If you just want to test the waters, point the command line utility at a grammar and pick a starting rule:

cargo run --bin langlang run --grammar-file grammars/json.peg --start-rule JSON

That will drop you into an initeractive shell that allows you to try out different input expressions.

Take a look at other examples at the directory grammars in the root of the repository. It contains a grammar library for commonly used input formats.

Input Language

Productions and Expressions

The input grammar is as simple as it can get. It builds off of the original PEG format, and other features are added conservatively. Take the following input as an example:

Production <- Expression

At the left side of the arrow there is an identifier and on the right side, there is an expression. These two together are called either productions or (parsing) rules. Let's go over how to compose them. If you've ever seen or used regular expressions, you've got a head start.

Terminals

  • Any: matches any character, and only errors if it reaches the end of the input. e.g.: .

  • Literal: anything around quotes (single and double quotes are the same). e.g.: 'x'

  • Class and Range: classes may contain either ranges or single characters. e.g.: [0-9], [a-zA-Z], [a-f0-9_]. This last example contains two ranges (a-f and 0-9) and one single char (_). It means match either one of these. e.g.: [a-cA-C] is translated to 'a' / 'b' / 'c' / 'A' / 'B' / 'C'.

Non-Terminals

The biggest addition of this type of grammar on top of regular expressions is the ability to define and recursively call productions. Here's a grammar snippet for parsing numbers:

Signed   <- ('-' / '+') Signed / Decimal
Decimal  <- ([1-9][0-9]*) / '0'

The topmost production Signed calls itself or the production Decimal. It allows parsing signed and unsigned numbers recursively. (e.g.: +-+--1 and so forth would be accepted).

Expression Composition

The following operators can be used on both Terminals and Non-Terminals, on top of parenthesized expressions:

operator example comment
ordered choice e1 / e2  
not predicate !e  
and predicate &e sugar for !!e
zero or more e*  
one or more e+ sugar for ee*
optional e? sugar for &ee / !e
lexification #e  
label e^label sugar for e/throw(label)

Ordered Choice

This operator tries expressions one at a time, from left to right, and stops at the first one to succeed. Or error if no alternatives work. E.g.:

SomeDigits <- '0' / '1' / '2' / '3' / '4'

Passing 6 to the above expression will generate an error.

Predicates (Not/And)

Predicates are the mechanism that allows unlimited look ahead, as they do not consume any input. e.g.:

BracketString <- "[" (!"]" .)* "]"

In the above example, the any expression isn't evaluated if the parser finds the closing square bracket.

The and predicate (&) is just syntactical sugar for !!.

Repetition ({Zero,One} Or More)

  • Zero Or More: it never fails, as it can match its expression at least zero times.

  • One Or More is the syntax sugar for calling the expression once, followed by applying zero or more to the same expression. It can fail at the first time it matches the expression.

  • Optional it will match an expression zero or one time.

Lexification

By default, the generated parsers emit code to consume whitespaces automatically before each item within a sequence of a production that's considered not syntactic. Productions are considered syntactic if all their expressions are syntactic. Expressions are considered syntactic if their output tree is composed only of terminal matches. If there's any path to a non-terminal match, the entire expression, and production are considered non syntactic. e.g.:

NotSyntactic <- Syntactic "!"
Syntactic    <- "a" "b" "c"

In the above example, there is no automatic space consumption injected before the items of the sequence expression "a" "b" "c" as all of them are terminals. And the NotSyntactic production contains non terminal calls, which makes it non-syntactic. Therefore, automatic space handling will be enabled for NotSyntactic and disabled for Syntactic

For disabling automatic space handling of an expression, prefix it with the lexification operator #. e.g.:

Ordinal <- Decimal #('st' / 'nd' / 'rd' / 'th')^ord
Decimal <- ([1-9][0-9]*) / '0'

In the above expression, Decimal is considered syntactic, which disables automatic space handling. Ordinal is not syntactic because it calls out to another production with a non-terminal. So, automatic space handling is enabled for that production. However, between the non-terminal and the choice with terminals, space handling is disabled. This is what is expected

Input Result
" 3rd" succeeds
"50th" succeeds
"2 0th" fails
"2 th" fails

The first input succeeds because space consumption is automatically added to the left of the call to the non terminal Decimal, as Ordinal is not syntactic. But, because the expression that follows the non terminal is marked with the lexification operator, automatic space handling won't be injected between the call to the non terminal and the ordered choice with the syntactic suffixed st, nd, rd, and th.

Here is maybe the most classic example of where lexification is needed: Non-Syntactic String Literals. Which uses eager look ahead and spaces are significant. e.g.:

SyntacticStringLiteral     <- '"' (!'"' .) '"'
NonSyntacticStringLiteral  <- DQ #((!DQ .)  DQ)

Without using the lexification operator on the rule NonSyntacticStringLiteral, it would eat up the spaces after the first quote, which can be undesired for string fields.

The rule SyntacticStringLiteral doesn't need the lexification operator because all of its sub-expressions are terminals, therefore the rule is syntactic and space consumption won't be generated by default anyway.

There are definitely more use-cases of the lexification operator out there, these are just the common ones.

Error reporting with Labels

Import system

Productions of one grammar can be imported from another one. That allows reusing rules and delivering more consolidate grammar files and more powerful parser generated at the end.

// file player.peg
@import AddrSpec from "./rfc5322.peg"

Player <- "Name:" Name "," "Score:" Number "," "Email:" AddrSpec
Name   <- [a-zA-Z ]+
Number <- [0-9]+
// ... elided for simplicity

// file rfc5322.peg
// https://datatracker.ietf.org/doc/html/rfc5322#section-3.4.1

// ... elided for simplicity
AddrSpec  <- LocalPart "@" Domain
LocalPart <- DotAtom / QuotedString / ObsLocalPart
Domain    <- DotAtom / DomainLiteral / ObsDomain
// ... elided for simplicity

The above example illustrates that a rather complete email parser can be used in other grammars using imports. Behind the scenes, the AddrSpec rule and all its dependencies have been merged into the player.peg grammar.

Generator Options

Go

The Go code generator provides the following additional knobs to the command line:

  • --go-package: allows customizing what goes in the package directive that starts each Go file.

  • --go-prefix: allows customizing structs generated prefixing what is passed to this option. This is especially useful if there are two grammars to be parsed in the same package. At least one will need a prefix, so the generic Parser name doesn't collide. e.g.: -go-prefix Tiny would generate a TinyParser struct, a NewTinyParser constructor, etc.

Roadmap

  • MID: gengo rewrite Go generator in Rust
  • MID: [genall] generator interface to be shared by all targets
  • SML: gengo memoize results to guarantee O(1) parsing time
  • SML: gengo allocate output nodes in an arena
  • MID: genpy Python Code Generator: Start from scratch
  • MID: genjs Java Script Code Generator
  • MID: gengo explore generating Go ASM code instead of text
  • MID: Display Call Graph for debugging purposes
  • BIG: Bootstrap off hand written parser, so grammar writters can take advantage of the features baked into the parser generator

Dependencies

~4.5–6.5MB
~111K SLoC