77 stable releases
new 3.41.0 | May 6, 2025 |
---|---|
3.37.0 |
|
3.25.1 | Oct 26, 2024 |
2.1.5 | Aug 12, 2024 |
0.3.2 | Aug 7, 2024 |
rusty_lr
A Bison-like parser generator for Rust supporting LR(1), LALR(1), and GLR parsing strategies.
RustyLR enables you to define context-free grammars directly in Rust. Inspired by tools like yacc and bison, it uses a similar syntax while integrating seamlessly with Rust's ecosystem. It constructs an optimized state machine, ensuring efficient and reliable parsing.
Number of terminal symbols reduced to 32 (from 0x10FFFF!) by optimization
Features
- Custom Reduce Actions: Define custom actions in Rust, allowing you to build custom data structures easily.
- Automatic Optimization: Reduces parser table size and improves performance by grouping terminals with identical behavior across parser states.
- Multiple Parsing Strategies: Supports LR(1), LALR(1), and GLR parsers.
- Detailed Diagnostics: Reports grammar conflicts, verbose conflict-resolution stages, and optimization stages.
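The terminal-grouping idea behind the automatic optimization can be sketched in a few lines. This is only an illustration under assumptions, not RustyLR's actual implementation: the `group_terminals` function and the numeric action codes are hypothetical. Two terminals are placed in the same class when every parser state treats them identically.

```rust
use std::collections::HashMap;

/// Groups terminals whose columns in a (hypothetical) action table are identical.
/// `table[state][terminal]` holds an action code, with 0 meaning "error".
/// Returns the class index of each terminal and the number of classes.
fn group_terminals(table: &[Vec<u32>], num_terminals: usize) -> (Vec<usize>, usize) {
    let mut class_of_column: HashMap<Vec<u32>, usize> = HashMap::new();
    let mut classes = Vec::with_capacity(num_terminals);
    for t in 0..num_terminals {
        // The "behavior" of a terminal is its action in every state.
        let column: Vec<u32> = table.iter().map(|row| row[t]).collect();
        let next_id = class_of_column.len();
        let id = *class_of_column.entry(column).or_insert(next_id);
        classes.push(id);
    }
    let count = class_of_column.len();
    (classes, count)
}
```

With a `char` token type, most code points behave the same in every state, so the number of classes can be far smaller than the number of possible terminals.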
Installation
Add RustyLR to your Cargo.toml:
[dependencies]
rusty_lr = "..."
To use buildscript tools:
[build-dependencies]
rusty_lr = { version = "...", features = ["build"] }
Or, to use the executable version (optional):
cargo install rustylr
rusty_lr is designed for use with auto-generated code, either through the lr1! macro (default), a build script (with the build feature), or the rustylr executable.
When using a build script or the executable, you get detailed, readable diagnostic messages generated from your grammar.
Quick Start
Using Procedural Macros
Define your grammar using the lr1! macro:
// this defines the `EParser` struct
// where `E` is the start symbol
lr1! {
    %userdata i32;           // userdata type passed to parser
    %tokentype char;         // token type; a sequence of `tokentype` is fed to the parser
    %start E;                // start symbol; this is the final value of the parser
    %eof '\0';               // eof token; this token is used to finish parsing

    // left reduction for '+' and '*'
    %left '+';
    %left '*';
    // operator precedence '*' > '+'

    // ================= Production rules =================
    Digit(char): ['0'-'9'];  // character set '0' to '9'

    Number(i32)              // production rule `Number` holds an `i32` value
        : ' '* Digit+ ' '*   // `Number` is one or more `Digit` surrounded by zero or more spaces
        { Digit.into_iter().collect::<String>().parse().unwrap() }; // this will be the value of `Number` (i32) by this production rule

    E(f32): E '*' e2=E { E * e2 }
          | E '+' e2=E {
              *data += 1;                       // access userdata by `data`
              println!( "{:?} {:?}", E, e2 );   // any Rust code can be written here
              E + e2                            // this will be the value of `E` (f32) by this production rule
          }
          | Number { Number as f32 }            // Number is `i32`, so cast to `f32`
          ;
}
This defines a simple arithmetic expression parser.
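To make the effect of the `%left` and precedence declarations concrete, here is a small hand-written evaluator for the same expression language. It is a sketch that uses precedence climbing rather than the LR tables RustyLR generates, and every name in it (`precedence`, `parse_number`, `parse_expr`, `evaluate`) is hypothetical. It only shows which parse the declarations are meant to select: `*` binds tighter than `+`, and both associate to the left.

```rust
/// Binding power of an operator; higher binds tighter ('*' > '+').
fn precedence(op: char) -> Option<u8> {
    match op {
        '+' => Some(1),
        '*' => Some(2),
        _ => None,
    }
}

/// Parses `' '* Digit+ ' '*` and returns the number as `f32`.
fn parse_number(chars: &[char], pos: &mut usize) -> Option<f32> {
    while chars.get(*pos) == Some(&' ') {
        *pos += 1;
    }
    let start = *pos;
    while chars.get(*pos).map_or(false, |c| c.is_ascii_digit()) {
        *pos += 1;
    }
    if start == *pos {
        return None; // no digits found
    }
    let value: i32 = chars[start..*pos].iter().collect::<String>().parse().ok()?;
    while chars.get(*pos) == Some(&' ') {
        *pos += 1;
    }
    Some(value as f32)
}

/// Parses an expression whose operators all have precedence >= `min_prec`.
fn parse_expr(chars: &[char], pos: &mut usize, min_prec: u8) -> Option<f32> {
    let mut lhs = parse_number(chars, pos)?;
    while let Some(&op) = chars.get(*pos) {
        let prec = match precedence(op) {
            Some(p) if p >= min_prec => p,
            _ => break,
        };
        *pos += 1;
        // Left associativity: the right operand may only contain tighter operators.
        let rhs = parse_expr(chars, pos, prec + 1)?;
        lhs = if op == '*' { lhs * rhs } else { lhs + rhs };
    }
    Some(lhs)
}

/// Evaluates a whole input string, or returns `None` on a syntax error.
fn evaluate(input: &str) -> Option<f32> {
    let chars: Vec<char> = input.chars().collect();
    let mut pos = 0;
    let value = parse_expr(&chars, &mut pos, 1)?;
    if pos == chars.len() { Some(value) } else { None }
}
```

For example, `evaluate("1 + 2 * 3")` groups the input as `1 + (2 * 3)`, which is the grouping the grammar above selects by ranking `*` over `+`.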
Using Build Script
For complex grammars, you can use a build script to generate the parser. This will provide more detailed error messages when conflicts occur.
1. Create a grammar file (e.g., src/parser.rs) with the following content:
// Rust code of `use` and type definitions
%% // start of grammar definition
%tokentype u8;
%start E;
%eof b'\0';
E: b'(' E b')'
| a;
...
2. Set up build.rs:
// build.rs
use rusty_lr::build;

fn main() {
    println!("cargo::rerun-if-changed=src/parser.rs");
    let output = format!("{}/parser.rs", std::env::var("OUT_DIR").unwrap());
    build::Builder::new()
        .file("src/parser.rs") // path to the input file
        .build(&output);       // path to the output file
}
3. Include the generated source code:
include!(concat!(env!("OUT_DIR"), "/parser.rs"));
4. Use the parser in your code:
let mut parser = parser::EParser::new();   // create the <StartSymbol>Parser struct
let mut context = parser::EContext::new(); // create the <StartSymbol>Context struct
let mut userdata: i32 = 0;
for b in input.chars() {
    match context.feed(&parser, b, &mut userdata) {
        Ok(_) => {}
        Err(e) => {
            eprintln!("error: {}", e);
            return;
        }
    }
}
println!("{:?}", context);
context.feed(&parser, 0 as char, &mut userdata).unwrap(); // feed EOF
let result: i32 = context.accept(); // get the value of the start symbol `E`
Using the rustylr Executable
cargo install rustylr
rustylr parser.rs output.rs
See Executable for more details.
The generated code will include several structs and enums:
- <Start>Parser: A struct that holds the whole parser table. (docs-LR) (docs-GLR)
- <Start>Context: A struct that maintains the current state and the values associated with each symbol. (docs-LR) (docs-GLR)
- <Start>State: A type representing a single parser state and its associated table.
- <Start>Rule: A type representing a single production rule. (docs)
- <Start>NonTerminals: An enum representing all non-terminal symbols in the grammar. (docs)
You can get useful information from the <Start>NonTerminals enum.
let non_terminal: <Start>NonTerminals = ...;
non_terminal.is_auto_generated(); // true if this non-terminal is auto-generated
non_terminal.is_trace(); // if this non-terminal is marked with `%trace`
You can also get contextual information from the <Start>Context struct.
let mut context = <Start>Context::new();
// ... parsing ...
context.expected(); // get expected terminal symbols
context.expected_nonterm(); // get expected non-terminal symbols
context.can_feed( term ); // check if a terminal symbol can be fed
context.trace(); // get all `%trace` non-terminals that are currently being parsed
The generated code will also include a feed method that takes a token and a mutable reference to the user data. This method returns Ok(()) if the token was successfully parsed, or an Err if there was an error.
context.feed( &parser, term, &mut userdata ); // feed a terminal symbol and update the state machine
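The push-style interface described above (feed one terminal at a time, query what is expected next) can be mimicked with a hand-written toy. The `ToyContext` type below is hypothetical and is not the code RustyLR generates: it assumes the grammar `E: '(' E ')' | 'a'` with `'\0'` as EOF, drops the `&parser` argument, and uses the userdata only to count accepted tokens.

```rust
/// Hypothetical hand-written push parser for `E: '(' E ')' | 'a'`.
#[derive(Debug, Default)]
struct ToyContext {
    depth: usize,    // number of currently open parentheses
    seen_atom: bool, // whether the innermost `a` has been consumed
    finished: bool,  // whether EOF has been accepted
}

impl ToyContext {
    fn new() -> Self {
        Self::default()
    }

    /// Terminals acceptable in the current state ('\0' stands for EOF).
    fn expected(&self) -> Vec<char> {
        if self.finished {
            vec![]
        } else if !self.seen_atom {
            vec!['(', 'a']
        } else if self.depth > 0 {
            vec![')']
        } else {
            vec!['\0']
        }
    }

    /// Checks whether `term` could be fed without changing any state.
    fn can_feed(&self, term: char) -> bool {
        self.expected().contains(&term)
    }

    /// Feeds one terminal; on success the userdata counter is incremented.
    fn feed(&mut self, term: char, userdata: &mut i32) -> Result<(), String> {
        if !self.can_feed(term) {
            return Err(format!(
                "unexpected {:?}, expected one of {:?}",
                term,
                self.expected()
            ));
        }
        match term {
            '(' => self.depth += 1,
            'a' => self.seen_atom = true,
            ')' => self.depth -= 1,
            _ => self.finished = true, // only EOF can reach this arm
        }
        *userdata += 1;
        Ok(())
    }
}
```

A rejected token leaves the toy context untouched, so a caller can inspect `expected()` to build an error message and then decide whether to continue.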
Note that the actual definitions are a bit different if you are building a GLR parser.
GLR Parsing
RustyLR offers built-in support for Generalized LR (GLR) parsing, enabling it to handle ambiguous or nondeterministic grammars that traditional LR(1) or LALR(1) parsers cannot process. See GLR.md for details.
Semantic Error Handling
RustyLR provides two mechanisms for handling semantic errors during parsing:
- Return Err from the reduce action.
- Use the error token for panic-mode error recovery.
You can also combine both.
JsonObject: '{' JsonKeyValue* '}'
| '{' error '}' { println!("recovering with '}}'"); }
;
The error token is a reserved non-terminal symbol that can match any token.
In the above example, if the parser encounters an invalid token while parsing a JSON object, it will enter panic mode and discard all tokens until it finds a closing brace }.
When an invalid token is encountered, the parser enters panic mode and starts discarding symbols from the parsing stack until it finds a point where the special error token is allowed by the grammar.
At that point, it shifts the invalid token as the error token and then tries to complete the rule that contains the error token.
- The error token does not carry a value and has no associated rule type.
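The recovery steps described above can be sketched as a simplified model. It is not RustyLR's implementation: the `State` enum, `allows_error`, and `recover` are hypothetical, and the states are reduced to the minimum needed for the `JsonObject` rule. The stack is popped until a state that permits `error` is on top, then input is skipped up to the closing brace.

```rust
/// Hypothetical parser states for
/// `JsonObject: '{' JsonKeyValue* '}' | '{' error '}'`.
#[derive(Debug, Clone, Copy, PartialEq)]
enum State {
    Start,      // nothing consumed yet
    AfterOpen,  // just after '{'; the `error` token is allowed here
    InKeyValue, // somewhere inside a key-value pair
}

/// Returns true if the grammar permits the `error` token in `state`.
fn allows_error(state: State) -> bool {
    state == State::AfterOpen
}

/// Panic-mode recovery. `rest[0]` is the invalid token, which stands in for `error`.
/// Pops states until one allows `error`, then looks for the closing '}'.
/// Returns how many tokens are discarded before that '}', or `None` if
/// no state on the stack allows `error` or no '}' follows.
fn recover(stack: &mut Vec<State>, rest: &[char]) -> Option<usize> {
    while let Some(&top) = stack.last() {
        if allows_error(top) {
            break;
        }
        stack.pop(); // discard a symbol from the parsing stack
    }
    if stack.is_empty() {
        return None;
    }
    rest.iter().position(|&c| c == '}')
}
```

After a successful call, the stack ends at the state following `{`, so parsing can resume by shifting the `}` and reducing the `'{' error '}'` alternative.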
Examples
- Calculator: A calculator using u8 as token type.
- Json Validator: A JSON validator.
- Lua 5.4 syntax parser
- Bootstrap: The rusty_lr syntax parser is written in rusty_lr itself.
Lexer Capabilities
While RustyLR is primarily a parser generator, it also functions effectively as a lexer. Its design allows for efficient tokenization of input streams, addressing challenges like the "too-many-characters" problem. By constructing optimized state automata, it ensures rapid and memory-efficient lexing, making it suitable for processing large or complex inputs.
Cargo Features
- build: Enable build script tools.
- tree: Enable automatic syntax tree construction (for debugging purposes).
- error: Enable detailed parsing error messages with backtrace (for debugging purposes).
Syntax
RustyLR's grammar syntax is inspired by traditional Yacc/Bison formats. See SYNTAX.md for details of grammar-definition syntax.
Contribution
- Any contribution is welcome.
- Please feel free to open an issue or pull request.
License (Since 2.8.0)
Either of
- MIT license (LICENSE-MIT or http://opensource.org/licenses/MIT)
- Apache License, Version 2.0 (LICENSE-APACHE or http://www.apache.org/licenses/LICENSE-2.0)