2 unstable releases
new 0.5.0 | Dec 10, 2024 |
---|---|
0.4.0 | Jul 27, 2023 |
#364 in Parser implementations
122 downloads per month
245KB
2.5K
SLoC
Panfix Parsing
Panfix parsing is a new approach to parsing based on (but more expressive than) operator precedence grammars.
It is not based on Context Free Grammars (CFGs), nor on Parsing Expression Grammars (PEGs). It's has a different style of grammar, roughly a list of multifix operators. The rules for what constitute a valid grammar are extremely simple (unlike, say, LR or LALR parsing).
Panfix parsing always runs in linear time, with no dependence on the size of
the grammar. That is, if N
is the size of the text to be parsed and G
is the
size of the grammar, then a panfix parser runs in O(N)
(not O(NG)
like
Packrat PEG parsing).
It gives very lax parsers, which will happily generate a parse tree in the presense of a variety of errors. On one hand, this increases the number of cases you need to deal with when processing a parse tree. On the other hand, these extra cases are error cases and this is a good time to produce custom error messages for them.
Example: JSON
The best introduction might be a worked example. Let's look at what it takes to parse JSON with panfix parsing. Here's a panfix grammar for JSON:
use panfix::{Parser, Grammar, GrammarError, pattern};
fn make_json_parser() -> Result<Parser, GrammarError> {
let mut grammar = Grammar::new("[ \n\r\t]+")?;
grammar.regex("String", r#""([^\\"]|(\\.))*""#)?;
grammar.regex("Number", r#"-?(?:0|[1-9]\d*)(?:\.\d+)?(?:[eE][+-]?\d+)?"#)?;
grammar.regex("Invalid", r#"[a-zA-Z_][a-zA-Z0-9_]*"#)?; // for catching missing qutoes
grammar.string("Null", "null")?;
grammar.string("True", "true")?;
grammar.string("False", "false")?;
grammar.op("Array", pattern!("[" "]"))?;
grammar.op("Object", pattern!("{" "}"))?;
grammar.right_assoc();
grammar.op("Keyval", pattern!(_ ":" _))?;
grammar.right_assoc();
grammar.op("Comma", pattern!(_ "," _))?;
grammar.finish()
}
The regex
and string
lines are defining how to parse constants like strings
and booleans. regex
means "match this regex patternand
stringmeans "match this exact string". The "Array" line says that
[starts an array and
]` ends
it. The "Object" line is similar.
Next up is:
grammar.right_assoc();
grammar.op("Keyval", pattern!(_ ":" _))?;
grammar.right_assoc();
grammar.op("Comma", pattern!(_ "," _))?;
The two right\_assoc()
calls begin two precedence groups, each with one
operator. Precedence groups have two effects.
First, operators in groups defined earlier bind tighter than operators defined
in later groups. For example, you would define &&
in an earlier precedence
group than >
, so that x < y && y < z
parses as (x < y) && (y < z)
instead
of x < (y && y) < z
.
Second, each precedence group declares whether its operators are left or right
associative. For example, field access _._
should be declared right
associative so that person.birthday.month
is parsed as
(person.birthday).month
instead of person.(birthday.month)
. In the JSON
example, we declare :
and ,
to be right associative, but in this case left
associative would work just as well.
The pattern!(_ "," _)
declares that ,
is an operator that takes a left and
right argument, denoted by an underscore at the beginning and end. In contrast,
pattern!("[" "]")
lacks underscores and thus says that JSON arrays do not take
a left or right argument. On the other hand, an argument is always allowed
between tokens: such as between [
and ]
.
Notice some things that this grammar does not say. Nowhere does it say that
:
is only allowed inside object {...}
, nor that object keys have to be
Strings. Panfix parsing is quite unusual in this regard. You don't define the
exact grammar you want, you instead define a superset of the grammar, and
construct context-specific error messages later, as you process the parse tree.
Parsing some (badly formed) JSON
Let's see what happens when we parse this badly formed "JSON":
{
"id": 999,
"object_class:" "safe",
"weight_kg": 54.5
"disposition": "friendly",
"diet": [
"M&Ms",
"Necco wafers",
"other sweets",
],
"interactions": {
"target_id": 682,
"effect": mixed,
}
}
as so:
let parser = make\_json\_parser().unwrap();
let input = ...read that JSON file...;
let source = Source::new("bad\_json", input);
match parser.parse(&source) { ... }
The call to parse
will actually return Ok
instead of Err
! This is because
panfix parsing is purposefully very lax. The only thing it cares about is that
the operators are complete, and in this example [
is always matched with ]
and
{
is always matched with }
. Thus parse
returns Ok(tree)
. When you Display
this tree, you get:
(Object
(Comma (Keyval "id" 999)
(Comma (_ "object_class:" "safe")
(Comma
(Keyval "weight_kg"
(Keyval (_ 54.5 "disposition") "friendly"))
(Comma
(Keyval "diet"
(Array
(Comma "M&Ms"
(Comma "Necco wafers"
(Comma "other sweets" _)))))
(Keyval "interactions"
(Object
(Comma (Keyval "target_id" 682)
(Comma (Keyval "effect" mixed) _)))))))))
The mechanism for this laxness in parsing is that two kinds of nodes were
implicitly inserted, displayed as _
above.
Blank nodes
Blank nodes are inserted where an argument is missing. Consider, for example, the "diet" portion of the JSON:
"diet": [
"M&Ms",
"Necco wafers",
"other sweets",
],
,
is defined to take two arguments, but the argument after "other sweets",
is missing. Thus in the parse tree a Blank, written _
, is inserted:
(Array
(Comma "M&Ms"
(Comma "Necco wafers"
(Comma "other sweets" _)))))
In JSON this is an error, though in languages that allow a trailing comma it wouldn't be.
Juxtapose nodes
Juxtapose nodes are the complement of Blank nodes: they are inserted where there are two expressions in a row with nothing to join them. For example, this fragment of the JSON:
"weight_kg": 54.5
"disposition": "friendly",
turns into this fragment of the parse tree:
(Keyval "weight_kg"
(Keyval (_ 54.5 "disposition") "friendly"))
where the _
is a Juxtapose node, saying that 54.5
and "disposition"
were
juxtaposed (i.e., adjacent).
Just like the Blank case isn't always an error, the Juxtapose case might be
expected. For example, even in JSON the empty list []
would parse as
(Array _)
.
The combination of Blank and Juxtapose nodes makes panfix parsing very lax. This is beneficial in a way, as we'll see next.
Parse tree -> JSON
What now? The hard work of producing a parse tree (with source locations, to
boot) has already been done. What remains is the easy but verbose work of
walking that tree and converting it into a Json
type:
#[derive(Debug)]
enum Json {
String(String),
Number(f64),
Boolean(bool),
Null,
Array(Vec<Json>),
Object(HashMap<String, Json>),
}
You can find this conversion at examples/json.rs. It's verbose but straightforward. Here's a snippet from it:
fn parse_list(&mut self, mut visitor: Visitor<'s, '_, '_>, elems: &mut Vec<Json>) {
while visitor.name() == "Comma" {
let [head, tail] = visitor.children();
let head = self.parse_value(head);
elems.push(head);
visitor = tail;
}
if visitor.name() == "Blank" {
self.error(visitor, "trailing comma", "JSON does not allow trailing commas.");
} else {
elems.push(self.parse_value(visitor));
}
}
This may feel like busywork, but it's not. Good error messages are extremely
context dependent; panfix
does not know enough to produce quality error
messages on its own. Notice, for example, the message above: "JSON does not
allow trailing commas". That can only be hand written.
visitor
is a reference to a node in the tree. You can call .children()
to get its children, and
.error(short_message, full_message)
to construct an error message at that source location. The
short message is displayed inline next to the source code, while the full message is displayed
separately.
The fact that panfix parsing is so lax comes out to shine: our JSON example produces a whole set of helpful error messages for the bad JSON we've been looking at:
Parse Error: Expected a key:value pair.
--> [stdin]:3:5
|
3 | "object_class:" "safe",
| ^^^^^^^^^^^^^^^^^^^^^^ expected key:value
Parse Error: Expected a JSON value here, not a key:value pair.
--> [stdin]:4:18
|
4 | "weight_kg": 54.5
| ^^^^
5 | "disposition": "friendly",
|^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ unexpected key:value pair
Parse Error: JSON does not allow trailing commas.
--> [stdin]:9:24
|
9 | "other sweets",
| ^ trailing comma
Parse Error: Missing quotes.
--> [stdin]:13:19
|
13 | "effect": mixed,
| ^^^^^ missing quotes
Parse Error: JSON does not allow trailing commas.
--> [stdin]:13:25
|
13 | "effect": mixed,
| ^ trailing comma
For the full example of JSON parsing, see examples/json.rs. For another example, see examples/calc.rs.
Specification
Now that you have the lay of the land, let's switch to focus on the details, requirements, and guarantees.
Panfix Grammars
A panfix Grammar
consists of a whitespace regex, and a set of operators.
Everything is an operator, from numbers to identifiers to binary operators to
function definitions. Each operator has the following properties:
- Name gives the operator a name, so that you can identify it in the parse tree later.
- Tokens say what sequence of tokens the operator matches. For example:
identifiers might match the regex token
[a-zA-Z_]+
; parentheses for grouping would match the sequence of tokens "(", ")"; and function definitions might match "function", "(", ")", "{", "}". - Precedence says how tightly it binds. For example, field access
_._
should bind tighter than array indexing_[_]
, so thatcatalog.entries[0]
parses as(catalog.entries)[0]
rather thancatalog.(entries[0])
. - Associativity says how it should group with itself. For example, minus
should be left associative so that
1 - 2 - 3
is equal to((1 - 2) - 3) = -1 - 3 = -4
instead of right associative where it would be equal to(1 - (2 - 3)) = 1 - -1 = 0
. - Fixity says whether it takes a left argument and whether it takes a right
argument. For example: field access
_._
takes both a left and right argument; array indexing_[_]
takes a left argument but no right argument (because there's an argument before the first token[
but no argument after the last token]
); and a number like2.5
takes neither a left nor a right argument.
You can specify these properties manually with grammar.add_raw_op(name, prec, assoc, fixity, tokens)
, but there's also a higher level interface to make this
easier:
grammar.regex(name: &str, regex_pattern: &str)
defines an operator with a single token and no arguments. Since it has no arguments, the precedence and associativity are irrelevant. Use this for e.g. identifiers, numbers, and string literals.grammar.string(name: &str, string_pattern: &str)
is the same, but matches against a literal string pattern instead of a regex pattern.grammar.left_assoc()
andgrammar.right_assoc()
introduce a new operator group. Operators defined after them belong to that operator group. Operators in an operator group all have the same associativity, and all have the same precedence as each other. Operators defined in earlier groups have tighter precedence than operators defined in later groups. (If neither of these methods is called, the default is left-associativity.)grammar.op(name: &str, pattern: Pattern)
defines an operator. It inherits the precedence and associativity of its group, andpattern
says what it's fixity and tokens are. In thepattern!
macro, fixity is declared by putting a leading or trailing underscore if there is a left or right argument respectively, and putting the tokens in the middle in quotes. For example,pattern!(_ "[" "]")
has two tokens, and its fixity is that is has a left argument but no right argument.
A grammar must obey three simple rules:
- Its regexs must be well formed.
- At most two operators may start with the same token: one that has a left argument, and one that does not. (For example, this allows having an operator for both unary and binary minus: they both start with the token "-", but binary minus has a left argument and unary minus does not.)
- Operators with the same precedence must have the same associativity. (This
rule is enforced by the builder pattern, so you only need to think about it
if you use
add_raw_op
.)
All three rules are enforced by the Grammar
type; you will get an error if
you violate them when you call Grammar.finish()
.
Lexing
When the source is being lexed, multiple regexes from the grammar might match. The rule for which one gets used is that:
- The longer match wins,
- or failing that a string pattern wins over a regex pattern,
- or failing that the pattern defined earlier in the grammar wins.
Parsing
Given a panfix grammar and a source file, you can parse the source in linear time, producing either a parse error or a parse tree.
There are three kinds of parse errors:
- A lexing error. For example, encountering the character
%
in JSON. - An unexpected token. For example, encountering
)
without any(
. - An operator began but was not completed. For example, encountering an
(
but no following)
.
These are the only errors. If there aren't any nonsense tokens and your parentheses (and such) are matched, the parse will succeed!
If parsing is successful, it produces a parse tree. Each node in the tree has an operator and a source location. The operators in the tree include the operators defined in the grammar, plus two extra operators:
- The
"Blank"
operator denotes a missing argument. For example, in the sourcetrue ||
, a "Blank" operator would be inserted to the right of||
. - The
"Juxtapose"
operator denotes that two expressions are adjacent. For example, in the source2 3
, a "Jutxapose" operator would be inserted between2
and3
.
The parse tree is guaranteed to be well formed, in the sense that:
- Each node has children in this order: (i) one child for its left argument, if
it takes one; (ii) one child for each space "between" its tokens; and (iii)
one child for its right argument, if any. Blank nodes have zero children and
Juxtapose nodes have two. (For example, array indexing
_[_]
has two children: 1 for its left argument and 1 for the space between its first and second tokens.) - If a node has a left argument, then either it has looser precedence than its first child, or it is left associative and has equal precedence to its first child. Likewise, if it has a right argument then either it has looser precedence than its last child, or it is right associative and has equal precedence.
- There aren't "too many" blanks and juxtaposes. For example,
1 + 2
will be parsed as(+ 1 2)
and not as(Juxtapose (+ 1 Blank) 2)
.
Putting it all together
- Construct a
Source
, from file or stdin or whatnot. - Construct a
Grammar
using the builder pattern, oradd_raw_op
if you need more control. CallGrammar.finish()
to get aParser
. This will check if the grammar is valid andError
if not. - Parse using
Parser.parse(Source)
. If there are any errors, give up and display them. - If that succeeded, convert the
ParseTree
into an AST (or whatever your internal representation will be). This is the time to check for errors. Basically: walk the tree, checking thename()
of each node: if it's expected then recur, and if it's unexpected then produce a custom error message.
Again, to see a full example, see examples/json.rs or examples/calc.rs.
Dependencies
~2–13MB
~112K SLoC