4 releases

0.2.0	Nov 27, 2023
0.1.3	Nov 21, 2023
0.1.2	Nov 14, 2023
0.1.1	Aug 30, 2023

#249 in Parser tooling

33 downloads per month

GPL-3.0 license

67KB
2K SLoC

tuck5: The Next Attempt

(at making a parser generator)

Tuck5 is a lexer/parser generation library written in Rust with a dual focus on performance and ease of use. The meta system is the main, and recommended, method of interacting with tuck5. It provides a domain-specific language that allows developers to easily define how to transform a set of tokens. Under the hood, this simple syntax is executed as quick for-loops, with an unprecedented level of developer control of execution.

Understanding the TUCK nature

tuck5 uses a system of tokens, as is to be expected by a tokenizer framework. Tokens are either branches or leaves. If a token is a leaf, it contains a range of indices into its corpus text. If it is a branch, it contains a vector of tokens as its children. Tokens also contain some extra data: as the developer, you are able to statically control this data to fit your needs. (Note: if you are using the meta system, this auxiliary data takes the form of a vector of owned strings, which act as tags. A transition is planned to string pointers or perhaps a bitflags-based system. This does not affect the experience of using the meta system.)

To parse a corpus string, the characters of the string are initially converted into a series of leaf tokens where each token has two tags: a where a is the content of the token, u97 where 97 is the decimal conversion of the UTF-8 value of the character, and ws ONLY if the character is classified as Unicode whitespace.

Then, a series of rules are applied to the tokens. A rule consists of a condition that may match a sequence of tokens, a transformation method (whether a matching sequence is converted into a single leaf token, a single branch token, or removed entirely), and a series of tags to apply to the token that the matching tokens will be converted into. For examples of this, see the section [[#Using the meta system]].

Each rule is applied on each token, one-by-one. Many rules accept multiple tokens. They have specifically defined behavior for reaching the end of the corpus string. Each rule is repeated applied to the corpus string until it no longer can be applied. If specified, a group of rules can be applied in sequence, repeatedly, until none of them can be applied.

Using the `meta` system

Define a meta-parser (a parser that is generated by the meta system) with this syntax.

Rules

'abc' for any string abc: matches if the next token's content exactly equals the characters between the 's.

a..z for any characters a and z: matches if the next token is one character long, and that character is between a and z inclusive.

{.abc.} for any string abc with x characters: matches if the next x tokens exactly match each character in abc.

tag for any string tag: matches if the next token has the tag tag.

rule1 & rule2: matches if rule1 applies from token x to token y, then rule2 applies from token y + 1 to token z. Can be chained.

rule1 | rule2: matches if either rule1 or rule2 applies. Can be chained.

rule?: If rule matches the next x tokens, this matches for x tokens. Otherwise, matches 0 tokens. Not intended to be used alone (will probably lock your parser up until warnings are implemented).

rule*: Repeatedly matches rule. If rule matches tokens x to y, tries again on tokens y to z and so on. If rule doesn't match, matches all the tokens that rule did match on. Once again, not intended to be used alone, as it will match even if rule matched 0 times.

rule+: Repeatedly matches rule after matching it one time: see above. Equivalent to rule & rule*. Can be used alone.

Transformations

rule . tag0, tag1, etc;: If rule matches the next x tokens, transforms those tokens into one leaf token (discarding the data of the "children") with the specified tags. More performant than the alternative below.

rule : tag0, tag1, etc;: If rule matches the next x tokens, transforms those tokens into one branch token (with the transformed tokens as the children) with the specified tags. Useful for converting into an AST.

rule~;: If rule matches the next x tokens, removes those tokens. Most often used for whitespace removal, but if AST generation is the goal, it can also be used for comments.

Miscellaneous

# bla bla bla (line break): A one-line comment.

## bla bla bla ##: A multi-line comment.

{ transformation transformation ... }: Apply each of these transformations in order until none of them match. Useful for preserving an order of operations.

(rule): The same as a rule. Useful for the order of operations of the meta system or for clarity.

Examples

See the calculator folder in tests for a full example of the meta system. Here is an explanation of a simpler version of it.

(0..9)+. positive, int, expr;

Here, we define our positive integers. 0..9 represents a range of characters, which will match against the digits from 0 to 9. The + signifies to match against one or more digits.

int & '.' & int. positive, decimal, expr;

Here, int matches against any token we recently defined with the tag int. If the sequence of an integer, followed by a decimal point, followed by an integer (which, as we defined, can have leading zeroes) is found, these are transformed into a decimal token.

'-' & positive. negative, expr;

Now we define the negative numbers. If a minus sign is matched, then a positive number, we apply the negative tag.

ws~;

Here, we remove whitespace, as specified by the pre-defined ws tag.

{
    '(' & expr & ')': parens, expr;
}

The braces around this statement means that there are more to follow it, and they should be executed in sequence, repeatedly. We will see this in action when we add the operations. For now, if parentheses surround a token with the expr tag (which you may have noticed was added to every number), all three tokens will be transformed into one with the parens and expr tags.

{
    '(' & expr & ')': parens, expr;
    expr & '*' | '/' & expr: oper, expr;
}

Here, we add multiplication and division to our repeated transformations. With the braces, you'll find that any sequence of nested parentheses, multiplications, and divisions will follow the rules of PEMDAS.

{
    '(' & expr & ')': parens, expr;
    expr & '*' | '/' & expr: oper, expr;
    expr & '+' | '-' & expr: oper, expr;
}

Finally, we complete our simple calculator by adding addition and subtraction. To access our tagged tokens in Rust, use the tuck5::meta::eval_prog_from_text function with text of your meta-parser and the text you'd like parsed. The result will be a vector of tokens.