#ast #grammar #tokens #ast-parser #parse #language #parse-tree

sprout

A rust crate for growing simple, but beautiful AST trees 🌳

5 releases (1 stable)

1.0.0 Nov 6, 2022
0.2.1 Nov 5, 2022
0.2.0+1 Nov 5, 2022
0.1.1 Nov 3, 2022
0.1.0 Nov 3, 2022

#18 in #ast-parser

40 downloads per month

MIT license

120KB
2K SLoC

Sprout

sprout is a rust crate that parses text into ASTs (Abstract Syntax Trees), given definitions for tokens and grammars built from those tokens.

Usage

You can start by importing the sprout prelude. For most projects, this will be sufficient.

use sprout::prelude::*;

Example

First, define an enum for your tokens. It should derive the necessary traits shown below.

#[derive(Debug, PartialEq, Eq, Clone, Copy)]
pub enum Token {
   Number,
   Word,
   Space
}

Implement std::fmt::Display for your token enum. This will be used to generate human-readable error messages.

use std::fmt;

impl fmt::Display for Token {
   fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
      match self {
         Self::Number => write!(f, "number"),
         Self::Word => write!(f, "word"),
         Self::Space => write!(f, "space")
      }
   }
}

Now, you can use the alphabet macro to provide definitions for your tokens using a subset of regular expressions syntax that includes parentheses (()), square brackets ([]), ranges ([a-z]), as well as the operators *, + and ?.

use Token::*;

let alphabet = alphabet! {
   Number => "[0-9]+";
   Word => "[a-z]+";
   Space => " "
};

Next, you define your grammar, in terms of "procedures", or "procs". They are the abstract parts of your language that will end up forming the nodes of your syntax tree.

First of all, you again create an enum:

#[derive(Debug, PartialEq, Eq, Hash, Clone, Copy)]
pub enum Proc {
   TwoOrThreeWords,
   WordOrNumber,
   Sequence
}

And, again, implement std::fmt::Display:

impl fmt::Display for Proc {
   fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
      match self {
         Self::TwoOrThreeWords => write!(f, "two or three words"),
         Self::WordOrNumber => write!(f, "word or number"),
         Self::Sequence => write!(f, "word/number sequence")
      }
   }
}

Finally, define your procedures using the grammar macro.

use Token::*;
use Proc::*;

let grammar = grammar! {
   #TwoOrThreeWords => Word, Space, Word, (Space, Word)?;
   #!WordOrNumber => [Word; Number];
   #?Sequence => ([#TwoOrThreeWords; #WordOrNumber]){Space}+,
};

As you can see in this example, within a grammar definition, names of procedures are always prefixed with a #.

The #! prefix on a procedure definition means that that procedure is primitive. Primitive procedures are considered to be the simple building blocks of your language, which means that error messages will refer to them by their name rather than their composite parts. You should use #! on procedures that are so low-level that their construction can be considered an implementation detail and is not relevant to the user.

The #? prefix on a procedure definition means that that procedure is hidden, meaning that they will never appear in error messages by name. This is good to make sure your error messages don't get too abstract. You should use #? on procedures that are so high-level and abstract that they are not relevant to localized parsing errors.

Names of tokens are usually left as-is, but you can also designate tokens as signatures by prefixing them with an @. Signature tokens give the current procedure priority over others, even if it only matches up to that point. This is handy if you want to define a certain structure that forces the parser to interpret the text in a certain way, even if it leads to an error. It is good to use signatures to try to capture the intuition of when a certain procedure is "obviously" meant, even if its structure is incorrect; this helps produce more helpful error messages.

Sequences of tokens/procedures are comma-separated.

You can use the following special syntax inside procedure definitions for more complex patterns:

Syntax Description
(<A>)* Repeat <A> zero or more times
(<A>)+ Repeat <A> one or more times
(<A>){<B>}* Repeat <A> zero or more times, delimited by <B>
(<A>){<B>}+ Repeat <A> one or more times, delimited by <B>
(<A>)? <A> is optional
[<A>; <B>; ...] Select one of <A>, <B>, etc.

Now, finally, from your alphabet and grammar, you can construct a parser:

let parser = Parser::new(alphabet, grammar);

This parser will now simply spit out an AST (or a ParsingError) for any string you throw at it! For example this input

let tree = parser.parse(Proc::Sequence, "abc ab 123 xyz 69".to_string());

would produce an AST like this:

Sequence["abc ab 123 xyz 69"](
   TwoOrThreeWords["abc ab"]
   WordOrNumber["123"]
   WordOrNumber["xyz"]
   WordOrNumber["69"]
)

Specifically, the output type, AST, is an alias for a trees::Tree of ASTNodes, where ASTNode has these fields:

Name Type Description
proc Proc (or whatever you called it) The procedure that this node corresponds to
text String The text contained in the instance of the procedure
pos TextPosition (has fields line and char) The position of the start of the instance in the text

For more information about the API of trees::Tree see the trees documentation

Also...

Yes, I know the description of this crate falls victim to the RAS syndrome, but I don't care :)

Dependencies

~2–12MB
~86K SLoC