
Luther derive

Luther is an embedded lexer generator for stable Rust.

This crate is the proc macro implementation for deriving the Lexer trait from the Luther crate. See the crate level documentation for the options recognized by the proc macro. See the Luther crate for an example of the usage of this crate.

License

Luther is licensed under either of

  • Apache License, Version 2.0
  • MIT license

at your option.

Contribution

Unless you explicitly state otherwise, any contribution intentionally submitted for inclusion in Luther by you, as defined in the Apache-2.0 license, shall be dual licensed as above, without any additional terms or conditions.


lib.rs:

luther_derive provides a procedural macro to derive the luther::Lexer trait.

Deriving the luther::Lexer trait is expected to be the primary (possibly only) way of implementing this trait. The trait can be derived on an enum of token types where the variants of the enum are annotated with a regular expression. Not all variants of the enum need to be annotated with a regular expression, but variants that do not have such an annotation will not be returned by the lexer that luther_derive generates.

Generating the lexer adds a visible type name for the deterministic finite automaton that the lexer uses internally. Once hygienic macros are available it will be possible to hide this name, but with the current implementation of procedural macros the name is visible. By default the name is formed by appending the suffix Dfa to the name of the enum on which luther::Lexer is derived. This default can be overridden with the dfa option of the luther attribute.
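
As an illustration, a minimal sketch of overriding the default DFA name with the dfa option; the names MyToken and MyLexerDfa are arbitrary choices, not part of the crate:

extern crate luther;

#[macro_use]
extern crate luther_derive;

// Without the dfa option the generated DFA type would be named MyTokenDfa.
#[derive(Lexer)]
#[luther(dfa = "MyLexerDfa")]
enum MyToken {
    #[luther(regex = "ab")]
    Ab,
}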

Example

extern crate luther;

#[macro_use]
extern crate luther_derive;

#[derive(Lexer)]
enum Token {
    #[luther(regex = "ab")]
    Ab,

    #[luther(regex = "acc*")]
    Acc(String),
}

Capturing the recognized characters

If a variant of the enum on which the lexer is being generated includes a single type (like the Acc variant in the above example) and that type implements str::FromStr (as String does), then the generated lexer will capture the recognized characters whenever it matches that variant's regular expression. The characters are converted to a value of that type using the type's str::FromStr implementation.

It is an error to include more than one type in an enum variant; luther_derive detects this error. It is also an error to include a single type that does not implement str::FromStr, but luther_derive cannot detect this case, and it will likely manifest as a confusing error message from the compiler.

For now the single type included in an enum variant must also implement default::Default, although this restriction may be lifted in the future.

The code to capture the characters will be something similar to characters.parse().unwrap_or_default(), where characters is a &str of the recognized characters.
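
For a rough illustration, the following hand-written sketch approximates the capture step described above for the Token enum from the earlier example; capture_acc is a hypothetical helper, not the actual generated code:

enum Token {
    Ab,
    Acc(String),
}

// Sketch only: approximates the capture step for the Acc variant.
fn capture_acc(characters: &str) -> Token {
    // parse() uses String's str::FromStr impl; unwrap_or_default() is why the
    // captured type must currently also implement default::Default.
    Token::Acc(characters.parse().unwrap_or_default())
}

fn main() {
    match capture_acc("accc") {
        Token::Acc(s) => assert_eq!(s, "accc"),
        Token::Ab => unreachable!(),
    }
}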

The luther attribute

luther_derive recognizes the luther attribute both on the enum for which luther::Lexer is being derived and on the variants of that enum. The attribute supports various options, which are given as `#[luther(option = "value")]`.

The luther attribute supports the following options, with an indication of where each option is valid (on the enum or on a variant):

  • dfa: the name to use for the generated deterministic finite automaton [enum]
  • regex: the regular expression to recognize for a particular variant [variant]
  • priority_group: the priority group to which a variant belongs [variant]

Priority groups

It is possible for the regular expressions for more than one enum variant to match the same input. For example, the following regular expressions all match the input "auto":

  1. "auto"
  2. "[a-z]+"
  3. "[a-z]+[0-9]*"

The lexer generated by luther_derive favours simple strings given as the regex option of the luther attribute over more complicated regular expressions. In the list above, this means that item 1 is preferred over items 2 and 3. This rule allows the lexer to prefer keywords over identifiers, for example.

If the preference for simple strings is not enough to resolve the ambiguity, you will have to use the priority_group option of the luther attribute to indicate which of the two (or more) variants has the higher priority (a smaller number indicates a higher priority). Within a priority group, luther_derive will still favour simple strings over more complicated regular expressions.

The default value for priority_group if it is not specified is 1.
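
For illustration, a hedged sketch of using priority_group to resolve the ambiguity between the last two patterns above; the variant names are arbitrary, and the option values follow the `#[luther(option = "value")]` form described earlier:

extern crate luther;

#[macro_use]
extern crate luther_derive;

#[derive(Lexer)]
enum Token {
    // Simple string: preferred over both patterns below when the input is "auto".
    #[luther(regex = "auto")]
    KwAuto,

    // No priority_group given, so this variant defaults to priority group 1.
    #[luther(regex = "[a-z]+")]
    Ident(String),

    // A smaller number is a higher priority, so Ident is preferred whenever
    // both of these regular expressions match the same input.
    #[luther(regex = "[a-z]+[0-9]*", priority_group = "2")]
    Name(String),
}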

Errors

luther_derive will raise an error at compile time in the following circumstances (among others):

  • the #[derive(Lexer)] invocation is on a struct rather than an enum
  • none of the variants of the enum have a luther attribute with the regex specified
  • one of the regexes specified for a variant would match the empty string
  • a variant includes types that do not form a tuple of arity 1
  • the value provided for the regex option can't be parsed as a regular expression
  • the value provided for the priority_group option can't be parsed as an integer
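
For example, the following sketch is expected to be rejected at compile time, because the regular expression can match the empty string:

#[macro_use]
extern crate luther_derive;

// Expected to fail: "a*" matches the empty string, which luther_derive rejects.
#[derive(Lexer)]
enum Bad {
    #[luther(regex = "a*")]
    MaybeA,
}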
