5 releases
| Version | Date |
|---|---|
| 0.2.0 | May 12, 2021 |
| 0.1.3 | Oct 28, 2020 |
| 0.1.2 | Oct 27, 2020 |
| 0.1.1 | Oct 16, 2020 |
| 0.1.0 | Oct 16, 2020 |
#177 in Parser tooling
89KB
926 lines
Flexer
This library provides a finite-automata-based lexing engine that can flexibly tokenize an input stream.
lib.rs:
This module exports the API for defining a simple lexer based on a deterministic finite state automaton.
Lexers defined using the Flexer are capable of lexing languages of significant complexity, and while they're defined in a simple way (akin to regular grammars), they can work even with context-sensitive languages.
The process of defining a lexer involves the user doing the following:
- Creating a `Lexer` type that wraps the `Flexer`.
- Creating a `State` type to hold the user-defined lexing state.
- Implementing `State` for the `State` type.
- Implementing `Definition` for the `Lexer`, along with lexing transition rules to lex the language.
The result of defining a lexer using the flexer is a hybrid of the code written using this library and the code that the library generates to specialize your lexer.
Writing a Lexer
As the Flexer is a library for writing lexers, it would be remiss of us not to include a worked example of how to define one. The following example defines a lexer for a small language, and shows how to integrate the flexer's code-generation step with your project's build.
The Language
We're going to define a lexer for a very simple language, represented by the following EBNF grammar.

```ebnf
a-word      = 'a'+;
b-word      = 'b'+;
word        = a-word | b-word;
space       = ' ';
spaced-word = space, word;
language    = word, spaced-word*;
```
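For intuition: under this grammar, `aa bb a` is a valid sentence (an initial word followed by two spaced words), while `ab` is not a word at all, as each word must consist entirely of `a`s or entirely of `b`s.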
The Lexer's Output
Every lexer needs the ability to write a stream of tokens as its output. A flexer-based lexer can use any type that it wants as its output type, but this language is going to use a very simple `Token` type, wrapped into a `TokenStream`.
```rust
#[derive(Clone)]
pub enum Token {
    /// A word from the input, consisting of a sequence of all `a` or all `b`.
    Word(String),
    /// A token that the lexer is unable to recognise.
    Unrecognized(String)
}

#[derive(Clone,Default)]
pub struct TokenStream {
    tokens:Vec<Token>
}

impl TokenStream {
    pub fn push(&mut self, token:Token) {
        self.tokens.push(token)
    }
}
```
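As a quick illustration, this mirrors exactly what our transition functions will do with the lexer's output later in this guide:

```rust
fn main() {
    // `TokenStream` derives `Default`, so it starts out empty; tokens are
    // appended with the `push` method defined above.
    let mut stream = TokenStream::default();
    stream.push(Token::Word(String::from("aaa")));
    stream.push(Token::Word(String::from("bb")));
}
```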
These tokens will be inserted into the token stream by our lexer as it recognises valid portions of our language.

Whatever you choose as the `Output` type of your lexer, it will need to implement both `std::clone::Clone` and `std::default::Default`.
The Lexer's State
Every Flexer-based lexer operates over a state that holds all of the user-defined state information required to define the particular lexer. This state type must conform to the `State` trait, which defines important functionality that it must provide to the flexer.

In our language, we want to only be able to match words with a preceding space character once we've seen an initial word that doesn't have one. To this end, we need a state in our lexer to record that we've 'seen' the first word. As required by the `State` trait, we also need to provide the flexer with an initial state, the state registry, and the bookmarks we use.
```rust
use enso_flexer::group;
use enso_flexer::prelude::reader::BookmarkManager;
use enso_flexer::State;



// === LexerState ===

#[derive(Debug)]
pub struct LexerState {
    /// The registry for groups in the lexer.
    lexer_states:group::Registry,
    /// The initial state of the lexer.
    initial_state:group::Identifier,
    /// The state entered when the first word has been seen.
    seen_first_word_state:group::Identifier,
    /// The bookmarks for this lexer.
    bookmarks:BookmarkManager
}
```
The flexer library provides useful functionality to help with defining your lexer state, such as `group::Registry` for containing the various states through which your lexer may transition, and `prelude::reader::BookmarkManager` for storing bookmarks.
Bookmarks
In order to enable arbitrary lookahead, the flexer provides a system for "bookmarking" a point in the input stream so that the lexer may return to it later. In fact, this mechanism is used by default in the implementation to deal with overlapping rules, and so the `prelude::reader::BookmarkManager` provides some bookmarks for you by default. As a user, however, you can define additional bookmarks as part of your state, and mark or return to them as part of your lexer's transition functions (more on this below).
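To sketch the idea ahead of time: a transition function can record a position and later return to it. The `BookmarkManager` method names used below (`bookmark` and `rewind`) are illustrative assumptions rather than documented API, so consult the `prelude::reader` module for the real calls.

```rust
impl Lexer {
    // Illustrative only: the BookmarkManager method names are assumed, and
    // `my_bookmark` is a hypothetical user-defined bookmark in the state.
    pub fn on_speculative_rule<R:ReaderOps>(&mut self, reader:&mut R) {
        let mark = self.my_bookmark;
        self.bookmarks.bookmark(mark, reader); // hypothetical: record this position
        // ... lex ahead, and if the speculation fails:
        self.bookmarks.rewind(mark, reader);   // hypothetical: return to the mark
    }
}
```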
Now that we have our state type, we need to define an implementation of `State` for it. This is a mostly trivial exercise, but two functions (`State::new()` and `State::specialize()`) require special attention. We'll look at both below.
```rust
use enso_flexer::generate;
use enso_flexer::generate::GenError;
use enso_flexer::prelude::AnyLogger;



impl enso_flexer::State for LexerState {
    fn new(_logger:&impl AnyLogger) -> Self {
        // Here we construct all of the elements needed for our lexer state. This function can
        // contain arbitrarily complex logic and is only called once at initialization time.
        let mut lexer_states      = group::Registry::default();
        let initial_state         = lexer_states.define_group("ROOT",None);
        let seen_first_word_state = lexer_states.define_group("SEEN FIRST WORD",None);
        let bookmarks             = BookmarkManager::new();
        Self{lexer_states,initial_state,seen_first_word_state,bookmarks}
    }

    fn initial_state(&self) -> group::Identifier {
        self.initial_state
    }

    fn groups(&self) -> &group::Registry {
        &self.lexer_states
    }

    fn groups_mut(&mut self) -> &mut group::Registry {
        &mut self.lexer_states
    }

    fn bookmarks(&self) -> &BookmarkManager {
        &self.bookmarks
    }

    fn bookmarks_mut(&mut self) -> &mut BookmarkManager {
        &mut self.bookmarks
    }

    fn specialize(&self) -> Result<String,GenError> {
        // It is very important to pass both the type name of your lexer and your output type
        // correctly here. This function should always be implemented as a call to
        // `generate::specialize`.
        generate::specialize(self,"Lexer","TokenStream")
    }
}
```
Defining the Lexer Type
With our state type defined, we now have the prerequisites for defining the lexer itself!

The notion behind the way we define lexers in the flexer is to use a chain of `std::ops::Deref` implementations to make the disparate parts feel like a cohesive whole. The `Flexer` itself already implements deref to your state type, so all that remains is to do the following:

- Define your lexer struct itself, containing an instance of the `Flexer`, parametrised by your state and output types.
```rust
use enso_flexer::Flexer;
use enso_flexer::prelude::logger::Disabled;

type Logger = Disabled;



// === Lexer ===

pub struct Lexer {
    lexer:Flexer<LexerState,TokenStream,Logger>
}
```
You'll note that the `Flexer` also takes a logging implementation from the Enso logging library as a type parameter. This lets the client of the library configure the behaviour of logging in their lexer. We recommend aliasing the current logger type (as shown above) for ease of use.
- Implement a `new()` function for your lexer.
```rust
impl Lexer {
    pub fn new() -> Self {
        let lexer = Flexer::new(Logger::new("Lexer"));
        Lexer{lexer}
    }
}
```
- Define `std::ops::Deref` and `std::ops::DerefMut` for your lexer.
```rust
use std::ops::Deref;
use std::ops::DerefMut;



impl Deref for Lexer {
    type Target = Flexer<LexerState,TokenStream,Logger>;
    fn deref(&self) -> &Self::Target {
        &self.lexer
    }
}

impl DerefMut for Lexer {
    fn deref_mut(&mut self) -> &mut Self::Target {
        &mut self.lexer
    }
}
```
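As a quick illustration of why this deref chain matters, both the fields of `LexerState` and the methods provided through the `Flexer` become reachable directly on `Lexer` (a sketch; it defines nothing new):

```rust
fn main() {
    // `Lexer` derefs to `Flexer`, which in turn derefs to `LexerState`, so
    // both layers are reachable without naming the inner `lexer` field.
    let lexer = Lexer::new();
    let _state_id = lexer.seen_first_word_state; // a LexerState field, via Deref
    let _groups   = lexer.groups();              // a State method, same chain
}
```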
You'll note that here we've instantiated the flexer with a `Logger`. This is used for providing debug information during development, and can be accessed from all scopes of your lexer. In release mode, however, logging calls at the "trace", "debug", and "info" levels are optimised away.
Defining the Lexing Rules
Flexer-based lexers operate by matching on a series of `automata::pattern::Pattern`s that describe the language they are trying to lex. They combine these patterns with "transition functions" that may execute arbitrary code when a pattern matches on the lexer's input.

In order to define the lexing rules, we need to implement `Definition` for our lexer, particularly the `Definition::define()` function.
```rust
use enso_flexer::automata::pattern::Pattern;
use enso_flexer::group::Registry;
use enso_flexer;



impl enso_flexer::Definition for Lexer {
    fn define() -> Self {
        // First we instantiate our lexer. Definitions take place _directly_ on the lexer, and
        // manipulate runtime state.
        let mut lexer = Self::new();

        // Then, we define the patterns that we're going to use. For an overview of the pattern
        // APIs, see the `automata::pattern` module in this crate.
        let a_word        = Pattern::char('a').many1();
        let b_word        = Pattern::char('b').many1();
        let space         = Pattern::char(' ');
        let spaced_a_word = &space >> &a_word;
        let spaced_b_word = &space >> &b_word;
        let any           = Pattern::any();
        let end           = Pattern::eof();

        // Next, we define groups of lexer rules. This uses the groups that we've defined in our
        // lexer's state, and the patterns we've defined above.
        let root_group_id = lexer.initial_state;
        let root_group    = lexer.groups_mut().group_mut(root_group_id);
        root_group.create_rule(&a_word,"self.on_first_word(reader)");
        root_group.create_rule(&b_word,"self.on_first_word(reader)");
        root_group.create_rule(&end,   "self.on_no_err_suffix_first_word(reader)");
        root_group.create_rule(&any,   "self.on_err_suffix_first_word(reader)");

        let seen_first_word_group_id = lexer.seen_first_word_state;
        let seen_first_word_group    = lexer.groups_mut().group_mut(seen_first_word_group_id);
        seen_first_word_group.create_rule(&spaced_a_word,"self.on_spaced_word(reader)");
        seen_first_word_group.create_rule(&spaced_b_word,"self.on_spaced_word(reader)");
        seen_first_word_group.create_rule(&end,          "self.on_no_err_suffix(reader)");
        seen_first_word_group.create_rule(&any,          "self.on_err_suffix(reader)");

        lexer
    }

    /// This function just returns the lexer's groups.
    fn groups(&self) -> &Registry {
        self.lexer.groups()
    }

    /// Code you want to run before lexing begins.
    fn set_up(&mut self) {}

    /// Code you want to run after lexing finishes.
    fn tear_down(&mut self) {}
}
```
Transition Functions
You may be wondering why the transition functions are specified as strings. This allows us to generate highly-efficient, specialized code for your lexer once you define it. More on this later.
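The reason strings work is that each rule's string must be valid Rust in the scope of the generated code (see the note on arguments under "Magic Transition Functions" below): the specializer splices it verbatim into the lexer it generates. The following is a purely conceptual sketch of that splicing, not the real output of `generate::specialize`:

```rust
impl Lexer {
    // Conceptual sketch only: the real generated code is shaped differently.
    // The point is that a rule's callback string, e.g.
    // "self.on_first_word(reader)", ends up as ordinary, statically
    // dispatched Rust code at the site where its pattern matched.
    fn hypothetical_generated_callback<R:ReaderOps>(&mut self, reader:&mut R) {
        self.on_first_word(reader) // spliced in verbatim from the rule string
    }
}
```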
A `group::Group` in the lexer is like a state that operates on a stack. A transition function can arbitrarily activate or deactivate a group on the flexer's stack, allowing you to perform context-sensitive lexing behaviour. For more information (including on how to use parent groups to inherit rules), see the relevant module.

For more information on the `automata::pattern::Pattern` APIs used above, please see the relevant module in this crate.
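As an aside, the grammar's `word = a-word | b-word` production could likely also be expressed as a single pattern rather than as two rules sharing a transition function, assuming `Pattern` supports alternation via `|` (an assumption here; this guide only demonstrates `char`, `many1`, `any`, `eof`, and `>>`):

```rust
// A sketch, assuming `Pattern` implements `|` for alternation; check the
// automata::pattern module before relying on it.
let a_word = Pattern::char('a').many1();
let b_word = Pattern::char('b').many1();
let word   = &a_word | &b_word; // matches either an a-word or a b-word
```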
Defining the Transition Functions
You'll have noticed that, up above, we told the rules to use a number of transition functions that we've not yet talked about. These functions can be defined anywhere you like, as long as they are in scope in the file in which you are defining your lexer. We do, however, recommend defining them on your lexer itself so that they can access and manipulate lexer state, and that's what we're going to do here.
```rust
use enso_flexer::prelude::ReaderOps;



impl Lexer {
    pub fn on_first_word<R:ReaderOps>(&mut self, _reader:&mut R) {
        let str = self.current_match.clone();
        let ast = Token::Word(str);
        self.output.push(ast);
        let id = self.seen_first_word_state;
        self.push_state(id);
    }

    pub fn on_spaced_word<R:ReaderOps>(&mut self, _reader:&mut R) {
        let str = self.current_match.clone();
        let ast = Token::Word(String::from(str.trim()));
        self.output.push(ast);
    }

    pub fn on_err_suffix_first_word<R:ReaderOps>(&mut self, _reader:&mut R) {
        let ast = Token::Unrecognized(self.current_match.clone());
        self.output.push(ast);
    }

    pub fn on_err_suffix<R:ReaderOps>(&mut self, reader:&mut R) {
        self.on_err_suffix_first_word(reader);
        self.pop_state();
    }

    pub fn on_no_err_suffix_first_word<R:ReaderOps>(&mut self, _reader:&mut R) {}

    pub fn on_no_err_suffix<R:ReaderOps>(&mut self, reader:&mut R) {
        self.on_no_err_suffix_first_word(reader);
        self.pop_state();
    }
}
```
Magic Transition Functions
The transition functions are the 'secret sauce', so to speak, of the Flexer. They are called when a rule matches, and allow arbitrary code to manipulate the lexer. This means that the flexer can be used to define very complex grammars while still keeping a simple interface and ensuring performant execution.
You'll note that all of these functions have a few things in common:

- They have a type parameter `R` that conforms to the `prelude::LazyReader` trait.
- They take an argument of type `R`, which is the reader over which the lexer is running, as the first non-`self` argument to the function.
- Any additional arguments must be valid in the scope in which the specialisation rules are going to be generated.

These properties, combined, allow the transition functions to manipulate the text being read by the lexer.
Specializing the Lexer
In order to actually use the lexer that you've defined, you need to specialize it to the rules that you define. Unfortunately, `cargo` doesn't have support for post-build hooks, and so this is a little more involved than we'd like it to be.

- Create a file that performs the definition of the lexer as above. It can use multiple files in its crate as long as they are publicly exposed.
- Create a separate cargo project that has a prebuild hook in its `build.rs`.
- In that `build.rs`, you need to:
  - Import the lexer definition and instantiate it using `::define()`.
  - Call `State::specialize()` on the resultant lexer. This will generate a string that contains the optimised lexer implementation.
  - Write both the generated code and the code from the original lexer definition into an output file.
- Re-export this output file from your cargo project's `lib.rs`.

A sketch of such a `build.rs` follows.
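The crate name `lexer_definition` and the file paths below are illustrative assumptions, not prescribed names; adapt them to your own project layout.

```rust
// build.rs: a sketch of the specialization step described above.
use std::fs::File;
use std::io::Write;

use enso_flexer::Definition;
use enso_flexer::State;
use lexer_definition::Lexer; // assumed name for the definition crate

fn main() {
    // Instantiate the lexer definition.
    let lexer = Lexer::define();
    // Generate the specialized implementation as a string.
    let specialized = lexer.specialize().expect("Failed to specialize the lexer.");
    // Concatenate the original definition with the generated code, writing the
    // result to a file that this crate then re-exports from its lib.rs.
    let definition = std::fs::read_to_string("../lexer_definition/src/lib.rs")
        .expect("Failed to read the lexer definition.");
    let mut output = File::create("src/lexer.rs").expect("Failed to create the output file.");
    output.write_all(definition.as_bytes()).expect("Failed to write the definition.");
    output.write_all(specialized.as_bytes()).expect("Failed to write the generated code.");
}
```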
The process of specialization will generate quite a bit of code, but most importantly it will generate `pub fn run<R:LazyReader>(&mut self, mut reader:R) -> Result<Output>`, where `Output` is your lexer's output type. All of these functions are defined on your lexer type (the one whose name is provided to `specialize()`).
In Summary
The flexer allows its clients to define highly optimised lexer implementations that are capable of lexing languages of significant complexity.
Dependencies
~17–31MB
~468K SLoC