4 releases
0.2.1 | Jan 13, 2022 |
---|---|
0.2.0 | Jan 10, 2022 |
0.1.3 | Nov 15, 2021 |
#12 in #lexical-analysis
41KB
724 lines
Basic lexical analyzer for parsing and compiling.
basic_lexer is a basic lexical scanner designed for the first stage of compiler construction; it produces the tokens required by a parser. It was originally intended to support the parallel project rustlr, an LR-style parser generator, although each project is independent of the other.
Version 0.2.0 adds a new "zero-copy" tokenizer consisting of [RawToken], [StrTokenizer] and [LexSource]; the most important structure is [StrTokenizer]. The original tokenizer and its related constructs, which produce tokens containing owned strings, are still present. Neither tokenizer is optimal in performance, because neither is built from DFAs. The new tokenizing function, StrTokenizer::next_token, uses regex and is now the focus of the crate. It can count whitespace (for Python-like languages) and accurately tracks the starting line/column position of each token.
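To make the "zero-copy" idea concrete, here is a minimal sketch using only the standard library. The `Tok` enum and `tokenize` function are hypothetical names for illustration, not part of basic_lexer's API, and the scanner is hand-written rather than regex-based. The point is only that each token holds a `&str` slice borrowed from the source text instead of an owned `String`.

```rust
/// Hypothetical token type for illustration; this is NOT basic_lexer's RawToken.
/// Each variant borrows a slice of the source, so no per-token String is allocated.
#[derive(Debug, PartialEq)]
enum Tok<'a> {
    Alphanum(&'a str),
    Symbol(&'a str),
}

/// Splits `src` into tokens that borrow from `src`.
/// Whitespace is skipped; every other non-alphanumeric character is a one-character symbol.
fn tokenize(src: &str) -> Vec<Tok<'_>> {
    let mut out = Vec::new();
    let mut iter = src.char_indices().peekable();
    while let Some(&(start, c)) = iter.peek() {
        if c.is_whitespace() {
            iter.next();
        } else if c.is_alphanumeric() || c == '_' {
            // Extend the slice while the identifier-like run continues.
            let mut end = start;
            while let Some(&(i, ch)) = iter.peek() {
                if ch.is_alphanumeric() || ch == '_' {
                    end = i + ch.len_utf8();
                    iter.next();
                } else {
                    break;
                }
            }
            out.push(Tok::Alphanum(&src[start..end]));
        } else {
            iter.next();
            out.push(Tok::Symbol(&src[start..start + c.len_utf8()]));
        }
    }
    out
}
```

Because the tokens borrow from the input, the source text must outlive them; that is the usual trade-off of a zero-copy design compared with tokens that own their strings.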
Example: given the Cargo.toml file of this crate,
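// Assumes LexSource and StrTokenizer are exported at the crate root.
use basic_lexer::{LexSource, StrTokenizer};

// Load the file, then create a tokenizer that borrows from it.
let source = LexSource::new("Cargo.toml").unwrap();
let mut tokenizer = StrTokenizer::from_source(&source);
tokenizer.set_line_comment("#");     // TOML comments start with '#'
tokenizer.keep_comment = true;       // return comments as tokens
tokenizer.keep_newline = false;      // do not return newline tokens
tokenizer.keep_whitespace = false;   // do not return whitespace tokens
while let Some(token) = tokenizer.next() {
    println!("Token: {:?}", &token);
}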
This code produces the following output:
Token: (Symbol("["), 1, 1)
Token: (Alphanum("package"), 1, 2)
Token: (Symbol("]"), 1, 9)
Token: (Alphanum("name"), 2, 1)
Token: (Symbol("="), 2, 6)
Token: (Strlit("\"basic_lexer\""), 2, 8)
Token: (Alphanum("version"), 3, 1)
Token: (Symbol("="), 3, 9)
Token: (Strlit("\"0.2.0\""), 3, 11)
Token: (Alphanum("edition"), 4, 1)
Token: (Symbol("="), 4, 9)
Token: (Strlit("\"2018\""), 4, 11)
...
Token: (Symbol("]"), 8, 35)
Token: (Verbatim("# See more keys and their definitions at https://doc.rust-lang.org/cargo/reference/manifest.html"), 10, 1)
and so on. The numbers returned alongside each token are the line and column positions of the start of the token.
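The position bookkeeping can be sketched independently of the crate. The function below, `words_with_positions`, is a hypothetical helper and not how basic_lexer is implemented: it splits on whitespace only and reports the 1-based line and column of the first character of each run. It assumes columns count characters rather than bytes, which may differ from the crate's own convention.

```rust
/// Returns each whitespace-separated run in `src` together with the
/// 1-based (line, column) of its first character.
/// Assumption: columns count characters, not bytes.
fn words_with_positions(src: &str) -> Vec<(&str, usize, usize)> {
    let mut out = Vec::new();
    let (mut line, mut col) = (1usize, 1usize);
    // Start of the run currently being scanned: (byte offset, line, column).
    let mut start: Option<(usize, usize, usize)> = None;
    for (i, c) in src.char_indices() {
        if c.is_whitespace() {
            // A run ends at the first whitespace character after it.
            if let Some((s, l, k)) = start.take() {
                out.push((&src[s..i], l, k));
            }
        } else if start.is_none() {
            start = Some((i, line, col));
        }
        // Advance the position counters past `c`.
        if c == '\n' {
            line += 1;
            col = 1;
        } else {
            col += 1;
        }
    }
    // Flush a run that reaches the end of the input.
    if let Some((s, l, k)) = start {
        out.push((&src[s..], l, k));
    }
    out
}
```

Recording the position when a token starts, rather than when it ends, is what makes the reported numbers point at the token's first character.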
Dependencies
~2–3MB
~53K SLoC