#tokenizer #scanner #token #lexical-analysis #basic #compiling #analyzer

basic_lexer

Basic lexical analyzer for parsing and compiling

4 releases

0.2.1 Jan 13, 2022
0.2.0 Jan 10, 2022
0.1.3 Nov 15, 2021

#8 in #lexical-analysis

MIT license

41KB
724 lines

lib.rs:

basic_lexer is a basic lexical scanner designed for the first stage of compiler construction; it produces the tokens required by a parser. It was originally intended to support the parallel project rustlr, an LR-style parser generator, although each project is independent of the other.

For version 0.2.0, a new "zero-copy" tokenizer has been added, consisting of [RawToken], [StrTokenizer] and [LexSource]; the most important of these is [StrTokenizer]. The original tokenizer and related constructs, which produce tokens containing owned strings, are still present. Neither tokenizer offers optimal performance, however, since neither is built from a DFA. The new tokenizing function, StrTokenizer::next_token, uses regex and is now the focus of the crate. It is capable of counting whitespace (for Python-like languages) and accurately tracks the starting line/column position of each token.

Example: given the Cargo.toml file of this crate,

  // Read the entire file into a LexSource, which owns the source string.
  let source = LexSource::new("Cargo.toml").unwrap();
  let mut tokenizer = StrTokenizer::from_source(&source);
  tokenizer.set_line_comment("#");   // '#' starts a line comment in TOML
  tokenizer.keep_comment = true;     // emit comments as tokens
  tokenizer.keep_newline = false;    // skip newline tokens
  tokenizer.keep_whitespace = false; // skip whitespace tokens
  while let Some(token) = tokenizer.next() {
     println!("Token: {:?}", &token);
  }

This code produces the following output:

 Token: (Symbol("["), 1, 1)
 Token: (Alphanum("package"), 1, 2) 
 Token: (Symbol("]"), 1, 9)
 Token: (Alphanum("name"), 2, 1)
 Token: (Symbol("="), 2, 6)
 Token: (Strlit("\"basic_lexer\""), 2, 8)
 Token: (Alphanum("version"), 3, 1)
 Token: (Symbol("="), 3, 9)
 Token: (Strlit("\"0.2.0\""), 3, 11)
 Token: (Alphanum("edition"), 4, 1)
 Token: (Symbol("="), 4, 9)
 Token: (Strlit("\"2018\""), 4, 11)
 ...
 Token: (Symbol("]"), 8, 35)
 Token: (Verbatim("# See more keys and their definitions at https://doc.rust-lang.org/cargo/reference/manifest.html"), 10, 1)

and so on. The two numbers returned alongside each token are the line and column positions at which the token starts.
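
The [RawToken] variants seen in this output can be matched on directly. Below is a minimal sketch of that, assuming (as the debug output above suggests) that the iterator yields (token, line, column) triples and that the Alphanum, Symbol and Strlit variants carry borrowed string slices; it picks out the key = "value" pairs of a Cargo.toml:

  use basic_lexer::{LexSource, RawToken, StrTokenizer};

  fn main() {
     let source = LexSource::new("Cargo.toml").unwrap();
     let mut tokenizer = StrTokenizer::from_source(&source);
     tokenizer.set_line_comment("#");   // '#' starts a line comment in TOML
     tokenizer.keep_comment = false;    // skip comments this time
     tokenizer.keep_newline = false;
     tokenizer.keep_whitespace = false;
     // Remember the last identifier seen; when a string literal follows,
     // report it as that identifier's value.
     let mut last_key: Option<String> = None;
     while let Some((token, line, _column)) = tokenizer.next() {
        match token {
           RawToken::Alphanum(key) => last_key = Some(key.to_string()),
           RawToken::Symbol("=") => {} // keep last_key; the value comes next
           RawToken::Strlit(value) => {
              if let Some(key) = last_key.take() {
                 println!("line {}: {} = {}", line, key, value);
              }
           }
           _ => last_key = None, // any other token breaks the key = "value" shape
        }
     }
  }

Because the string slices inside each token borrow from the LexSource, nothing is copied until a key is explicitly converted to an owned String, which is the point of the zero-copy design.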

Dependencies

~2.1–3MB
~53K SLoC