9 releases

0.1.4 Mar 11, 2022
0.1.3 Jan 17, 2021
0.1.2 May 1, 2019
0.0.8 Feb 21, 2019
0.0.4 May 31, 2016

#39 in Parser tooling

Download history: between 1 and 103 downloads per week from 2023-11-26 to 2024-03-10, peaking at 103/week on 2024-02-25.

276 downloads per month
Used in 8 crates (7 directly)

MIT license

33KB
864 lines


Tokenizers

This crate provides multiple tokenizers built on top of Scanner.

  • EbnfTokenizer: for tokenizing an EBNF grammar.
let grammar = r#"
    expr   := expr ('+'|'-') term | term ;
    term   := term ('*'|'/') factor | factor ;
    factor := '-' factor | power ;
    power  := ufact '^' factor | ufact ;
    ufact  := ufact '!' | group ;
    group  := num | '(' expr ')' ;
"#;
let mut tok = EbnfTokenizer::new(grammar.chars());
  • LispTokenizer: for tokenizing Lisp-like input.
LispTokenizer::new("(+ 3 4 5)".chars());
  • MathTokenizer: emits MathToken tokens.
MathTokenizer::new("3.4e-2 * sin(x)/(7! % -4)".chars());
  • DelimTokenizer: emits tokens split by some delimiter.
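
The page gives no example for DelimTokenizer, so its constructor is not reproduced here. As a rough illustration of the idea only, here is a std-only sketch of an iterator that splits a character stream on a set of delimiters. The type SplitOnDelims, its fields, and the keep_delims flag are hypothetical names invented for this sketch; they are not the crate's API.

```rust
use std::iter::Peekable;

/// Hypothetical stand-in for a delimiter-based tokenizer.
/// This is NOT the crate's DelimTokenizer; names and behavior are assumptions.
struct SplitOnDelims<I: Iterator<Item = char>> {
    src: Peekable<I>,
    delims: Vec<char>,
    keep_delims: bool, // when true, each delimiter is emitted as its own token
}

impl<I: Iterator<Item = char>> SplitOnDelims<I> {
    fn new(src: I, delims: &str, keep_delims: bool) -> Self {
        SplitOnDelims {
            src: src.peekable(),
            delims: delims.chars().collect(),
            keep_delims,
        }
    }
}

impl<I: Iterator<Item = char>> Iterator for SplitOnDelims<I> {
    type Item = String;

    fn next(&mut self) -> Option<String> {
        loop {
            // Stop when the input is exhausted.
            let c = *self.src.peek()?;
            if self.delims.contains(&c) {
                self.src.next();
                if self.keep_delims {
                    return Some(c.to_string());
                }
                // Delimiter dropped: look at the next character.
                continue;
            }
            // Accumulate characters up to the next delimiter or end of input.
            let mut tok = String::new();
            while let Some(&c) = self.src.peek() {
                if self.delims.contains(&c) {
                    break;
                }
                tok.push(c);
                self.src.next();
            }
            return Some(tok);
        }
    }
}
```

With this sketch, SplitOnDelims::new("a, b,c".chars(), ", ", false) yields the tokens a, b and c, while passing true for keep_delims also yields each delimiter character as a separate token.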

Scanner

Scanner is the building block for implementing tokenizers. You build one from an Iterator over chars and call its scan methods to extract tokens. The tokenizers listed above serve as examples.

Example

// Define a Tokenizer
struct Tokenizer<I: Iterator<Item=char>>(lexers::Scanner<I>);

impl<I: Iterator<Item=char>> Iterator for Tokenizer<I> {
    type Item = String;
    fn next(&mut self) -> Option<Self::Item> {
        self.0.scan_whitespace();
        self.0.scan_math_op()
            .or_else(|| self.0.scan_number())
            .or_else(|| self.0.scan_identifier())
    }
}

fn tokenizer<I: Iterator<Item=char>>(input: I) -> Tokenizer<I> {
    Tokenizer(lexers::Scanner::new(input))
}

// Use it to tokenize a math expression
let mut lex = tokenizer("3+4*2/-(1-5)^2^3".chars());
let token = lex.next();

Tips

  • scan_X functions try to consume a text object from the scanner, such as a number, an identifier, or a quoted string.

  • buffer_pos and set_buffer_pos are used for backtracking. This works only while the Scanner's buffer still holds the data you need, that is, before that data has been consumed or discarded.
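
The two tips above can be illustrated with a small std-only sketch of a buffered scanner. MiniScanner is a hypothetical type written for this illustration, not the crate's Scanner: only the method names buffer_pos and set_buffer_pos come from the text above, while their signatures, next_char, scan_digits and scan_decimal are assumptions. Unlike a real scanner, this sketch never discards its buffer, so rewinding always succeeds.

```rust
/// Hypothetical miniature scanner, NOT the crate's Scanner.
/// It keeps every character it has read so that it can rewind.
struct MiniScanner<I: Iterator<Item = char>> {
    src: I,
    buf: Vec<char>, // characters pulled from `src` so far
    pos: usize,     // index in `buf` of the next character to read
}

impl<I: Iterator<Item = char>> MiniScanner<I> {
    fn new(src: I) -> Self {
        MiniScanner { src, buf: Vec::new(), pos: 0 }
    }

    /// Current read position inside the buffer (assumed signature).
    fn buffer_pos(&self) -> usize {
        self.pos
    }

    /// Rewind (or advance) to a previously saved position (assumed signature).
    fn set_buffer_pos(&mut self, pos: usize) {
        self.pos = pos.min(self.buf.len());
    }

    /// Read one character, refilling the buffer from the source if needed.
    fn next_char(&mut self) -> Option<char> {
        if self.pos == self.buf.len() {
            let c = self.src.next()?;
            self.buf.push(c);
        }
        let c = self.buf[self.pos];
        self.pos += 1;
        Some(c)
    }

    /// A scan_X-style function: consume a run of ASCII digits, if any.
    fn scan_digits(&mut self) -> Option<String> {
        let mut out = String::new();
        loop {
            let mark = self.buffer_pos();
            match self.next_char() {
                Some(c) if c.is_ascii_digit() => out.push(c),
                Some(_) => {
                    // Not a digit: put it back by rewinding one step.
                    self.set_buffer_pos(mark);
                    break;
                }
                None => break,
            }
        }
        if out.is_empty() { None } else { Some(out) }
    }

    /// Consume `digits '.' digits`; on failure rewind to where we started.
    fn scan_decimal(&mut self) -> Option<String> {
        let start = self.buffer_pos();
        let whole = self.scan_digits();
        let dot = self.next_char();
        let frac = self.scan_digits();
        match (whole, dot, frac) {
            (Some(w), Some('.'), Some(f)) => Some(format!("{}.{}", w, f)),
            _ => {
                self.set_buffer_pos(start);
                None
            }
        }
    }
}
```

In this sketch, scan_decimal on the input 12.x fails and restores the position to 0, so a later scan_digits call can still return 12. Saving the position before a speculative scan and restoring it on failure is the pattern the second tip describes.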

No runtime deps