1 unstable release

0.1.0 Mar 22, 2023

Used in regex-tokenizer

MIT license

170 lines


A regex-based tokenizer with a minimal DSL to define it!


tokenizer! {
    SimpleTokenizer

    r"[a-zA-Z]\w*" => Identifier
    r"\d+" => Number
    r"\s+" => _
}

Then, inside a function:

let tokenizer = SimpleTokenizer::new();

This definition generates an enum called SimpleTokenizer_types, with the variants Identifier and Number. Matches of regexes whose class is _ are skipped and produce no token. If the tokenizer finds a substring that matches none of the specified regexes, tokenization fails.

When several non-ignored regexes match the same input, the one defined first takes priority.
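The priority rule can be illustrated with a small standard-library sketch. This is not the crate's implementation: the names `NamedRule`, `is_keyword`, `is_identifier` and `classify` are hypothetical, and plain predicate functions stand in for regexes. It only shows that when two rules accept the same input, the one listed first decides the class.

```rust
/// A hypothetical rule: a class name plus a predicate standing in for a regex.
type NamedRule = (&'static str, fn(&str) -> bool);

/// Accepts only the literal keyword `if`.
fn is_keyword(s: &str) -> bool {
    s == "if"
}

/// Rough stand-in for `[a-zA-Z]\w*` applied to the whole input.
fn is_identifier(s: &str) -> bool {
    let mut chars = s.chars();
    match chars.next() {
        Some(c) if c.is_ascii_alphabetic() => {
            chars.all(|c| c.is_alphanumeric() || c == '_')
        }
        _ => false,
    }
}

/// Returns the class of the first rule, in definition order, that accepts `input`.
fn classify(rules: &[NamedRule], input: &str) -> Option<&'static str> {
    rules
        .iter()
        .find(|(_, pred)| pred(input))
        .map(|(name, _)| *name)
}
```

With the keyword rule listed first, `classify` labels `"if"` as a keyword; with the order swapped, the same input is labelled as an identifier. Rule order is therefore part of the tokenizer's definition.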

Calling tokenizer.tokenize(...) returns an iterator that extracts tokens from the query. Each token has the following fields:

    value: String,
    position: usize,
    type_: SimpleTokenizer_types,

position is the position of the token's first character within the query. A call to .next() returns None when there are no more tokens to extract.
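The behaviour described above (ignored classes, failure on unmatched input, tokens carrying a value, a position and a type) can be sketched with the standard library alone. This is a hand-written illustration, not the code the macro generates. The names `TokenType`, `Rule`, `Tokens`, `RULES` and the free function `tokenize` are hypothetical, the matcher functions only approximate the three regexes, `position` is assumed to be a byte offset, and reporting failure as `Err(position)` is an assumption, since the text does not say how the crate signals it.

```rust
/// Stand-in for the generated enum of token classes.
#[derive(Debug, Clone, Copy, PartialEq)]
enum TokenType {
    Identifier,
    Number,
}

/// A token with the three fields listed above.
#[derive(Debug, PartialEq)]
struct Token {
    value: String,
    position: usize, // byte offset of the first character (assumption)
    type_: TokenType,
}

/// A matcher returns the byte length of a match at the start of its input.
/// A class of `None` plays the role of `_` (match and skip).
type Rule = (fn(&str) -> Option<usize>, Option<TokenType>);

/// Byte length of the leading run of characters satisfying `pred`, if non-empty.
fn leading_run(s: &str, pred: fn(char) -> bool) -> Option<usize> {
    let len: usize = s.chars().take_while(|&c| pred(c)).map(char::len_utf8).sum();
    if len > 0 { Some(len) } else { None }
}

/// Approximates `[a-zA-Z]\w*`.
fn identifier(s: &str) -> Option<usize> {
    let first = s.chars().next().filter(|c| c.is_ascii_alphabetic())?;
    let head = first.len_utf8();
    let tail = leading_run(&s[head..], |c: char| c.is_alphanumeric() || c == '_');
    Some(head + tail.unwrap_or(0))
}

/// Approximates `\d+` (ASCII digits only).
fn number(s: &str) -> Option<usize> {
    leading_run(s, |c: char| c.is_ascii_digit())
}

/// Approximates `\s+`.
fn whitespace(s: &str) -> Option<usize> {
    leading_run(s, char::is_whitespace)
}

/// Rules in definition order, mirroring the macro body shown earlier.
const RULES: [Rule; 3] = [
    (identifier, Some(TokenType::Identifier)),
    (number, Some(TokenType::Number)),
    (whitespace, None),
];

/// Iterator over the tokens of `query`.
struct Tokens<'a> {
    query: &'a str,
    position: usize,
    rules: &'a [Rule],
}

impl<'a> Iterator for Tokens<'a> {
    /// `Err(position)` marks the offset of input that no rule matches.
    type Item = Result<Token, usize>;

    fn next(&mut self) -> Option<Self::Item> {
        let query = self.query;
        while self.position < query.len() {
            let start = self.position;
            let rest = &query[start..];
            // Rules are tried in definition order, so the first match wins.
            let hit = self
                .rules
                .iter()
                .find_map(|&(matcher, class)| matcher(rest).map(|len| (len, class)));
            match hit {
                None => {
                    // Report the failure once, then end the iteration.
                    self.position = query.len();
                    return Some(Err(start));
                }
                Some((len, class)) => {
                    self.position += len;
                    if let Some(type_) = class {
                        let value = rest[..len].to_string();
                        return Some(Ok(Token { value, position: start, type_ }));
                    }
                    // Ignored class: keep scanning.
                }
            }
        }
        None
    }
}

/// Starts tokenizing `query` with the given rules.
fn tokenize<'a>(query: &'a str, rules: &'a [Rule]) -> Tokens<'a> {
    Tokens { query, position: 0, rules }
}
```

Under these assumptions, `tokenize("foo 42 bar", &RULES)` yields an Identifier `foo` at position 0, a Number `42` at position 4 and an Identifier `bar` at position 7, then `None`. For `"a + b"` it yields the token `a` followed by `Err(2)`, because `+` matches none of the rules.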

