2 releases
0.1.1 | Jan 14, 2024 |
---|---|
0.1.0 | Jan 14, 2024 |
Lexer
My personal implementation of a lexer.
Principles
The lexer is plugin-based. It is neither a parser nor a compiler.
Tokens
There are 8 predefined token kinds (the examples are illustrative, not mandatory):
TokenKind | Explanation | Examples |
---|---|---|
KEYWORD | Reserved words | if return ... |
DELIMITER | Paired delimiter symbols | () [] {} ... |
PUNCTUATION | Punctuation symbols | ; . ... |
OPERATOR | Symbols that operate on arguments | + - = ... |
COMMENT | Line or block comments | // /* ... */ ... |
WHITESPACE | Non-printable characters | - |
LITERAL | Numerical, logical, textual values | 1 true "true" ... |
IDENTIFIER | Names assigned in a program | x temp PRINT ... |
These token kinds (except IDENTIFIER) should be constructed with a name, which is used to differentiate tokens of the same kind.
Each TokenKind can be associated with one or more Pattern values. A Tokenizer matches a Pattern against a string and, on success, gives a Token.
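To make the relationship between these four names concrete, here is a minimal, self-contained sketch. It does not reproduce the crate's source: the types below are hypothetical stand-ins, and the method names matches and tokenize, as well as storing the pattern as a boxed trait object, are assumptions made for illustration.

```rust
// Hypothetical stand-ins for illustration; NOT the crate's actual definitions.

/// A token kind. Most kinds carry a name; IDENTIFIER does not.
#[allow(dead_code)]
#[derive(Debug, Clone, PartialEq)]
enum TokenKind {
    OPERATOR(&'static str),
    LITERAL(&'static str),
    IDENTIFIER,
}

/// A matched piece of input together with its kind.
#[derive(Debug, Clone, PartialEq)]
struct Token {
    kind: TokenKind,
    value: String,
}

/// Anything that can decide whether a whole candidate string matches.
/// (`matches` is an assumed method name.)
trait Pattern {
    fn matches(&self, candidate: &str) -> bool;
}

/// A char matches a candidate made of exactly that one character.
impl Pattern for char {
    fn matches(&self, candidate: &str) -> bool {
        let mut chars = candidate.chars();
        chars.next() == Some(*self) && chars.next().is_none()
    }
}

/// Any predicate over &str can serve as a pattern.
impl<F> Pattern for F
where
    F: Fn(&str) -> bool,
{
    fn matches(&self, candidate: &str) -> bool {
        self(candidate)
    }
}

/// Pairs a TokenKind with a Pattern and produces Tokens on a match.
struct Tokenizer {
    kind: TokenKind,
    pattern: Box<dyn Pattern>,
}

impl Tokenizer {
    fn new(kind: TokenKind, pattern: impl Pattern + 'static) -> Self {
        Tokenizer { kind, pattern: Box::new(pattern) }
    }

    /// Returns a Token if the pattern accepts the candidate, None otherwise.
    fn tokenize(&self, candidate: &str) -> Option<Token> {
        if self.pattern.matches(candidate) {
            Some(Token { kind: self.kind.clone(), value: candidate.to_string() })
        } else {
            None
        }
    }
}
```

Boxing the pattern lets tokenizers built from a char, a closure, or any other pattern type share one Vec, which is what the builder call in the example further down appears to rely on.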
Lexer
The Lexer should be constructed with a LexerBuilder, which wraps several Tokenizer values.
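As a rough illustration of the builder idea, the sketch below shows one way a LexerBuilder could collect Tokenizer values before producing a Lexer. Only the method names builder, extend and build appear in this README. The struct fields and the simplified Tokenizer (reduced to a name) are assumptions, not the crate's real definitions.

```rust
// Hypothetical stand-ins; the crate's real types may look different.

/// Simplified tokenizer, reduced to a name for the purpose of this sketch.
#[derive(Debug)]
struct Tokenizer {
    name: &'static str,
}

/// The finished lexer owns its tokenizers.
#[derive(Debug)]
struct Lexer {
    tokenizers: Vec<Tokenizer>,
}

/// Accumulates tokenizers, then hands them over to a Lexer.
#[derive(Debug, Default)]
struct LexerBuilder {
    tokenizers: Vec<Tokenizer>,
}

impl Lexer {
    /// Entry point mirroring `Lexer::builder()` from the README example.
    fn builder() -> LexerBuilder {
        LexerBuilder::default()
    }
}

impl LexerBuilder {
    /// Appends every tokenizer from the iterator, keeping insertion order.
    fn extend(mut self, tokenizers: impl IntoIterator<Item = Tokenizer>) -> Self {
        self.tokenizers.extend(tokenizers);
        self
    }

    /// Consumes the builder and returns the configured lexer.
    fn build(self) -> Lexer {
        Lexer { tokenizers: self.tokenizers }
    }
}
```

Because each method takes and returns the builder by value, the calls chain naturally and the builder cannot be reused after build.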
Examples
Simple maths Lexer
// One Tokenizer per single-character operator.
let plus = Tokenizer::new(TokenKind::OPERATOR("PLUS"), '+');
let minus = Tokenizer::new(TokenKind::OPERATOR("MINUS"), '-');
let star = Tokenizer::new(TokenKind::OPERATOR("STAR"), '*');
let slash = Tokenizer::new(TokenKind::OPERATOR("SLASH"), '/');
let equal = Tokenizer::new(TokenKind::OPERATOR("EQUAL"), '=');
// A closure pattern: accepts digits with at most one decimal point.
let number = Tokenizer::new(TokenKind::LITERAL("NUMBER"), |s: &str| {
    let mut dot_seen = false;
    for ch in s.chars() {
        if !ch.is_ascii_digit() && (ch != '.' || dot_seen) {
            return false;
        } else if ch == '.' {
            dot_seen = true;
        }
    }
    true
});
// A regex pattern for identifiers.
let id_regex = Regex::new(r"[a-zA-Z_$][a-zA-Z_$0-9]*").unwrap();
let id = Tokenizer::new(TokenKind::IDENTIFIER, id_regex);
let whitespace = Tokenizer::new(TokenKind::WHITESPACE("SPACE"), ' ');

// Collect all tokenizers into a lexer.
let lexer = Lexer::builder()
    .extend(vec![plus, minus, star, slash, equal, number, id, whitespace])
    .build();

lexer.tokenize("x_4 = 2 + 2 = 4 * 0.5")?;
/* [Token { kind: IDENTIFIER, value: "x_4" },
    Token { kind: WHITESPACE("SPACE"), value: " " },
    Token { kind: OPERATOR("EQUAL"), value: "=" },
    Token { kind: WHITESPACE("SPACE"), value: " " },
    Token { kind: LITERAL("NUMBER"), value: "2" },
    Token { kind: WHITESPACE("SPACE"), value: " " },
    Token { kind: OPERATOR("PLUS"), value: "+" },
    Token { kind: WHITESPACE("SPACE"), value: " " },
    Token { kind: LITERAL("NUMBER"), value: "2" },
    Token { kind: WHITESPACE("SPACE"), value: " " },
    Token { kind: OPERATOR("EQUAL"), value: "=" },
    Token { kind: WHITESPACE("SPACE"), value: " " },
    Token { kind: LITERAL("NUMBER"), value: "4" },
    Token { kind: WHITESPACE("SPACE"), value: " " },
    Token { kind: OPERATOR("STAR"), value: "*" },
    Token { kind: WHITESPACE("SPACE"), value: " " },
    Token { kind: LITERAL("NUMBER"), value: "0.5" }] */
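This README does not say how the lexer chooses where one token ends and the next begins. One common strategy that would produce the output above is longest match (maximal munch): at each position, try the longest remaining slice first and shrink it until some rule accepts it. The sketch below is self-contained and purely illustrative. Rule, maths_rules, and this tokenize function are hypothetical names, and the predicates are hand-written approximations of the patterns used in the example, not the crate's code.

```rust
// Hypothetical model: a rule is a label plus a whole-string predicate.
// This is NOT the crate's implementation, only one way such output could arise.
struct Rule {
    name: &'static str,
    matches: fn(&str) -> bool,
}

/// Digits with at most one dot and at least one digit.
fn is_number(s: &str) -> bool {
    s.chars().all(|c| c.is_ascii_digit() || c == '.')
        && s.matches('.').count() <= 1
        && s.chars().any(|c| c.is_ascii_digit())
}

/// A letter, '_' or '$', followed by letters, digits, '_' or '$'.
fn is_identifier(s: &str) -> bool {
    let mut chars = s.chars();
    match chars.next() {
        Some(c) if c.is_ascii_alphabetic() || c == '_' || c == '$' => {
            chars.all(|c| c.is_ascii_alphanumeric() || c == '_' || c == '$')
        }
        _ => false,
    }
}

/// Rules roughly equivalent to part of the maths example.
fn maths_rules() -> Vec<Rule> {
    vec![
        Rule { name: "PLUS", matches: |s| s == "+" },
        Rule { name: "STAR", matches: |s| s == "*" },
        Rule { name: "EQUAL", matches: |s| s == "=" },
        Rule { name: "NUMBER", matches: is_number },
        Rule { name: "IDENTIFIER", matches: is_identifier },
        Rule { name: "SPACE", matches: |s| s == " " },
    ]
}

/// Longest-match scan. On failure, returns the byte offset where no rule matched.
fn tokenize(rules: &[Rule], input: &str) -> Result<Vec<(&'static str, String)>, usize> {
    let mut tokens = Vec::new();
    let mut start = 0;
    while start < input.len() {
        let mut found = None;
        let mut end = input.len();
        // Try the longest candidate first, then shrink one byte at a time.
        while end > start {
            if input.is_char_boundary(end) {
                let candidate = &input[start..end];
                if let Some(rule) = rules.iter().find(|r| (r.matches)(candidate)) {
                    found = Some((rule.name, end));
                    break;
                }
            }
            end -= 1;
        }
        match found {
            Some((name, end)) => {
                tokens.push((name, input[start..end].to_string()));
                start = end;
            }
            None => return Err(start),
        }
    }
    Ok(tokens)
}
```

Under these assumptions, x_4 and 0.5 each come out as a single token because the longest accepted slice wins. The scan is quadratic in the worst case, which is acceptable for a sketch but not necessarily for large inputs.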