Lexxor
A fast, extensible, greedy, single-pass text tokenizer implemented in Rust. Lexxor is designed for high-performance tokenization with minimal memory allocations, making it suitable for parsing large files or real-time text processing.
Overview
Lexxor is a tokenizer library that lets you define and compose various token-matching strategies. It processes input character by character, identifying the longest possible match at each position using a set of configurable matchers; for example, given the input 123.45, both IntegerMatcher (matching 123) and FloatMatcher (matching 123.45) can match, and the longer float match wins. A precedence mechanism resolves conflicts between matchers that match the same length of input.
Key Features
- High Performance: Single-pass tokenization with minimal memory allocations
- Flexible Matching: Composable matcher system with precedence control
- Zero-Copy Design: Uses ArrayVec for efficient memory management
- Rich Token Information: Tokens include type, value, line, and column information
- Extensible: Create custom matchers for domain-specific tokenization needs
- Iterator Interface: Simple integration with Rust's iterator ecosystem
Architecture
Lexxor consists of four main components:
- LexxorInput: Provides a stream of characters from various sources
- Matchers: Identify specific patterns in the input (words, numbers, symbols, etc.)
- Tokens: Represent the results of successful matches
- Lexxor Engine: Orchestrates the tokenization process
Built-in Matchers
Lexxor provides several built-in matchers for common token types:
- WordMatcher: Matches alphabetic words
- IntegerMatcher: Matches integer numbers
- FloatMatcher: Matches floating-point numbers
- SymbolMatcher: Matches non-alphanumeric symbols
- WhitespaceMatcher: Matches whitespace characters (spaces, tabs, newlines)
- KeywordMatcher: Matches specific keywords (but not as substrings of longer words)
- ExactMatcher: Matches exact string patterns (operators, delimiters, etc.)
Precedence System
Matchers can be assigned precedence values to resolve conflicts when multiple matchers could match the same input. This allows for sophisticated tokenization strategies, such as recognizing keywords as distinct from regular words.
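For example, to tokenize if as a keyword rather than an ordinary word, give the keyword matcher a higher precedence than the word matcher. The sketch below follows the struct-literal style used elsewhere in this README; KeywordMatcher's exact fields (the keywords list in particular) are an assumption here, so check the crate docs for its real constructor:

use lexxor::matcher::Matcher;
use lexxor::matcher::keyword::KeywordMatcher; // module path assumed
use lexxor::matcher::word::WordMatcher;

// Both matchers can match "if" at the same length; the higher
// precedence value wins the tie, so "if" is emitted as a keyword.
let matchers: Vec<Box<dyn Matcher>> = vec![
    // The `keywords` field name is illustrative, not confirmed API.
    Box::new(KeywordMatcher { keywords: vec!["if".into(), "else".into()], index: 0, precedence: 1, running: true }),
    Box::new(WordMatcher { index: 0, precedence: 0, running: true }),
];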
Usage Examples
Basic Tokenization
use lexxor::Lexx;
use lexxor::input::InputString;
use lexxor::matcher::word::WordMatcher;
use lexxor::matcher::whitespace::WhitespaceMatcher;
use lexxor::matcher::symbol::SymbolMatcher;
use lexxor::matcher::integer::IntegerMatcher;
use lexxor::matcher::float::FloatMatcher;

fn main() {
    // Create a simple input string
    let input_text = "Hello world! This is 42 and 3.14159.";
    let input = InputString::new(input_text.to_string());

    // Create a Lexxor tokenizer with standard matchers
    let lexx = Lexx::<512>::new(
        Box::new(input),
        vec![
            Box::new(WhitespaceMatcher { index: 0, column: 0, line: 0, precedence: 0, running: true }),
            Box::new(WordMatcher { index: 0, precedence: 0, running: true }),
            Box::new(IntegerMatcher { index: 0, precedence: 0, running: true }),
            Box::new(FloatMatcher { index: 0, precedence: 0, dot: false, float: false, running: true }),
            Box::new(SymbolMatcher { index: 0, precedence: 0, running: true }),
        ],
    );

    // Process tokens using the Iterator interface
    for token in lexx {
        println!("{}", token);
    }
}
Custom Matchers
You can create custom matchers by implementing the Matcher trait:
use lexxor::matcher::{Matcher, MatcherResult};
use lexxor::token::{Token, TOKEN_TYPE_CUSTOM};
use std::collections::HashMap;
use std::fmt::Debug;

// Define a custom token type
const TOKEN_TYPE_HEX_COLOR: u16 = 200;

#[derive(Debug)]
struct HexColorMatcher {
    index: usize,
    precedence: u8,
    running: bool,
}

impl Matcher for HexColorMatcher {
    fn reset(&mut self, _ctx: &mut Box<HashMap<String, i32>>) {
        self.index = 0;
        self.running = true;
    }

    fn find_match(
        &mut self,
        oc: Option<char>,
        value: &[char],
        _ctx: &mut Box<HashMap<String, i32>>,
    ) -> MatcherResult {
        // Implementation for matching hex color codes such as "#ff00aa":
        // accept '#' at index 0, hex digits at indices 1 through 6, and
        // report a match of type TOKEN_TYPE_HEX_COLOR once all seven
        // characters have been consumed.
        // ...
    }

    fn is_running(&self) -> bool {
        self.running
    }

    fn precedence(&self) -> u8 {
        self.precedence
    }
}
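Once implemented, the custom matcher is registered alongside the built-ins, in the same style as the basic example above (a sketch; the precedence of 1 resolves any equal-length conflicts with the other matchers in its favor):

let lexx = Lexx::<512>::new(
    Box::new(InputString::new("color: #ff00aa".to_string())),
    vec![
        Box::new(HexColorMatcher { index: 0, precedence: 1, running: true }),
        Box::new(WhitespaceMatcher { index: 0, column: 0, line: 0, precedence: 0, running: true }),
        Box::new(WordMatcher { index: 0, precedence: 0, running: true }),
        Box::new(SymbolMatcher { index: 0, precedence: 0, running: true }),
    ],
);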
Performance
Lexxor is optimized for high-performance tokenization:
Benchmark | Time |
---|---|
Small file (15 bytes) | ~1.2 µs |
UTF-8 sample (13 KB) | ~350 µs |
Large file (1.8 MB) | ~45 ms |
These benchmarks were measured on standard hardware. Your results may vary depending on your system specifications.
Performance Considerations
- Lexxor uses a fixed-size buffer for token storage, specified as Lexx<CAP>, where CAP is the maximum token size in characters
- If a token exceeds this size, Lexxor will panic
- Choose an appropriate buffer size for your use case to balance memory usage against the maximum token length, as in the sketch below
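For example, if your input may contain identifiers or literals of up to a few hundred characters, a 1024-character buffer leaves comfortable headroom. A minimal sketch reusing the setup from the basic example (source_text is an illustrative placeholder):

// CAP = 1024: tokens up to 1024 characters are supported;
// anything longer causes a panic, so size generously.
let lexx = Lexx::<1024>::new(
    Box::new(InputString::new(source_text)),
    vec![/* matchers as in the basic example */],
);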
Installation
Add Lexxor to your Cargo.toml:

[dependencies]
lexxor = "0.9.0"
Token Types
Lexxor defines several standard token types:
- TOKEN_TYPE_INTEGER (1): Integer numbers
- TOKEN_TYPE_FLOAT (2): Floating-point numbers
- TOKEN_TYPE_WHITESPACE (3): Whitespace characters
- TOKEN_TYPE_WORD (4): Word tokens (alphabetic characters)
- TOKEN_TYPE_SYMBOL (5): Symbol characters
- TOKEN_TYPE_EXACT (6): Exact string matches
- TOKEN_TYPE_KEYWORD (7): Reserved keywords
You can define custom token types starting from higher numbers (e.g., 100+) for your application-specific needs.
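A token's type id can then be used to dispatch on it. In this sketch, which reuses the tokenizer from the basic example, the token_type and value field names are assumptions based on the token information listed above, so check the Token struct for the actual names:

use lexxor::token::{TOKEN_TYPE_INTEGER, TOKEN_TYPE_WORD};

for token in lexx {
    // Field names `token_type` and `value` are illustrative.
    match token.token_type {
        TOKEN_TYPE_INTEGER => println!("integer: {}", token.value),
        TOKEN_TYPE_WORD => println!("word: {}", token.value),
        _ => {}
    }
}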
Input Sources
Lexxor supports multiple input sources through the LexxorInput trait:

- InputString: Tokenize from a String
- InputReader: Tokenize from any source implementing Read

You can implement custom input sources by implementing the LexxorInput trait.
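For example, to tokenize directly from a file (a sketch: InputReader's constructor is assumed here to wrap any std::io::Read value, which may not match the actual signature):

use lexxor::input::InputReader;
use std::fs::File;

let file = File::open("input.txt").expect("failed to open file");
// Hypothetical constructor: InputReader is assumed to accept a Read impl.
let input = InputReader::new(file);
// `input` can then be passed to Lexx::new exactly as in the basic example.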
Error Handling
Lexxor returns a LexxError in two cases:

- TokenNotFound: No matcher could match the current input
- Error: Some other error occurred during tokenization
To successfully parse an entire input, ensure you have matchers that can handle all possible character sequences.
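If you want to react to these errors explicitly rather than relying on matcher coverage alone, something like the following can work. This is only a sketch: the next_token method, the LexxError import path, and the variants' string payloads are assumptions, not confirmed API, so consult the crate docs for the exact signatures.

use lexxor::LexxError; // import path assumed

// Hypothetical: assumes a next_token-style method returning
// Result<Option<Token>, LexxError>; the real API may differ.
loop {
    match lexx.next_token() {
        Ok(Some(token)) => println!("{}", token),
        Ok(None) => break, // end of input
        Err(LexxError::TokenNotFound(msg)) => {
            eprintln!("no matcher matched: {}", msg);
            break;
        }
        Err(LexxError::Error(msg)) => {
            eprintln!("tokenization failed: {}", msg);
            break;
        }
    }
}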
License
MIT License
Contributing
Contributions are welcome! Please feel free to submit a Pull Request.