1 unstable release
Uses new Rust 2024
new 0.1.0 | Mar 21, 2025 |
---|
#33 in #lexer-tokenizer
55KB
630 lines
Tokenise
A flexible lexical analyser (tokeniser) for parsing text into configurable token types.
Overview
tokenise
allows you to split text into tokens based on customisable rules for special characters, delimiters, and comments. It's designed to be flexible enough to handle various syntax styles while remaining simple to configure.
Features
- Unicode support (using grapheme clusters)
- Configurable special characters and delimiters
- Support for paired delimiters (e.g., parentheses, brackets)
- Support for balanced delimiters (e.g., quotation marks)
- Single-line and multi-line comment handling
- Whitespace and newline preservation
Usage
Add this to your Cargo.toml
:
[dependencies]
tokenise = "0.1.0"
Basic Example
use tokenise::{Tokeniser, TokenState};
fn main() {
// Create a new tokeniser
let mut tokeniser = Tokeniser::new();
// Configure tokeniser with rules
tokeniser.add_specials(".,;:!?");
tokeniser.add_delimiter_pairs(&vec!["()", "[]", "{}"]).unwrap();
tokeniser.add_balanced_delimiter("\"").unwrap();
tokeniser.set_sl_comment("//").unwrap();
tokeniser.set_ml_comment("/*", "*/").unwrap();
// Tokenise some source text
let source = "let x = 42; // The answer\nprint(\"Hello world!\");";
let tokens = tokeniser.tokenise(source).unwrap();
// Work with the resulting tokens
for token in tokens {
println!("{:?}: '{}'", token.get_state(), token.value());
}
}
Token Types
The tokeniser recognises several token types represented by the TokenState
enum:
Word
: Non-special character sequencesLDelimiter
/RDelimiter
: Left/right delimiters of a pair (e.g., '(', ')')BDelimiter
: Balanced delimiters (e.g., quotation marks)SymbolString
: Special charactersNewLine
: Line breaksWhiteSpace
: Spaces, tabs, etc.SLComment
: Single-line commentsMLComment
: Multi-line comments
License
This project is licensed under the MIT License - see the LICENSE file for details.
Dependencies
~355KB