1 stable release

1.1.0 Oct 30, 2023
1.0.13 Oct 29, 2023

Used in chisel-parsers

MIT/Apache

chisel-lexers

Overview

This crate contains the lexical analysis backends used by chisel. The basic architecture is simple: multiple lexers share a common scanning implementation.

A scanner consumes characters from an underlying source of characters, and keeps track of the position where the character was read. It also provides some basic buffering and lookahead/pushback functionality.

It's always assumed that input is read linearly and can only be read once, from start to finish.

A lexer consumes from a scanner, and attempts to construct tokens which may be consumed by parsers further up the stack.

Each lexer defines and produces its own set of distinct tokens, specific to the parsing task at hand (for example, the JSON lexer produces only JSON-specific tokens).
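To make the scanner/lexer split concrete, here is a minimal, self-contained sketch of a lexer with its own token set. It does not use this crate's API: `Token`, `ToyLexer` and `lex_all` are hypothetical names, and the standard library's `Peekable<Chars>` stands in for the scanner's single-pass input with one character of lookahead.

```rust
use std::iter::Peekable;
use std::str::Chars;

/// Hypothetical token set for a tiny bracketed-number language.
/// These are illustrative only and are not the crate's JSON tokens.
#[derive(Debug, PartialEq)]
enum Token {
    LeftBracket,
    RightBracket,
    Comma,
    Number(u64),
}

/// A toy lexer that pulls characters from a single-pass character source.
struct ToyLexer<'a> {
    chars: Peekable<Chars<'a>>,
}

impl<'a> ToyLexer<'a> {
    fn new(input: &'a str) -> Self {
        ToyLexer { chars: input.chars().peekable() }
    }

    /// Produce the next token, `None` at end of input, or `Err` carrying
    /// the first character that does not start a valid token.
    fn next_token(&mut self) -> Option<Result<Token, char>> {
        // skip whitespace between tokens
        while self.chars.peek().map_or(false, |c| c.is_whitespace()) {
            self.chars.next();
        }
        let ch = self.chars.next()?;
        Some(match ch {
            '[' => Ok(Token::LeftBracket),
            ']' => Ok(Token::RightBracket),
            ',' => Ok(Token::Comma),
            '0'..='9' => {
                // gather digits, peeking one character ahead each time;
                // saturating arithmetic avoids overflow on very long input
                let mut value = ch.to_digit(10).unwrap() as u64;
                while let Some(d) = self.chars.peek().and_then(|c| c.to_digit(10)) {
                    value = value.saturating_mul(10).saturating_add(d as u64);
                    self.chars.next();
                }
                Ok(Token::Number(value))
            }
            other => Err(other),
        })
    }
}

/// Lex the whole input, stopping at the first unrecognised character.
fn lex_all(input: &str) -> Result<Vec<Token>, char> {
    let mut lexer = ToyLexer::new(input);
    let mut tokens = Vec::new();
    while let Some(result) = lexer.next_token() {
        tokens.push(result?);
    }
    Ok(tokens)
}
```

Calling `lex_all("[1, 22]")` yields five tokens (two brackets, two numbers and a comma), while `lex_all("[x]")` stops at the unrecognised `x`. A parser further up the stack would consume tokens in the same pull-based way.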

Scanning the input

The scanner operates by maintaining a simple internal state:

  • A current position in the input
  • An input buffer used to control pushbacks and lookaheads
  • An accumulation buffer for gathering up characters

A lexer simply pulls characters through the scanner (which adds positional information to each one) and gathers them up within the accumulation buffer until it sees something that triggers the parse of a valid token.

Once the lexer is ready to consume all the content in the accumulation buffer, functions are provided to extract the contents of the buffer in a number of formats (e.g. a string or char array) and to then clear the buffer without resetting all the internal scanner state.
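The three pieces of state described above can be modelled in a few lines. The following is a simplified sketch only, not the crate's implementation: `ToyScanner`, `Positioned` and their methods are hypothetical names, and only the column is tracked as positional information.

```rust
use std::collections::VecDeque;

/// A character tagged with the (1-based) column it was read from.
#[derive(Debug, Clone, Copy, PartialEq)]
struct Positioned {
    ch: char,
    column: usize,
}

/// Toy scanner with hypothetical names; not the chisel-lexers Scanner.
struct ToyScanner<I: Iterator<Item = char>> {
    source: I,                     // single-pass character source
    column: usize,                 // current position in the input
    pending: VecDeque<Positioned>, // input buffer holding pushed-back characters
    accumulated: Vec<Positioned>,  // accumulation buffer
}

impl<I: Iterator<Item = char>> ToyScanner<I> {
    fn new(source: I) -> Self {
        ToyScanner {
            source,
            column: 0,
            pending: VecDeque::new(),
            accumulated: Vec::new(),
        }
    }

    /// Read one character into the accumulation buffer, taking it from
    /// the pushback buffer first and from the source otherwise.
    fn advance(&mut self) -> Option<Positioned> {
        let next = match self.pending.pop_front() {
            Some(p) => p,
            None => {
                let ch = self.source.next()?;
                self.column += 1;
                Positioned { ch, column: self.column }
            }
        };
        self.accumulated.push(next);
        Some(next)
    }

    /// Move the most recently accumulated character back onto the input buffer.
    fn pushback(&mut self) {
        if let Some(p) = self.accumulated.pop() {
            self.pending.push_front(p);
        }
    }

    /// Extract the accumulation buffer as a string.
    fn contents(&self) -> String {
        self.accumulated.iter().map(|p| p.ch).collect()
    }

    /// Empty the accumulation buffer; position and pushed-back input are kept.
    fn clear(&mut self) {
        self.accumulated.clear();
    }
}
```

Because `clear` leaves `column` and `pending` untouched, the next character read after clearing still reports its true position in the input.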

A simple example of using the scanner is shown below:

use std::io::BufReader;
use chisel_common::reader_from_bytes;
use chisel_decoders::utf8::Utf8Decoder;
use chisel_lexers::scanner::Scanner;

// construct a new scanner instance, based on a decoded byte source
let buffer: &[u8] = "let goodly sin and sunshine in".as_bytes();
let mut reader = BufReader::new(buffer);
let mut decoder = Utf8Decoder::new(&mut reader);
let mut scanner = Scanner::new(&mut decoder);

// consume the first character from the scanner...
let first = scanner.advance(true);
assert!(first.is_ok());
assert_eq!(scanner.front().unwrap().ch, 'l');
assert_eq!(scanner.front().unwrap().coords.column, 1);

// consume a second character
assert!(scanner.advance(true).is_ok());

// ...and then pushback onto the buffer
scanner.pushback();

// front of the buffer should still be 'l'
assert_eq!(scanner.front().unwrap().ch, 'l');

// advance again - this time char will be taken from the pushback buffer
let _ = scanner.advance(true);
assert_eq!(scanner.front().unwrap().ch, 'e');

// grab the contents of the buffer as a string
let buffer_contents = scanner.buffer_as_string_with_span();
assert_eq!(buffer_contents.str, String::from("le"));

// reset the scanner and empty the buffer
scanner.clear();

// buffer should now be empty
assert!(scanner.buffer_as_string_with_span().str.is_empty());

// advance yet again
assert!(scanner.advance(true).is_ok());

// the third character read will be from the 3rd column in the input
assert_eq!(scanner.front().unwrap().ch, 't');
assert_eq!(scanner.front().unwrap().coords.column, 3);


Lexers

Within the current release, only a single lexer backend is implemented within this crate:

JSON Lexer

Building and Testing

What                  Command
Build crate           cargo build
Test crate            cargo test
Run all benchmarks    cargo bench

This crate contains two specific benchmarks which may be run in isolation:

Description                              Command
JSON lexing (tokenisation) benchmarks    cargo bench --bench json_lexer
Scanning (char munching) benchmarks      cargo bench --bench scanner

Suggestions and Requests

If you have any suggestions, requests, or comments relating to this crate, please open an issue and I'll try to take a look when I get a chance. Feel free to fork this repo if you want to use or modify this code in your own work.
