#input #tokenize #scan #text #text-parser

scanlex

a simple lexical scanner for parsing text into tokens

5 releases

Uses old Rust 2015

0.1.4 Nov 8, 2020
0.1.3 Oct 22, 2020
0.1.2 Mar 15, 2018
0.1.1 Mar 15, 2018
0.1.0 Dec 14, 2016

#452 in Text processing

14,469 downloads per month
Used in 35 crates (via chrono-english)

MIT license

34KB
689 lines

scanlex - a simple lexical scanner.

The Problem of Input

It is easier to write things out than to read them in, since more things can go wrong. The read may fail, the text may not be valid UTF-8, the number may be malformed or simply out of range.

Lexical Scanners

Lexical scanners split a stream of characters into tokens. Tokens are returned by repeatedly calling the get method of Scanner (which returns Token::End when no tokens are left) or by iterating over the scanner. They represent numbers, characters, identifiers, or single/double quoted strings. There is also Token::Error to indicate a badly formed token.

This lexical scanner makes some assumptions, such as that a number may not be directly followed by a letter. No attempt is made in this version to decode C-style escape codes in strings. All whitespace is ignored. It's intended for processing generic structured data rather than code.

For example, the string "hello 'dolly' * 42" will be broken into four tokens:

  • an identifier 'hello'
  • a quoted string 'dolly'
  • a character '*'
  • and a number 42
extern crate scanlex;
use scanlex::{Scanner,Token};

let mut scan = Scanner::new("hello 'dolly' * 42");
assert_eq!(scan.get(),Token::Iden("hello".into()));
assert_eq!(scan.get(),Token::Str("dolly".into()));
assert_eq!(scan.get(),Token::Char('*'));
assert_eq!(scan.get(),Token::Int(42));
assert_eq!(scan.get(),Token::End);

To extract the values, use code like this:

let greeting = scan.get_iden()?;
let person = scan.get_string()?;
let op = scan.get_char()?;
let answer = scan.get_integer()?; // i64

Scanner implements Iterator. If you just wanted to extract the words from a string, then filtering with as_iden will do the trick, since it returns Option<String>.

let s = Scanner::new("bonzo 42 dog (cat)");
let v: Vec<_> = s.filter_map(|t| t.as_iden()).collect();
assert_eq!(v,&["bonzo","dog","cat"]);

Using as_number instead, you can use this strategy to extract all the numbers out of a document, ignoring all other structure. The scan.rs example shows you the tokens that would be generated by parsing the given string on the command-line.

This iterator only stops at Token::End - you can handle Token::Error yourself.
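For example, here is a minimal sketch of that number-extraction strategy, assuming as_number yields Option<f64> for both Int and Num tokens (the input string is made up for illustration):

// keep only the numeric tokens, converted to f64; words and punctuation are dropped
let s = Scanner::new("width 10, height 2.5 (scale 0.5)");
let nums: Vec<f64> = s.filter_map(|t| t.as_number()).collect();
assert_eq!(nums, &[10.0, 2.5, 0.5]);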

Usually it's important not to ignore structure. Say we have input strings that look like this "(WORD) = NUMBER":

	scan.skip_chars("(")?;
	let word = scan.get_iden()?;
	scan.skip_chars(")=")?;
	let num = scan.get_number()?;

Any of these calls may fail!
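Wrapping the calls in a function that returns Result lets ? propagate whichever failure occurs. This is a minimal sketch, assuming these methods all return Result<_, ScanError>; parse_assignment and its test inputs are made up for illustration:

use scanlex::{Scanner, ScanError};

// parse one "(WORD) = NUMBER" line; any missing bracket, '=', or malformed
// number surfaces as a ScanError via the ? operator
fn parse_assignment(line: &str) -> Result<(String, f64), ScanError> {
    let mut scan = Scanner::new(line);
    scan.skip_chars("(")?;
    let word = scan.get_iden()?;
    scan.skip_chars(")=")?;
    let num = scan.get_number()?;
    Ok((word, num))
}

assert_eq!(parse_assignment("(alpha) = 3.5").unwrap(), ("alpha".to_string(), 3.5));
assert!(parse_assignment("alpha = 3.5").is_err());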

It is a common pattern to create a scanner for each line of text read from a readable source. The scanline.rs example shows how to use ScanLines to accomplish this.

    use std::fs::File;
    use scanlex::ScanLines;

    let f = File::open("scanline.rs").expect("cannot open scanline.rs");
    let mut iter = ScanLines::new(&f);
    while let Some(s) = iter.next() {
        let mut s = s.expect("cannot read line");
        // show the first token of each line
        println!("{:?}",s.get());
    }

A more serious example (taken from the tests) is parsing JSON:

use std::collections::HashMap;
use scanlex::{Scanner, ScanError, Token};

type JsonArray = Vec<Box<Value>>;
type JsonObject = HashMap<String,Box<Value>>;

#[derive(Debug, Clone, PartialEq)]
pub enum Value {
   Str(String),
   Num(f64),
   Bool(bool),
   Arr(JsonArray),
   Obj(JsonObject),
   Null
}

fn scan_json(scan: &mut Scanner) -> Result<Value,ScanError> {
    use Value::*;
    match scan.get() {
        Token::Str(s) => Ok(Str(s)),
        Token::Num(x) => Ok(Num(x)),
        Token::Int(n) => Ok(Num(n as f64)),
        Token::End => Err(scan.scan_error("unexpected end of input",None)),
        Token::Error(e) => Err(e),
        Token::Iden(s) =>
            if s == "null"    {Ok(Null)}
            else if s == "true" {Ok(Bool(true))}
            else if s == "false" {Ok(Bool(false))}
            else {Err(scan.scan_error(&format!("unknown identifier '{}'",s),None))},
        Token::Char(c) =>
            if c == '[' {
                let mut ja = Vec::new();
                let mut ch = c;
                while ch != ']' {
                    let o = scan_json(scan)?;
                    ch = scan.get_ch_matching(&[',',']'])?;
                    ja.push(Box::new(o));
                }
                Ok(Arr(ja))
            } else
            if c == '{' {
                let mut jo = HashMap::new();
                let mut ch = c;
                while ch != '}' {
                    let key = scan.get_string()?;
                    scan.get_ch_matching(&[':'])?;
                    let o = scan_json(scan)?;
                    ch = scan.get_ch_matching(&[',','}'])?;
                    jo.insert(key,Box::new(o));
                }
                Ok(Obj(jo))
            } else {
                Err(scan.scan_error(&format!("bad char '{}'",c),None))
            }
    }
}

(This is of course an Illustrative Example. JSON is a solved problem.)

Options

With no_float you get a barebones parser that does not recognize floats, just integers, strings, chars and identifiers. This is useful if the existing rules are too strict - e.g. "2d" is fine in no_float mode, but an error in the default mode. chrono-english uses this mode to parse date expressions.

With line_comment you provide a character; after this character, the rest of the current line will be ignored.
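A minimal sketch of both options together, assuming they are builder-style methods on Scanner and that in no_float mode "2d" scans as the integer 2 followed by the identifier "d":

use scanlex::{Scanner, Token};

// assumed builder-style configuration: disable float parsing and treat '#' as
// the start of a line comment
let mut scan = Scanner::new("2d # the rest of this line is ignored")
    .no_float()
    .line_comment('#');
assert_eq!(scan.get(), Token::Int(2));
assert_eq!(scan.get(), Token::Iden("d".into()));
assert_eq!(scan.get(), Token::End);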

No runtime deps