5 releases
Uses old Rust 2015
Version | Released |
---|---|
0.1.4 | Nov 8, 2020 |
0.1.3 | Oct 22, 2020 |
0.1.2 | Mar 15, 2018 |
0.1.1 | Mar 15, 2018 |
0.1.0 | Dec 14, 2016 |
scanlex - a simple lexical scanner.
The Problem of Input
It is easier to write things out than to read them in, since more things can go wrong. The read may fail, the text may not be valid UTF-8, the number may be malformed or simply out of range.
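The failure modes listed above can be made concrete with nothing but the standard library. The following is a minimal sketch, not part of scanlex: the `InputError` enum and the `read_i32` function are hypothetical names chosen for illustration.

```rust
use std::num::IntErrorKind;

/// Hypothetical error type: the ways reading one integer from raw bytes can fail.
#[derive(Debug, PartialEq)]
enum InputError {
    NotUtf8,
    Malformed,
    OutOfRange,
}

/// Decode `bytes` as UTF-8, then parse the trimmed text as an i32.
fn read_i32(bytes: &[u8]) -> Result<i32, InputError> {
    // Step 1: the bytes may not be valid UTF-8.
    let text = std::str::from_utf8(bytes).map_err(|_| InputError::NotUtf8)?;
    // Step 2: the text may not be a number, or may not fit in an i32.
    text.trim().parse::<i32>().map_err(|e| match e.kind() {
        IntErrorKind::PosOverflow | IntErrorKind::NegOverflow => InputError::OutOfRange,
        _ => InputError::Malformed,
    })
}
```

Calling `read_i32(b" 42 ")` succeeds, while invalid bytes, text such as `4x2`, and a value too large for `i32` each produce a different error variant. A scanner has to surface all of these cases to its caller in some form.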
Lexical Scanners
Lexical scanners split a stream of characters into tokens.
Tokens are returned by repeatedly calling the get method of Scanner (which will return Token::End if no tokens are left) or by iterating over the scanner. They represent numbers, characters, identifiers, or single/double quoted strings. There is also Token::Error to indicate a badly formed token.
This lexical scanner makes some assumptions, for example that a number may not be directly followed by a letter. No attempt is made in this version to decode C-style escape codes in strings. All whitespace is ignored. It's intended for processing generic structured data, rather than code.
For example, the string "hello 'dolly' * 42" will be broken into four tokens:
- an identifier 'hello'
- a quoted string 'dolly'
- a character '*'
- and a number 42
extern crate scanlex;
use scanlex::{Scanner,Token};
let mut scan = Scanner::new("hello 'dolly' * 42");
assert_eq!(scan.get(),Token::Iden("hello".into()));
assert_eq!(scan.get(),Token::Str("dolly".into()));
assert_eq!(scan.get(),Token::Char('*'));
assert_eq!(scan.get(),Token::Int(42));
assert_eq!(scan.get(),Token::End);
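To show what splitting a character stream into tokens involves, here is a minimal std-only sketch of a scanner. It is not how scanlex is implemented: the `Tok` enum and `tokenize` function are hypothetical, floats are left out, and the rule that a number may not be followed by a letter is not enforced.

```rust
/// Hypothetical token type, loosely mirroring the categories described above.
#[derive(Debug, PartialEq)]
enum Tok {
    Iden(String),
    Str(String),
    Int(i64),
    Char(char),
}

/// Split `text` into tokens, skipping all whitespace.
/// Simplifications: ASCII identifiers only, no floats, no escape codes,
/// and an unterminated quote simply runs to the end of the input.
fn tokenize(text: &str) -> Result<Vec<Tok>, String> {
    let chars: Vec<char> = text.chars().collect();
    let mut toks = Vec::new();
    let mut i = 0;
    while i < chars.len() {
        let c = chars[i];
        if c.is_whitespace() {
            i += 1;
        } else if c.is_ascii_alphabetic() || c == '_' {
            // Identifier: a letter or underscore, then letters, digits, underscores.
            let start = i;
            while i < chars.len() && (chars[i].is_ascii_alphanumeric() || chars[i] == '_') {
                i += 1;
            }
            toks.push(Tok::Iden(chars[start..i].iter().collect()));
        } else if c.is_ascii_digit() {
            // Integer: a run of digits; report overflow instead of panicking.
            let start = i;
            while i < chars.len() && chars[i].is_ascii_digit() {
                i += 1;
            }
            let digits: String = chars[start..i].iter().collect();
            let n = digits
                .parse::<i64>()
                .map_err(|e| format!("bad integer '{}': {}", digits, e))?;
            toks.push(Tok::Int(n));
        } else if c == '\'' || c == '"' {
            // Quoted string: everything up to the matching quote character.
            let start = i + 1;
            i = start;
            while i < chars.len() && chars[i] != c {
                i += 1;
            }
            toks.push(Tok::Str(chars[start..i].iter().collect()));
            i += 1; // step past the closing quote, if there was one
        } else {
            // Anything else is a single-character token.
            toks.push(Tok::Char(c));
            i += 1;
        }
    }
    Ok(toks)
}
```

Running `tokenize("hello 'dolly' * 42")` on this sketch yields an identifier, a string, a character and an integer, in that order.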
To extract the values, use code like this:
let greeting = scan.get_iden()?;
let person = scan.get_string()?;
let op = scan.get_char()?;
let answer = scan.get_integer()?; // i64
Scanner implements Iterator. If you just want to extract the words from a string, then filtering with as_iden will do the trick, since it returns Option<String>.
let s = Scanner::new("bonzo 42 dog (cat)");
let v: Vec<_> = s.filter_map(|t| t.as_iden()).collect();
assert_eq!(v,&["bonzo","dog","cat"]);
Using as_number instead, you can use this strategy to extract all the numbers out of a document, ignoring all other structure. The scan.rs example shows you the tokens that would be generated by parsing the string given on the command line. This iterator only stops at Token::End; you can handle Token::Error yourself.
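The same "keep what converts, drop the rest" strategy can be sketched with the standard library alone. This is a crude illustration under stated assumptions, not the crate's behaviour: `extract_numbers` is a hypothetical helper that splits on any character that cannot be part of a simple decimal number and keeps only the pieces that parse as `f64`.

```rust
/// Hypothetical helper: pull every simple decimal number out of `text`,
/// ignoring all other structure. Pieces such as "1-2" or a lone "." do
/// not parse and are silently dropped.
fn extract_numbers(text: &str) -> Vec<f64> {
    text.split(|c: char| !(c.is_ascii_digit() || c == '.' || c == '-'))
        // filter_map keeps only the pieces for which parsing succeeded.
        .filter_map(|piece| piece.parse::<f64>().ok())
        .collect()
}
```

As with the token iterator, `filter_map` does the work: each candidate either converts to a value or disappears from the output.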
Usually it's important not to ignore structure. Say we have input strings that look like this "(WORD) = NUMBER":
scan.skip_chars("(")?;
let word = scan.get_iden()?;
scan.skip_chars(")=")?;
let num = scan.get_number()?;
Any of these calls may fail!
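To illustrate why every step can fail, here is a std-only sketch that parses the same "(WORD) = NUMBER" shape by hand. The function `parse_assignment` and its error messages are hypothetical and do not reflect scanlex's own error type; the point is only that each step returns a `Result` and `?` propagates the first failure.

```rust
/// Hypothetical parser for input shaped like "(WORD) = NUMBER".
fn parse_assignment(line: &str) -> Result<(String, f64), String> {
    // Step 1: the opening parenthesis must be present.
    let rest = line
        .trim()
        .strip_prefix('(')
        .ok_or_else(|| "expected '('".to_string())?;
    // Step 2: the word runs up to the closing parenthesis.
    let (word, rest) = rest
        .split_once(')')
        .ok_or_else(|| "expected ')'".to_string())?;
    let word = word.trim();
    if word.is_empty() || !word.chars().all(|c| c.is_alphanumeric() || c == '_') {
        return Err(format!("bad word '{}'", word));
    }
    // Step 3: an equals sign must follow.
    let number = rest
        .trim_start()
        .strip_prefix('=')
        .ok_or_else(|| "expected '='".to_string())?
        .trim();
    // Step 4: what remains must be a number.
    let value = number
        .parse::<f64>()
        .map_err(|e| format!("bad number '{}': {}", number, e))?;
    Ok((word.to_string(), value))
}
```

A well-formed line such as `(width) = 2.5` returns the word and the value; a missing parenthesis, a missing `=`, or a non-numeric right-hand side each stop the parse with an error at the step that failed.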
It is a common pattern to create a scanner for each line of text read from a readable source. The scanline.rs example shows how to use ScanLines to accomplish this.
let f = File::open("scanline.rs").expect("cannot open scanline.rs");
let mut iter = ScanLines::new(&f);
while let Some(s) = iter.next() {
    let mut s = s.expect("cannot read line");
    // show the first token of each line
    println!("{:?}",s.get());
}
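The one-scanner-per-line pattern can also be sketched without the crate, using `BufRead::lines` from the standard library. This is an illustration only: `first_words` is a hypothetical function, and "first whitespace-separated word" stands in for "first token".

```rust
use std::io::{self, BufRead};

/// Hypothetical helper: return the first whitespace-separated word of every
/// line (None for blank lines), or an I/O error if a line cannot be read,
/// for example because it is not valid UTF-8.
fn first_words<R: BufRead>(reader: R) -> io::Result<Vec<Option<String>>> {
    let mut firsts = Vec::new();
    for line in reader.lines() {
        let line = line?; // reading any individual line may fail
        firsts.push(line.split_whitespace().next().map(|w| w.to_string()));
    }
    Ok(firsts)
}
```

Because `&[u8]` implements `BufRead`, the function can be tried on in-memory data such as `"alpha 1\n\n  beta 2\n".as_bytes()` before pointing it at a `BufReader` wrapped around a file.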
A more serious example (taken from the tests) is parsing JSON:
type JsonArray = Vec<Box<Value>>;
type JsonObject = HashMap<String,Box<Value>>;

#[derive(Debug, Clone, PartialEq)]
pub enum Value {
    Str(String),
    Num(f64),
    Bool(bool),
    Arr(JsonArray),
    Obj(JsonObject),
    Null
}

fn scan_json(scan: &mut Scanner) -> Result<Value,ScanError> {
    use Value::*;
    match scan.get() {
        Token::Str(s) => Ok(Str(s)),
        Token::Num(x) => Ok(Num(x)),
        Token::Int(n) => Ok(Num(n as f64)),
        Token::End => Err(scan.scan_error("unexpected end of input",None)),
        Token::Error(e) => Err(e),
        Token::Iden(s) =>
            if s == "null" {Ok(Null)}
            else if s == "true" {Ok(Bool(true))}
            else if s == "false" {Ok(Bool(false))}
            else {Err(scan.scan_error(&format!("unknown identifier '{}'",s),None))},
        Token::Char(c) =>
            if c == '[' {
                let mut ja = Vec::new();
                let mut ch = c;
                while ch != ']' {
                    let o = scan_json(scan)?;
                    ch = scan.get_ch_matching(&[',',']'])?;
                    ja.push(Box::new(o));
                }
                Ok(Arr(ja))
            } else
            if c == '{' {
                let mut jo = HashMap::new();
                let mut ch = c;
                while ch != '}' {
                    let key = scan.get_string()?;
                    scan.get_ch_matching(&[':'])?;
                    let o = scan_json(scan)?;
                    ch = scan.get_ch_matching(&[',','}'])?;
                    jo.insert(key,Box::new(o));
                }
                Ok(Obj(jo))
            } else {
                Err(scan.scan_error(&format!("bad char '{}'",c),None))
            }
    }
}
(This is of course an Illustrative Example. JSON is a solved problem.)
Options
With no_float you get a barebones parser that does not recognize floats, just integers, strings, chars and identifiers. This is useful if the existing rules are too strict: e.g. "2d" is fine in no_float mode, but an error in the default mode. chrono-english uses this mode to parse date expressions.
With line_comment you provide a character; after this character, the rest of the current line will be ignored.
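The effect of the two options can be approximated with a std-only sketch. Both helpers are hypothetical and are not the crate's implementation: `split_int_suffix` assumes that text such as "2d" is meant to be read as an integer followed by some remaining text, and `strip_line_comment` ignores the question of comment characters inside quoted strings.

```rust
/// Hypothetical helper: drop everything from the first `marker` character
/// to the end of the line. A marker inside a quoted string is NOT treated
/// specially here.
fn strip_line_comment(line: &str, marker: char) -> &str {
    match line.find(marker) {
        Some(pos) => &line[..pos],
        None => line,
    }
}

/// Hypothetical helper: split text such as "2d" into its leading integer
/// and the remaining suffix. Returns None when there are no leading digits
/// or the digits do not fit in an i64.
fn split_int_suffix(text: &str) -> Option<(i64, &str)> {
    let end = text
        .find(|c: char| !c.is_ascii_digit())
        .unwrap_or(text.len());
    let (digits, suffix) = text.split_at(end);
    digits.parse::<i64>().ok().map(|n| (n, suffix))
}
```

With these, `split_int_suffix("2d")` gives the integer 2 and the suffix `d`, which is the kind of input a date-expression parser needs to accept, and `strip_line_comment("width 10 # pixels", '#')` keeps only the text before the `#`.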