6 releases (1 stable)

1.0.0 Oct 27, 2024
0.1.4 Oct 22, 2024
0.1.2 May 24, 2022

MIT license

Installation

[dependencies]
parsable = "0.1"

Example

A basic arithmetic interpreter that only handles non-negative integers and ignores operator precedence: operations are evaluated strictly from left to right.

use parsable::{parsable, Parsable, ParseOptions};

#[parsable]
enum Operator {
    Plus = "+",
    Minus = "-",
    Mult = "*",
    Div = "/",
    Mod = "%"
}

#[parsable]
struct NumberLiteral {
    #[parsable(regex=r"\d+")]
    value: String
}

impl NumberLiteral {
    fn process(&self) -> i32 {
        self.value.parse().unwrap()
    }
}

#[parsable]
enum Operand {
    Number(NumberLiteral),
    Wrapped(WrappedOperation)
}

impl Operand {
    fn process(&self) -> i32 {
        match self {
            Operand::Number(number) => number.process(),
            Operand::Wrapped(wrapped) => wrapped.process(),
        }
    }
}

#[parsable]
struct Operation {
    first_operand: Operand,
    other_operands: Vec<(Operator, Operand)>
}

impl Operation {
    fn process(&self) -> i32 {
        let mut result = self.first_operand.process();

        for (operator, operand) in &self.other_operands {
            let value = operand.process();

            result = match operator {
                Operator::Plus => result + value,
                Operator::Minus => result - value,
                Operator::Mult => result * value,
                Operator::Div => result / value,
                Operator::Mod => result % value,
            }
        }

        result
    }
}

#[parsable]
struct WrappedOperation {
    #[parsable(brackets="()")]
    operation: Box<Operation>
}

impl WrappedOperation {
    fn process(&self) -> i32 {
        self.operation.process()
    }
}

fn main() {
    let operation_string = "3 + (4 * 5)".to_string();
    let parse_options = ParseOptions::default();
    
    match Operation::parse(operation_string, parse_options) {
        Ok(operation) => {
            println!("result: {}", operation.process());
        },
        Err(error) => {
            dbg!(error);
        }
    }
}

The #[parsable] macro

Tagging a struct or enum with the #[parsable] macro implements the Parsable trait for the item, provided that all of its fields also implement the Parsable trait.

It can also be applied to a field to tweak the way that field is parsed.

Struct

  • All fields are parsed one after the other. The parsing is only successful if all fields are successfully parsed.

Enum

  • The parsing stops on the first variant that is successfully parsed.
  • If a variant contains multiple fields, they are parsed successively and must all be successful for the variant to be matched.
  • If a variant contains no field, a string must be specified to indicate how to parse it.

#[parsable]
enum MyOperation {
    BinaryOperation(NumberLiteral, Operator, NumberLiteral),
    Number(NumberLiteral),
    Zero = "zero"
}

// If the first two variants are swapped, the parsing will never reach the `BinaryOperation` variant.
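
The ordering rule can be illustrated without the crate. The following is a minimal hand-written sketch of ordered choice over plain &str values; Parsed, parse_number, parse_binary and the two parse_*_first functions are hypothetical names that are not part of parsable, and whitespace is not skipped.

```rust
/// Hypothetical result type loosely mirroring the `MyOperation` variants.
#[derive(Debug, PartialEq)]
enum Parsed {
    Binary(u32, char, u32),
    Number(u32),
}

/// Reads leading ASCII digits; returns the value and the remaining input.
fn parse_number(input: &str) -> Option<(u32, &str)> {
    let end = input
        .find(|c: char| !c.is_ascii_digit())
        .unwrap_or(input.len());
    if end == 0 {
        return None;
    }
    let value: u32 = input[..end].parse().ok()?;
    Some((value, &input[end..]))
}

/// Reads `<number><operator><number>` with no whitespace handling.
fn parse_binary(input: &str) -> Option<(Parsed, &str)> {
    let (left, rest) = parse_number(input)?;
    let operator = rest.chars().next().filter(|c| "+-*/%".contains(*c))?;
    let (right, rest) = parse_number(&rest[operator.len_utf8()..])?;
    Some((Parsed::Binary(left, operator, right), rest))
}

/// Ordered choice: the first alternative that succeeds wins.
fn parse_binary_first(input: &str) -> Option<(Parsed, &str)> {
    parse_binary(input)
        .or_else(|| parse_number(input).map(|(n, rest)| (Parsed::Number(n), rest)))
}

/// Same alternatives, swapped: `Number` matches first and hides `Binary`.
fn parse_number_first(input: &str) -> Option<(Parsed, &str)> {
    parse_number(input)
        .map(|(n, rest)| (Parsed::Number(n), rest))
        .or_else(|| parse_binary(input))
}
```

With the input "1+2", parse_binary_first consumes the whole expression, whereas parse_number_first stops after the leading 1 and leaves "+2" unread.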

Builtin types

String

A string field must be tagged with the #[parsable(regex="<pattern>")] or #[parsable(value="<string>")] macro option to specify how to parse it.

// Matches at least one digit
#[parsable]
struct NumberLiteral {
    #[parsable(regex=r"\d+")]
    value: String
}

// Only matches the string "+"
#[parsable]
struct PlusSign {
    #[parsable(value="+")]
    value: String
}

Option<T>

Matches T. If it fails, returns None but the parsing of the field is still considered successful.

#[parsable]
enum Sign {
    Plus = "+",
    Minus = "-"
}

// Matches a number with an optional sign.
#[parsable]
struct NumberLiteral {
    sign: Option<Sign>,
    #[parsable(regex=r"\d+")]
    value: String
}
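
As a rough illustration of the idea (not the crate's implementation), the hand-written sketch below reads an optional sign followed by mandatory digits. The function name read_signed is hypothetical. A missing sign yields None for that part without failing the whole parse, while missing digits do fail it.

```rust
/// Reads an optional leading sign, then at least one ASCII digit.
/// Returns the sign (if any) and the digit text.
fn read_signed(input: &str) -> Option<(Option<char>, &str)> {
    // Optional part: absence is not an error, it simply yields `None`.
    let sign = input.chars().next().filter(|c| *c == '+' || *c == '-');
    let unsigned = &input[sign.map_or(0, char::len_utf8)..];

    // Mandatory part: at least one digit must follow.
    let end = unsigned
        .find(|c: char| !c.is_ascii_digit())
        .unwrap_or(unsigned.len());
    if end == 0 {
        return None;
    }
    Some((sign, &unsigned[..end]))
}
```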

Vec<T>

Matches as many T as possible successively. The following options can be specified:

  • min=X: the parsing is only valid if at least X items are parsed
  • separator=<string>: after each item, the parser will attempt to consume the separator. The parsing fails if no separator is found.

// Matches a non-empty list of numbers separated by a comma
#[parsable]
struct NumberList {
    #[parsable(separator=",", min=1)]
    numbers: Vec<NumberLiteral>
}
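
The interaction between min and separator can be sketched by hand. The code below is an illustration only and does not use the crate: read_digits and read_list are hypothetical helpers, and the handling of a trailing separator (it is left unconsumed here) is an assumption, not documented behaviour of parsable.

```rust
/// Reads leading ASCII digits; returns the matched text and the remaining input.
fn read_digits(input: &str) -> Option<(&str, &str)> {
    let end = input
        .find(|c: char| !c.is_ascii_digit())
        .unwrap_or(input.len());
    if end == 0 {
        None
    } else {
        Some(input.split_at(end))
    }
}

/// Collects digit runs separated by `separator`; fails if fewer than `min` are found.
/// Assumption: a separator that is not followed by an item is left unconsumed.
fn read_list<'a>(
    input: &'a str,
    separator: &str,
    min: usize,
) -> Option<(Vec<&'a str>, &'a str)> {
    let mut items = Vec::new();
    let mut rest = input;

    while let Some((item, after_item)) = read_digits(rest) {
        items.push(item);
        rest = after_item;
        // Only continue if a separator is followed by another item.
        match after_item.strip_prefix(separator) {
            Some(after_sep) if read_digits(after_sep).is_some() => rest = after_sep,
            _ => break,
        }
    }

    if items.len() >= min {
        Some((items, rest))
    } else {
        None
    }
}
```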

Other types

  • (): matches nothing, is always successful.
  • (T, U): matches T, then U.
  • Box<T>: matches T.

Running the parser

The Parsable trait provides the parse() method that takes two arguments:

  • content: String: the string to parse
  • options: ParseOptions: parse options

The ParseOptions type has the following fields:

  • comment_start: Option<&'static str>: when the specified pattern is matched, the rest of the line is ignored. Common instances are "//" or "#".
  • file_path: Option<String>: file path of the string being parsed.
  • package_root_path: Option<String>: root path of package or module containing the file being parsed.

The file_path and package_root_path fields are forwarded to the FileInfo struct and are never actually used by the library.

Blank characters (spaces, new lines and tabulations) are always ignored during parsing.
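
To make the skipping rule concrete, here is a minimal sketch of how blanks and line comments could be skipped. It is not the crate's code: skip_blanks is a hypothetical function, and it assumes that index is a valid character boundary inside content.

```rust
/// Returns the index of the next significant byte at or after `index`,
/// skipping spaces, tabs, newlines and, if `comment_start` is set,
/// everything from a comment marker to the end of its line.
fn skip_blanks(content: &str, mut index: usize, comment_start: Option<&str>) -> usize {
    loop {
        let rest = &content[index..];
        let trimmed = rest.trim_start_matches(|c: char| c == ' ' || c == '\t' || c == '\n');
        index += rest.len() - trimmed.len();

        match comment_start {
            Some(marker) if !marker.is_empty() && trimmed.starts_with(marker) => {
                // Jump to the end of the line (or of the input), then loop
                // again to skip the blanks that follow the comment.
                index += trimmed.find('\n').unwrap_or(trimmed.len());
            }
            _ => return index,
        }
    }
}
```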

FileInfo

The FileInfo structure is used across the library. It has the following fields:

  • content: String: the string being parsed
  • path: String: the path of the file being parsed, as specified in ParseOptions
  • package_root_path: String: the path of the package containing the file, as specified in ParseOptions

It also provides the following methods:

  • get_line_col(index: usize) -> Option<(usize, usize)>: returns the line and column numbers (starting at 1) associated with the specified character index. This method assumes one byte per character and therefore does not work properly when the file contains non-ASCII characters.
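
The byte-based computation described above can be sketched as follows. This is an independent illustration, not the crate's implementation: line_col is a hypothetical free function, and returning None for an index past the end of the content is an assumption.

```rust
/// Returns the 1-based (line, column) of a byte index, or `None` if the
/// index is past the end of `content`. One column is counted per byte,
/// so the result is only accurate for ASCII content.
fn line_col(content: &str, index: usize) -> Option<(usize, usize)> {
    if index > content.len() {
        return None;
    }
    let before = &content.as_bytes()[..index];
    // The line number is the count of newlines before the index, plus one.
    let line = before.iter().filter(|&&b| b == b'\n').count() + 1;
    // The column is the distance from the byte after the last newline.
    let line_start = before
        .iter()
        .rposition(|&b| b == b'\n')
        .map_or(0, |position| position + 1);
    Some((line, index - line_start + 1))
}
```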

ItemLocation

Tagging a struct with #[parsable] adds a location field of type ItemLocation, which has the following fields and methods:

  • file: Rc<FileInfo>: information on the file containing the item
  • start: usize: starting index of the item in the file
  • end: usize: ending index of the item in the file
  • get_start_line_col() -> (usize, usize): get the line and column numbers (starting at 1) of the location start

The Parsable trait also provides a location() method:

  • on a structure, it returns its location field
  • on an enum, it returns the location of the variant that was matched
  • calling location() on a variant with no field panics

A way to prevent the panic is to wrap enums with unit variants in a structure:

use parsable::{parsable, Parsable, ParseOptions};

#[parsable]
enum Operator {
    Plus = "+",
    Minus = "-",
    Mult = "*",
    Div = "/",
    Mod = "%"
}

#[parsable]
struct WrappedOperator {
    operator: Operator
}

fn main() {
    let string = "+".to_string();
    let options = ParseOptions::default();
    let result = WrappedOperator::parse(string, options).unwrap();

    dbg!(result.location()); // It works!
}

ParseError

On failure, Parsable::parse() returns Err(ParseError). This structure has the following fields:

  • file: Rc<FileInfo>: the file where the error occurred.
  • index: usize: the index at which the error occurred.
  • expected: Vec<String>: a list of item names that were expected at this index.

Macro options

Root attributes

  • located=<bool>: on a structure, indicates whether or not the location field should be generated. Default: true.
  • cascade=<bool>: if true on a structure, indicates that if an Option field is not matched, then the parser should not attempt to match other Option fields. It does not invalidate the overall struct parsing. Default: false.
  • name=<string>: indicates the name of the struct or enum, which is used when a parsing error occurs. Default: the name of the struct or enum.

#[parsable(located=false)] // The `location` field will not be added
struct Operation {
    first_operand: Operand,
    other_operands: Vec<(Operator, Operand)>
}

Field attributes

  • prefix=<string>: attempt to parse the specified string before parsing the field. If the prefix parsing fails, then the field parsing fails.
  • suffix=<string>: attempt to parse the specified string after parsing the field. If the suffix parsing fails, then the field parsing fails.
  • brackets=<string>: shortcut to specify both a prefix and a suffix using the first two characters of the specified string.
  • exclude=<string>: indicates that the parsing is only valid if the item does not match the specified regex.
  • followed_by=<string>: indicates that the parsing is only valid if the item is followed by the specified regex.
  • not_followed_by=<string>: indicates that the parsing is only valid if the item is not followed by the specified regex.
  • value=<string>: on a String field, indicates that the field only matches the specified string.
  • regex=<string>: on a String field, indicates that the field only matches the regex with the specified pattern (using the regex crate).
  • separator=<string>: on a Vec field, specify the separator between items.
  • min=<integer>: on a Vec field, specify the minimum amount of items for the parsing to be valid.
  • cascade=false: indicates that this field ignores the root cascade option.
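
The lookahead options (followed_by and not_followed_by) check what comes after an item without consuming it. The sketch below shows the idea by hand, in the spirit of a not_followed_by check; it substitutes a character predicate for a regex and does not use the crate. The name read_keyword is hypothetical.

```rust
/// Matches `keyword` at the start of `input` only if the next character is
/// not an identifier character. Returns the remaining input on success.
/// The lookahead inspects the next character but never consumes it.
fn read_keyword<'a>(input: &'a str, keyword: &str) -> Option<&'a str> {
    let rest = input.strip_prefix(keyword)?;
    let next_is_identifier = rest
        .chars()
        .next()
        .map_or(false, |c| c.is_alphanumeric() || c == '_');
    if next_is_identifier {
        None
    } else {
        Some(rest)
    }
}
```

This kind of check is what keeps a keyword such as let from matching the beginning of a longer identifier such as letter.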

Manually implementing the Parsable trait

Sometimes #[parsable] is not enough and you want to implement your own parsing mechanism. This is done by implementing the parse_item, get_item_name and location methods.

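// `ItemLocation` and `ParseOptions` are used below and are assumed to be
// exported at the crate root, like `Parsable` and `StringReader`.
use parsable::{ItemLocation, Parsable, ParseOptions, StringReader};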

struct MyInteger {
    value: u32,
    location: ItemLocation,
}

impl Parsable for MyInteger {
    fn parse_item(reader: &mut StringReader) -> Option<Self> {
        let start = reader.get_index();

        match reader.read_regex(r"\d+") {
            Some(string) => Some(MyInteger {
                value: string.parse().unwrap(),
                location: reader.get_item_location(start),
            }),
            None => None,
        }
    }

    // Only used in errors
    fn get_item_name() -> String {
        "integer".to_string()
    }

    // Not required, but convenient
    fn location(&self) -> &ItemLocation {
        &self.location
    }
}

fn main() {
    let number_string = "56";
    let number = MyInteger::parse(number_string.to_string(), ParseOptions::default()).unwrap();
    println!("{}", number.value);
}

StringReader wraps the string being parsed with an index that increases as the parsing goes on. It has the following methods:

  • content() -> &str: returns the whole string
  • get_index() -> usize: returns the current index in the string
  • set_index(index: usize) -> usize: sets the current index in the string
  • as_str() -> &str: returns the part of the string that has not been parsed yet (same as &self.content()[self.get_index()..])
  • as_char() -> char: returns the current character (same as &self.content().as_bytes()[self.get_index()])
  • is_finished() -> bool: indicates whether the end of the string has been reached
  • advance(length: usize) -> Option<&str>: advances the current index by length and returns the corresponding substring. If length is 0, returns None
  • eat_spaces(): advances the current index until a non-blank and non-comment character is reached
  • read_string(string: &str) -> Option<&str>: if the remaining input starts with string, advances the current index by string's length and returns it, otherwise returns None
  • read_regex(pattern: &'static str) -> Option<&str>: if the remaining input starts with the specified regex pattern, advances the current index by the matched string's length and returns it, otherwise returns None
  • peek_regex(pattern: &'static str) -> bool: indicates if the string starts with the specified regex pattern, without advancing the current index

If parse_item returns None, it must ensure that the index is the same when the function exits as it was when it started.
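
The index-restoring rule matters as soon as an item is made of several reads, because an early read may succeed before a later one fails. The sketch below illustrates the pattern with a hypothetical MiniReader, a small stand-in that is not the crate's StringReader: the starting index is saved, and it is put back if any step fails.

```rust
/// Hypothetical miniature reader: a string plus a cursor.
struct MiniReader {
    content: String,
    index: usize,
}

impl MiniReader {
    fn new(content: &str) -> Self {
        MiniReader { content: content.to_string(), index: 0 }
    }

    /// Consumes `expected` if the remaining input starts with it.
    fn read_string(&mut self, expected: &str) -> bool {
        if self.content[self.index..].starts_with(expected) {
            self.index += expected.len();
            true
        } else {
            false
        }
    }

    /// Consumes leading ASCII digits and returns them, or `None` if there are none.
    fn read_digits(&mut self) -> Option<String> {
        let rest = &self.content[self.index..];
        let end = rest
            .find(|c: char| !c.is_ascii_digit())
            .unwrap_or(rest.len());
        if end == 0 {
            return None;
        }
        let digits = rest[..end].to_string();
        self.index += end;
        Some(digits)
    }
}

/// Parses `(<digits>)`. On any failure the cursor is put back where it started.
fn parse_wrapped_number(reader: &mut MiniReader) -> Option<u32> {
    let start = reader.index;
    let parsed = parse_wrapped_number_inner(reader);
    if parsed.is_none() {
        reader.index = start; // undo partial progress
    }
    parsed
}

/// Performs the individual reads; may leave the cursor advanced on failure.
fn parse_wrapped_number_inner(reader: &mut MiniReader) -> Option<u32> {
    if !reader.read_string("(") {
        return None;
    }
    let value: u32 = reader.read_digits()?.parse().ok()?;
    if !reader.read_string(")") {
        return None;
    }
    Some(value)
}
```

Without the restore step, parsing "(12" would fail with the cursor left at index 3, and any alternative tried afterwards would start from the wrong position.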

License

MIT