tokenizer_py

crate with a tokenizer that works like a Python tokenizer

6 releases

0.2.0 Feb 24, 2024
0.1.4 Feb 22, 2024

282 downloads per month

MIT/Apache

42KB
739 lines

Python-like Tokenizer in Rust

This project implements a Python-like tokenizer in Rust. It can tokenize a string into a sequence of tokens, which are represented by the Token enum. The supported tokens are:

  • Token::Name: a name token, such as a function or variable name.
  • Token::Number: a number token, such as a literal integer or floating-point number.
  • Token::String: a string token, such as a single or double-quoted string.
  • Token::OP: an operator token, such as an arithmetic or comparison operator.
  • Token::Indent: an indent token, indicating that a block of code is being indented.
  • Token::Dedent: a dedent token, indicating that a block of code is being dedented.
  • Token::Comment: a comment token, i.e. a single-line comment starting with #.
  • Token::NewLine: a newline token, indicating a new line in the source code.
  • Token::NL: a newline that does not end a logical line (for example a blank line), kept for compatibility with Python's original tokenizer.
  • Token::EndMarker: an end-of-file marker.
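
As a rough illustration of how a consumer might separate layout tokens from content tokens, here is a minimal sketch. It uses a local stand-in enum rather than the crate's own type. The variant names follow the list above. The String payloads on Name, Number and OP and the unit form of NewLine and EndMarker match the examples in this README, while the payloads of the remaining variants are assumptions. The helpers is_layout and content_tokens are hypothetical and not part of the crate.

```rust
/// Hypothetical stand-in mirroring the variant names listed above.
/// Payloads of String, Indent, Dedent, Comment and NL are assumptions.
#[derive(Debug, Clone, PartialEq)]
enum Token {
    Name(String),
    Number(String),
    String(String),
    OP(String),
    Indent(String),
    Dedent,
    Comment(String),
    NewLine,
    NL,
    EndMarker,
}

/// True for tokens that describe layout rather than content.
fn is_layout(token: &Token) -> bool {
    matches!(
        token,
        Token::Indent(_) | Token::Dedent | Token::NewLine | Token::NL | Token::EndMarker
    )
}

/// Keep only content-carrying tokens (names, numbers, strings, operators, comments).
fn content_tokens(tokens: &[Token]) -> Vec<Token> {
    tokens.iter().filter(|t| !is_layout(t)).cloned().collect()
}
```

Filtering like this is one way to avoid stripping the trailing NewLine and EndMarker tokens by hand before inspecting the rest of the stream.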

The tokenizer recognizes the following tokens:

  • Whitespace: spaces, tabs, and newlines.
  • Numbers: integers, floating-point numbers, and complex numbers.
    • float: floating-point numbers.
    • int: integer numbers.
    • complex: complex numbers.
  • Names: identifiers and keywords.
  • Strings: single- and double-quoted strings.
    • basic-String: single- and double-quoted strings.
    • format-String: Python format strings (f-strings).
    • byte-String: Python byte strings.
    • raw-String: raw strings.
    • multi-line-String: single- and double-quoted multi-line strings.
    • combined-string: strings with a combined prefix.
  • Operators: arithmetic, comparison, and other operators.
  • Comments: single-line comments.
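
To make the categories above concrete, the following is a deliberately small scanner sketch. It is not the crate's implementation: the Lexeme enum and the scan function are hypothetical names, and the sketch only covers names, simple numbers, and a handful of operators (no strings, comments, or indentation). It also accepts malformed numbers such as 1.2.3, which a real tokenizer would reject.

```rust
/// Hypothetical lexeme categories for this sketch (not the crate's Token enum).
#[derive(Debug, PartialEq)]
enum Lexeme {
    Name(String),
    Number(String),
    Op(String),
}

/// Scan `src` left to right, grouping characters into lexemes.
fn scan(src: &str) -> Result<Vec<Lexeme>, String> {
    let chars: Vec<char> = src.chars().collect();
    let mut out = Vec::new();
    let mut i = 0;
    while i < chars.len() {
        let c = chars[i];
        let start = i;
        if c.is_whitespace() {
            // Whitespace only separates lexemes in this sketch.
            i += 1;
        } else if c.is_alphabetic() || c == '_' {
            // Names: a letter or underscore followed by letters, digits, underscores.
            while i < chars.len() && (chars[i].is_alphanumeric() || chars[i] == '_') {
                i += 1;
            }
            out.push(Lexeme::Name(chars[start..i].iter().collect()));
        } else if c.is_ascii_digit() {
            // Numbers: digits and dots (simplified, no exponent or suffix handling).
            while i < chars.len() && (chars[i].is_ascii_digit() || chars[i] == '.') {
                i += 1;
            }
            out.push(Lexeme::Number(chars[start..i].iter().collect()));
        } else if "+-*/%<>=".contains(c) {
            i += 1;
            // Two-character comparison operators: "<=", ">=", "==".
            if i < chars.len() && chars[i] == '=' && "<>=".contains(c) {
                i += 1;
            }
            out.push(Lexeme::Op(chars[start..i].iter().collect()));
        } else {
            return Err(format!("unexpected character {c:?} at offset {start}"));
        }
    }
    Ok(out)
}
```

For instance, scan("x = 10 + 2.5") yields a name, an operator, a number, an operator, and a number, while an unsupported character such as $ produces an Err.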

The crate also provides a tokenize function that takes a string as input and returns a Result containing a vector of tokens.

Usage

Add this to your Cargo.toml:

[dependencies]
tokenizer_py = "0.2.0"

Examples

Example of using the tokenizer to tokenize the string "hello world"

use tokenizer_py::{tokenize, Token};

let tokens = tokenize("hello world").unwrap();
assert_eq!(tokens, vec![
    Token::Name("hello".to_string()), // Token of the name "hello"
    Token::Name("world".to_string()), // Token of the name "world"
    Token::NewLine, // New line token
    Token::EndMarker, // End of text token
]);

Example of using the BinaryExp structure to evaluate the binary expression "10 + 10"

use tokenizer_py::{tokenize, Token};

// Structure representing a binary expression 
struct BinaryExp {
    left: Token,
    center: Token,
    right: Token,
}

impl BinaryExp {
    // Method for creating a new instance of BinaryExp
    fn new(left: Token, center: Token, right: Token) -> Self {
        BinaryExp { left, center, right }
    }
    // Method for executing the binary expression
    fn execute(&self) -> Result<isize, <isize as std::str::FromStr>::Err> {
        use Token::{Number, OP};
        match (&self.left, &self.center, &self.right) {
            (Number(left), OP(op), Number(right)) => {
                let (left, right) = (
                    left.parse::<isize>()?, right.parse::<isize>()?
                );
                match op.as_str() {
                    "+" => Ok(left + right),
                    "-" => Ok(left - right),
                    "*" => Ok(left * right),
                    "/" => Ok(left / right),
                    "%" => Ok(left % right),
                    _ => panic!("Invalid operator"), // Invalid operator
                }
            }
            _ => panic!("Invalid tokens"), // Invalid tokens
        }
    }
}
let mut tokens = tokenize("10 + 10").unwrap();
let _ = tokens.pop(); // Remove Token::EndMarker
let _ = tokens.pop(); // Remove Token::NewLine
// pop() takes from the end, so the tokens come out right to left
let right = tokens.pop().unwrap();
let center = tokens.pop().unwrap();
let left = tokens.pop().unwrap();
let binexp = BinaryExp::new(left, center, right);
assert_eq!(binexp.execute(), Ok(20)); // Checking the execution result
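
The same evaluation can also be sketched with a slice pattern, which keeps the operands in source order and returns None instead of panicking on unexpected input. The Token enum below is a local stand-in containing only the variant shapes used in the examples above (the real tokenizer_py::Token has more variants), and eval_binary is a hypothetical helper, not part of the crate.

```rust
// Stand-in with the variant shapes used in the examples above.
#[derive(Debug, Clone, PartialEq)]
enum Token {
    Name(String),
    Number(String),
    OP(String),
    NewLine,
    EndMarker,
}

/// Evaluate `<int> <op> <int>` from a token slice.
/// Returns None for any other shape, parse failure, overflow, or division by zero.
fn eval_binary(tokens: &[Token]) -> Option<isize> {
    use Token::{EndMarker, NewLine, Number, OP};
    match tokens {
        [Number(left), OP(op), Number(right), NewLine, EndMarker] => {
            let left: isize = left.parse().ok()?;
            let right: isize = right.parse().ok()?;
            match op.as_str() {
                "+" => left.checked_add(right),
                "-" => left.checked_sub(right),
                "*" => left.checked_mul(right),
                "/" => left.checked_div(right),
                "%" => left.checked_rem(right),
                _ => None, // Unsupported operator
            }
        }
        _ => None, // Not a simple binary expression
    }
}
```

With the real crate, the slice passed in would be the vector returned by tokenize, assuming it has the Number, OP, Number, NewLine, EndMarker shape seen in the example above.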

No runtime deps