#lexer #matlab #syntax #octave #run-mat #transpose #marker #interpreter #token-stream

runmat-lexer

Lexer for the RunMat language (MATLAB/Octave syntax) built with logos

14 unstable releases (3 breaking)

new 0.4.1 Apr 16, 2026
0.4.0 Apr 13, 2026
0.3.2 Mar 24, 2026
0.2.8 Dec 22, 2025
0.0.17 Oct 15, 2025

#2081 in Math

76 downloads per month
Used in 15 crates (5 directly)

MIT license

34KB
654 lines

RunMat Lexer

This crate tokenizes MATLAB/Octave source code into a stream of tokens for the parser. It uses the logos library to define a fast, zero-copy DFA with a small amount of context via LexerExtras to handle MATLAB-specific ambiguities.

Design goals

  • Correct tokenization for the full MATLAB language surface
  • Minimal, explicit state for disambiguation (apostrophe transpose vs string, section markers, etc.)
  • Compatibility with the rest of the toolchain (parser, HIR, interpreter, JIT)
  • Predictable tokens: avoid over-encoding semantics at the lexing stage

Context-aware lexing

We track two pieces of context in LexerExtras:

  • last_was_value: bool. True if the previously emitted token forms a value. Used to decide whether ' is a transpose or the start of a string.
  • line_start: bool. True if we are at the beginning of a logical line. Used to recognize %% section markers.
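
As a rough illustration of how these two flags might evolve from token to token, here is a minimal std-only sketch. It does not use logos and is not the crate's actual code: the Extras and Kind types and the after method are hypothetical names, and the initial values and the update rule are assumptions inferred from the descriptions above.

```rust
// Hypothetical mirror of the two context flags; field names follow the README,
// everything else is illustrative.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub struct Extras {
    pub last_was_value: bool,
    pub line_start: bool,
}

impl Default for Extras {
    fn default() -> Self {
        // Assumption: at the start of input nothing has been emitted yet
        // and we are at the beginning of a line.
        Extras { last_was_value: false, line_start: true }
    }
}

// A deliberately tiny token classification, just enough to drive the flags.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Kind {
    Ident,
    Number,
    RParen,
    RBracket,
    RBrace,
    Operator,
    Newline,
}

impl Extras {
    // Returns the flags as they would look after emitting a token of `kind`.
    pub fn after(self, kind: Kind) -> Extras {
        Extras {
            // Value-forming tokens: identifier, number, or a closing delimiter.
            last_was_value: matches!(
                kind,
                Kind::Ident | Kind::Number | Kind::RParen | Kind::RBracket | Kind::RBrace
            ),
            // Only a newline puts us back at the start of a line.
            line_start: kind == Kind::Newline,
        }
    }
}
```

Keeping the state to two booleans means each token's effect on the context can be described by one small pure function, which is easy to test in isolation.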

Tokens overview

  • Keywords: function if elseif else for while break continue return end
  • Additional keywords: switch case otherwise try catch global persistent true false
  • OOP keywords: classdef properties methods events enumeration arguments
  • Import: import
  • Identifiers: [A-Za-z_][A-Za-z0-9_]*
  • Numbers: integers and floats with optional exponents
  • Strings:
    • Single-quoted character arrays: '...' with doubled quotes '' inside
    • Double-quoted string scalars: "..." with doubled quotes "" inside
  • Operators and punctuation:
    • Arithmetic: + - * / \ ^
    • Element-wise: .* ./ .\ .^
    • Relational: == ~= < <= > >=
    • Logical: && || & | ~
    • Transpose: ' (contextual)
    • Colon: :
    • Dotted member access: .
    • Function handle/anonymous: @
    • Meta-class query: ? (e.g., ?MyClass)
    • Assignment and separators: = , ;
    • Grouping and containers: () [] {}
  • Comments & layout:
    • Line comment: % to end of line
    • Section marker: %% at start of line
    • Block comment: %{ ... %} (non-nesting)
    • Line continuation: ... (skips remainder of physical line)
    • Newlines set line_start back to true
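
The doubled-quote rule for both string forms can be sketched with a small std-only scanner. This is an illustration rather than the crate's regex: quoted_len is a hypothetical helper, and the choice to stop at a newline is an assumption of this sketch.

```rust
/// Returns the byte length of a quoted literal at the start of `src`, or
/// `None` if `src` does not start with `quote` or the literal is not closed
/// before the end of the line. A doubled quote inside the literal counts as
/// one escaped quote character.
pub fn quoted_len(src: &str, quote: char) -> Option<usize> {
    let mut chars = src.char_indices().peekable();
    match chars.next() {
        Some((_, c)) if c == quote => {}
        _ => return None,
    }
    while let Some((i, c)) = chars.next() {
        if c == '\n' {
            // Assumption: a literal never spans physical lines.
            return None;
        }
        if c == quote {
            // A second quote right after this one is an escape, not a terminator.
            if let Some(&(_, next)) = chars.peek() {
                if next == quote {
                    chars.next();
                    continue;
                }
            }
            return Some(i + c.len_utf8());
        }
    }
    None
}
```

The same function serves '...' and "..." literals by passing the quote character, which mirrors how the two forms share one escaping rule.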

Notable disambiguations

  • Apostrophe ':
    • If the previous token was a value (an identifier, a number, or a closing ), ], or }), emit Transpose
    • Otherwise, let the string regex capture a full single-quoted character array
  • Section %%:
    • Only emitted when line_start == true; otherwise % starts a normal line comment
  • Line continuation ...:
    • Emits Ellipsis and consumes the remainder of the physical line, including any % comment following it
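
The three rules above can be combined in a small hand-rolled scanner. The sketch below is std-only and is not the crate's logos-based implementation: the Tok enum, lex, and the helper functions are hypothetical. Several details go beyond what the README states and are assumptions: blanks and Ellipsis leave the flags untouched, the continuation also consumes the newline, the text after %% is skipped, strings do not count as values, and numbers are plain digit runs.

```rust
#[derive(Debug, Clone, PartialEq, Eq)]
pub enum Tok {
    Ident(String),
    Number(String),
    Str(String),
    Transpose,
    Section,
    Ellipsis,
    Newline,
    Other(char),
}

// True if the characters starting at index `i` spell out `s`.
fn starts_with(cs: &[char], i: usize, s: &str) -> bool {
    cs[i..].iter().copied().take(s.len()).eq(s.chars())
}

// Index of the next '\n' at or after `from`, or the end of input.
fn eol(cs: &[char], from: usize) -> usize {
    (from..cs.len()).find(|&k| cs[k] == '\n').unwrap_or(cs.len())
}

pub fn lex(src: &str) -> Vec<Tok> {
    let cs: Vec<char> = src.chars().collect();
    let mut out = Vec::new();
    let (mut i, mut last_was_value, mut line_start) = (0usize, false, true);
    while i < cs.len() {
        let c = cs[i];
        if c == ' ' || c == '\t' {
            i += 1; // assumption: blanks leave both flags untouched
            continue;
        }
        if c == '\n' {
            out.push(Tok::Newline);
            i += 1;
            last_was_value = false;
            line_start = true;
            continue;
        }
        if starts_with(&cs, i, "...") {
            // Continuation: drop the rest of the physical line, comment included.
            out.push(Tok::Ellipsis);
            i = (eol(&cs, i) + 1).min(cs.len());
            continue; // flags unchanged, the logical line goes on
        }
        if c == '%' {
            // %% is a section marker only at the start of a line.
            if line_start && starts_with(&cs, i, "%%") {
                out.push(Tok::Section);
            }
            i = eol(&cs, i);
            last_was_value = false;
            line_start = false;
            continue;
        }
        let value;
        if c == '\'' && last_was_value {
            out.push(Tok::Transpose);
            i += 1;
            value = true; // assumption: a transposed value is still a value
        } else if c == '\'' {
            // Not after a value: try to read a whole single-quoted literal.
            let (mut j, mut text, mut closed) = (i + 1, String::new(), false);
            while j < cs.len() && cs[j] != '\n' {
                if cs[j] != '\'' {
                    text.push(cs[j]);
                    j += 1;
                } else if j + 1 < cs.len() && cs[j + 1] == '\'' {
                    text.push('\''); // doubled quote is an escaped quote
                    j += 2;
                } else {
                    closed = true;
                    j += 1;
                    break;
                }
            }
            if closed {
                out.push(Tok::Str(text));
                i = j;
            } else {
                out.push(Tok::Other('\''));
                i += 1;
            }
            value = false;
        } else if c.is_ascii_alphabetic() || c == '_' {
            let j = (i..cs.len())
                .find(|&k| !(cs[k].is_ascii_alphanumeric() || cs[k] == '_'))
                .unwrap_or(cs.len());
            out.push(Tok::Ident(cs[i..j].iter().collect()));
            i = j;
            value = true;
        } else if c.is_ascii_digit() {
            let j = (i..cs.len()).find(|&k| !cs[k].is_ascii_digit()).unwrap_or(cs.len());
            out.push(Tok::Number(cs[i..j].iter().collect()));
            i = j;
            value = true;
        } else {
            out.push(Tok::Other(c));
            i += 1;
            value = matches!(c, ')' | ']' | '}');
        }
        last_was_value = value;
        line_start = false;
    }
    out
}
```

With this sketch, a' yields an identifier followed by Transpose, while x = 'it''s' yields a single Str token, because the = before the quote is not a value.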

Non-goals at lexing time

The lexer purposefully does not encode high-level semantics:

  • Integer class names like int8/uint64 are identifiers
  • Special variables like varargin/varargout/ans are identifiers
  • OOP features (handle inheritance, method attributes) are parsed/handled later
  • Command/function syntax duality is resolved in parsing/semantic phases
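
To make the first two points concrete, here is a minimal std-only sketch of a keyword lookup in which anything outside the keyword table stays an identifier. The keyword list is taken from the tokens overview above; WordKind, KEYWORDS, and classify are hypothetical names and not part of the crate's API.

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum WordKind {
    Keyword,
    Identifier,
}

// Keywords as listed in the tokens overview of this README.
const KEYWORDS: &[&str] = &[
    "function", "if", "elseif", "else", "for", "while", "break", "continue",
    "return", "end", "switch", "case", "otherwise", "try", "catch", "global",
    "persistent", "true", "false", "classdef", "properties", "methods",
    "events", "enumeration", "arguments", "import",
];

// Classifies a word that already matched the identifier pattern.
// The comparison is case-sensitive, as MATLAB keywords are lowercase.
pub fn classify(word: &str) -> WordKind {
    if KEYWORDS.contains(&word) {
        WordKind::Keyword
    } else {
        WordKind::Identifier
    }
}
```

Names such as int8, varargin, or ans fall through to Identifier here, leaving their meaning to later stages.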

Tests

See tests/ for comprehensive coverage, organized by topic:

  • lexer.rs: core tokens, operators, single-quoted strings, comments, ellipsis
  • transpose.rs: detailed diagnostics and assertions for apostrophe (') transpose cases
  • comments_continuation.rs: % line comments, %{...%} block comments, %% section markers, ... continuation
  • operators.rs: logical and element-wise operators (e.g., .* ./ .\ .^ && || & | ~)
  • namespaces.rs: import paths (including wildcard) and metaclass ?ClassName
  • oop_tokens.rs: OOP keywords (classdef, properties, methods, events, enumeration, arguments) and function handles @
  • strings_chars.rs: double-quoted string scalars and apostrophe disambiguation exercises
  • tokens_basic.rs: identifiers, numbers, separators (; ,), and simple keyword smoke tests

All lexer tests pass when running the crate tests on their own.

Guidelines for extending the lexer

  • Prefer adding new tokens only when lexical distinctions are required.
  • When in doubt, keep ambiguous terms as identifiers and resolve in the parser.
  • If you need context to disambiguate, add a boolean/flag in LexerExtras and use a Logos callback to Emit or Skip appropriately.
  • Keep regular expressions simple (no look-around) and rely on token priority and callbacks for precedence and control.

Known compatibility notes

  • Non-conjugate transpose .' is tokenized as Dot then Transpose. The parser should interpret this pair as the non-conjugating transpose.
  • Block comments %{...%} are treated as non-nesting by design.
  • Error recovery keeps the lexer producing useful tokens after invalid input. In recovery mode, double-quoted strings are recognized as a single Str token, while malformed single-quoted sequences may be split so that downstream stages can report the error.
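
For the first note, a parser could fuse the Dot and Transpose pair roughly as follows. This is a std-only sketch of the idea, not the code in runmat-parser: the Tok and PostfixOp enums and postfix_ops are hypothetical, and the input is assumed to be the tokens that follow a primary expression.

```rust
#[derive(Debug, Clone, PartialEq, Eq)]
pub enum Tok {
    Ident(String),
    Dot,
    Transpose,
}

#[derive(Debug, Clone, PartialEq, Eq)]
pub enum PostfixOp {
    Member(String),
    ConjugateTranspose,    // a'
    NonConjugateTranspose, // a.'
}

// Turns the tokens after a primary expression into postfix operations,
// reading Dot followed by Transpose as one non-conjugating transpose.
pub fn postfix_ops(toks: &[Tok]) -> Result<Vec<PostfixOp>, String> {
    let mut ops = Vec::new();
    let mut i = 0;
    while i < toks.len() {
        match (&toks[i], toks.get(i + 1)) {
            (Tok::Dot, Some(Tok::Transpose)) => {
                ops.push(PostfixOp::NonConjugateTranspose);
                i += 2;
            }
            (Tok::Dot, Some(Tok::Ident(name))) => {
                ops.push(PostfixOp::Member(name.clone()));
                i += 2;
            }
            (Tok::Transpose, _) => {
                ops.push(PostfixOp::ConjugateTranspose);
                i += 1;
            }
            (tok, _) => {
                return Err(format!("unexpected token {:?} at index {}", tok, i));
            }
        }
    }
    Ok(ops)
}
```

Checking the Dot and Transpose pair before the lone Transpose case keeps a.' from being read as a member access followed by a conjugate transpose.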

Remaining edges

  • Apostrophe vs string: extreme adjacency cases across ... continuation and % comments are covered by tests. A few rare permutations may still be added as test seeds; parser semantics are unaffected.
  • Block comments are intentionally non-nesting; any future change would be a parser/runtime decision, not lexing.
  • Command-form syntax is resolved in the parser; the lexer's role is complete for the current milestone.

Crate integration

  • This crate only produces tokens; it does not attempt to validate grammar.
  • Downstream crates (runmat-parser, runmat-hir, runmat-ignition, runmat-turbine) are responsible for structure and semantics.
