| Version | Released |
|---|---|
| 0.2.2 | Jan 9, 2026 |
| 0.2.0 | Jan 8, 2026 |
| 0.1.2 | Jan 7, 2026 |
| 0.1.1 | Jan 6, 2026 |
| 0.1.0-alpha.7 | Dec 31, 2025 |
Rift Search Specification
This document outlines the regular expression syntax and features supported by Rift's search engine.
Usage
Add monster-regex to your Cargo.toml:
[dependencies]
monster-regex = "0.2.2"
Basic Example (Backtracking Engine)
By default, Regex::new uses the BacktrackingRegexEngine. This engine supports advanced features like lookarounds and backreferences but may have exponential runtime on pathological patterns.
use monster_regex::{Regex, Flags};

fn main() {
    // Compile using the default backtracking engine
    let re = Regex::new(r"\w+", Flags::default()).unwrap();
    assert!(re.is_match("hello"));

    // Find a match
    if let Some(m) = re.find("hello world") {
        println!("Found match at {}-{}", m.start, m.end); // 0-5
    }
}
Linear Engine (O(n))
For performance-critical code where O(n) guarantees are required, use the LinearRegexEngine (based on PikeVM). Note that this engine does not support lookarounds or backreferences.
use monster_regex::{Regex, Flags};

fn main() {
    // Explicit constructor for the linear engine
    let re = Regex::new_linear(r"a.*b", Flags::default()).unwrap();
    assert!(re.is_match("abbb"));
}
Dynamic Engine Dispatch
You can switch between engines at runtime using AnyRegexEngine. This allows you to choose the best engine for the pattern or use case.
use monster_regex::engine::{
    AnyRegexEngine, RegexEngine, CompiledRegex,
    backtracking::BacktrackingRegexEngine,
    linear::LinearRegexEngine,
};
use monster_regex::Flags;

fn main() {
    let use_linear = true;
    let flags = Flags::default();
    let pattern = "abc";

    // Type-erased engine trait object
    let engine: Box<dyn RegexEngine<Regex = Box<dyn CompiledRegex>>> = if use_linear {
        Box::new(AnyRegexEngine(LinearRegexEngine))
    } else {
        Box::new(AnyRegexEngine(BacktrackingRegexEngine))
    };

    // Compile returns a Box<dyn CompiledRegex>
    let regex = engine.compile(pattern, flags).unwrap();
    assert!(regex.is_match("abc"));
}
Architecture & Traits
monster-regex exposes two key traits for compiled regexes:
- CompiledRegex: Object-safe trait containing the core methods (is_match, find, captures, replace). Usable with &str. This is the return type when using dynamic dispatch.
- CompiledRegexHaystack: Generic trait extending CompiledRegex for streaming support via the Haystack trait. Not object-safe.
When using dynamic dispatch (Box<dyn CompiledRegex>), you are limited to the methods in CompiledRegex (string-based) and cannot use the streaming Haystack API directly on the trait object.
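The split between an object-safe trait and a generic extension trait can be illustrated with plain Rust. This is a minimal sketch using hypothetical stand-ins (Hay, Matcher, MatcherHay, Literal, boxed); none of these are the crate's actual definitions.

```rust
// Hypothetical stand-ins; these are not the crate's actual trait definitions.

/// A minimal "haystack" abstraction: anything that can report its characters.
trait Hay {
    fn chars_vec(&self) -> Vec<char>;
}

impl Hay for &str {
    fn chars_vec(&self) -> Vec<char> {
        self.chars().collect()
    }
}

/// Object-safe: every method takes concrete types, so `dyn Matcher` works.
trait Matcher {
    fn is_match(&self, text: &str) -> bool;
}

/// Not object-safe: the generic method has no single vtable entry.
trait MatcherHay: Matcher {
    fn is_match_from<H: Hay>(&self, hay: H) -> bool;
}

/// Toy matcher that looks for a literal substring.
struct Literal(String);

impl Matcher for Literal {
    fn is_match(&self, text: &str) -> bool {
        text.contains(&self.0)
    }
}

impl MatcherHay for Literal {
    fn is_match_from<H: Hay>(&self, hay: H) -> bool {
        let text: String = hay.chars_vec().into_iter().collect();
        self.is_match(&text)
    }
}

/// Type-erased construction: only `Matcher` methods remain callable.
fn boxed(literal: &str) -> Box<dyn Matcher> {
    Box::new(Literal(literal.to_string()))
}
```

In this sketch, the value returned by boxed can call is_match but not is_match_from, which mirrors the limitation described above.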
Using Flags
You can configure behavior using Flags:
use monster_regex::{Regex, Flags};

fn main() {
    let mut flags = Flags::default();
    flags.ignore_case = Some(true); // Case insensitive
    flags.multiline = true; // ^ and $ match line boundaries

    let re = Regex::new(r"^hello", flags).unwrap();
    assert!(re.is_match("HELLO\nworld"));
}
Parsing Rift Format
You can also parse patterns in the pattern/flags format used by Rift:
use monster_regex::parse_rift_format;
use monster_regex::Regex;

fn main() {
    let (pattern, flags) = parse_rift_format("abc/i").unwrap();
    let re = Regex::new(&pattern, flags).unwrap();
    assert!(re.is_match("ABC"));
}
Find All
use monster_regex::{Regex, Flags};

fn main() {
    let re = Regex::new(r"\d+", Flags::default()).unwrap();
    let text = "123 abc 456";

    for m in re.find_all(text) {
        println!("Match: {}", &text[m.start..m.end]);
    }
}
Replacement
use monster_regex::{Regex, Flags};

fn main() {
    let re = Regex::new(r"foo", Flags::default()).unwrap();

    // Replace first occurrence only
    let result = re.replace("foo bar foo", "baz");
    assert_eq!(result, "baz bar foo");

    // Replace all occurrences
    let result = re.replace_all("foo bar foo", "baz");
    assert_eq!(result, "baz bar baz");
}
Captures Iterator
use monster_regex::{Regex, Flags};

fn main() {
    let re = Regex::new(r"(\w+)@(\w+)", Flags::default()).unwrap();
    let text = "alice@home bob@work";

    for caps in re.captures_all(text) {
        println!("Full match: {:?}", caps.full_match);
        println!("Groups: {:?}", caps.groups);
    }
}
Inspecting Pattern and Flags
use monster_regex::{Regex, Flags};

fn main() {
    let mut flags = Flags::default();
    flags.ignore_case = Some(true);
    let re = Regex::new(r"hello", flags).unwrap();

    // Access the original pattern
    assert_eq!(re.pattern(), "hello");

    // Access the flags used during compilation
    assert_eq!(re.flags().ignore_case, Some(true));
}
Streaming / Zero-Copy Search
For advanced use cases like searching non-contiguous memory (ropes, gap buffers) without allocation, implement the Haystack trait:
use monster_regex::{Regex, Haystack};

#[derive(Copy, Clone)]
struct MyRope<'a> {
    // ... custom internal structure
    phantom: std::marker::PhantomData<&'a ()>,
}

impl<'a> Haystack for MyRope<'a> {
    fn len(&self) -> usize { /* ... */ }
    fn char_at(&self, pos: usize) -> Option<(char, usize)> { /* ... */ }
    fn char_before(&self, pos: usize) -> Option<char> { /* ... */ }
    fn matches_range(&self, pos: usize, other_start: usize, other_end: usize) -> bool { /* ... */ }
    fn starts_with(&self, pos: usize, literal: &str) -> bool { /* ... */ }
}

fn main() {
    let rope = MyRope { /* ... */ };
    let re = Regex::new("pattern", Default::default()).unwrap();

    // Check if pattern matches anywhere
    if re.is_match_from(rope) {
        println!("Found a match!");
    }

    // Find first match
    if let Some(m) = re.find_from(rope) {
        println!("Match at {}-{}", m.start, m.end);
    }

    // Find match starting at a specific offset
    if let Some(m) = re.find_from_at(rope, 10) {
        println!("Match starting from offset 10: {}-{}", m.start, m.end);
    }

    // Iterate all matches
    for m in re.find_all_from(rope) {
        // ...
    }
}
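The method bodies above are elided, so here is one way a chunked haystack might answer len and char_at. This is a minimal sketch under stated assumptions: positions are byte offsets, the usize returned by char_at is the character's UTF-8 length, and HaystackSketch and Chunks are hypothetical stand-ins rather than the crate's Haystack trait.

```rust
/// Hypothetical stand-in mirroring two of the method signatures shown above.
trait HaystackSketch {
    fn len(&self) -> usize;
    /// Assumed meaning: the char starting at byte offset `pos` and its UTF-8 length.
    fn char_at(&self, pos: usize) -> Option<(char, usize)>;
}

/// A rope-like haystack: text stored as separate chunks, never concatenated.
#[derive(Copy, Clone)]
struct Chunks<'a> {
    parts: &'a [&'a str],
}

impl<'a> HaystackSketch for Chunks<'a> {
    fn len(&self) -> usize {
        self.parts.iter().map(|p| p.len()).sum()
    }

    fn char_at(&self, pos: usize) -> Option<(char, usize)> {
        let mut offset = pos;
        for part in self.parts {
            if offset < part.len() {
                // `get` returns None if `offset` is not on a char boundary.
                let c = part.get(offset..)?.chars().next()?;
                return Some((c, c.len_utf8()));
            }
            offset -= part.len();
        }
        None // `pos` is at or past the end
    }
}
```

Because the lookup walks the chunk list without joining the chunks, no allocation is needed; a real rope would likely use a tree or an index to avoid the linear scan.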
1. General Syntax
Search patterns are entered in the format:
pattern/flags
- Pattern: The regex to match.
- Flags: Optional single-character flags modifying the search behavior.
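To make the pattern/flags shape concrete, here is a minimal sketch of splitting such a string. It assumes the split happens at the last / and that input without a / is all pattern; split_rift is a hypothetical helper and does not reproduce the crate's parse_rift_format.

```rust
/// Hypothetical helper, not the crate's `parse_rift_format`.
/// Splits `pattern/flags` at the last `/`; input without `/` is all pattern.
fn split_rift(input: &str) -> Result<(&str, &str), String> {
    // The single-character flags listed in the Flags section.
    const KNOWN_FLAGS: &str = "icmsxgu";
    match input.rfind('/') {
        None => Ok((input, "")),
        Some(idx) => {
            let (pattern, flags) = (&input[..idx], &input[idx + 1..]);
            if let Some(bad) = flags.chars().find(|c| !KNOWN_FLAGS.contains(*c)) {
                return Err(format!("unknown flag: {}", bad));
            }
            Ok((pattern, flags))
        }
    }
}
```

Under these assumptions, a pattern that itself contains / needs a trailing delimiter (for example a/b/ or a/b/i) so that the final segment is read as flags.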
Special Characters
The following characters have special meaning and must be escaped with \ to be matched literally:
. * + ? ^ $ | ( ) [ ] { } \
All other characters match themselves literally.
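When user input should be searched literally, the escaping rule above can be applied mechanically. The following is a minimal sketch; escape_literal is a hypothetical helper, not a function the crate is known to provide.

```rust
/// Hypothetical helper: backslash-escapes the special characters listed above.
fn escape_literal(text: &str) -> String {
    const SPECIAL: [char; 14] = [
        '.', '*', '+', '?', '^', '$', '|', '(', ')', '[', ']', '{', '}', '\\',
    ];
    let mut out = String::with_capacity(text.len());
    for c in text.chars() {
        if SPECIAL.contains(&c) {
            out.push('\\');
        }
        out.push(c);
    }
    out
}
```

For example, escape_literal("a.b") produces a\.b, which matches the three characters a.b and nothing else.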
Note on Dot (.):
By default, . matches any character except newline. Use the s (dotall) flag to make . match newlines.
Case Sensitivity
- Default (Smartcase): Case-insensitive if the pattern contains no uppercase letters. Case-sensitive if the pattern contains any uppercase letters.
- Overrides: Can be explicitly set using the i (ignore-case) or c (case-sensitive) flags.
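The smartcase decision can be sketched as a small function. This is an illustration only: effective_ignore_case is a hypothetical name, and the sketch inspects every character of the pattern, so an escape such as \D also counts as uppercase here, which may differ from how the engine treats escapes.

```rust
/// Hypothetical helper. `explicit` models the i/c flags:
/// Some(true) = i, Some(false) = c, None = no override (smartcase).
fn effective_ignore_case(pattern: &str, explicit: Option<bool>) -> bool {
    match explicit {
        Some(choice) => choice,
        // Smartcase: insensitive unless the pattern has an uppercase letter.
        None => !pattern.chars().any(|c| c.is_uppercase()),
    }
}
```

With no override, hello is searched case-insensitively and Hello case-sensitively; an explicit flag wins in both cases.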
2. Quantifiers
Quantifiers specify how many times the preceding atom (character, group, or character class) should match.
| Quantifier | Meaning | Greedy? | Example |
|---|---|---|---|
| * | 0 or more | Yes | a* matches "", "a", "aa"... |
| + | 1 or more | Yes | a+ matches "a", "aa"... |
| ? | 0 or 1 | Yes (prefers 1) | a? matches "" or "a", preferring "a" |
| {n} | Exactly n | n/a | a{3} matches "aaa" |
| {n,m} | n to m | Yes | a{2,4} matches "aa", "aaa", "aaaa" |
| {n,} | n or more | Yes | a{2,} matches "aa", "aaa"... |
| {,m} | 0 to m | Yes | a{,3} matches "", "a", "aa", "aaa" |
| *? | 0 or more | No | a*? matches minimal characters |
| +? | 1 or more | No | a+? matches minimal characters |
| ?? | 0 or 1 | No | a?? prefers 0 matches |
| {n,m}? | n to m | No | a{2,4}? matches "aa" before "aaa" |
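The difference between greedy and non-greedy quantifiers comes down to the order in which repetition counts are tried. The sketch below shows that ordering for a backtracking matcher; repetition_order is a hypothetical helper written for illustration and is not taken from the crate.

```rust
/// Hypothetical helper: the order in which a backtracking matcher could try
/// repetition counts for `{min,max}`, given `available` repeatable items.
fn repetition_order(min: usize, max: Option<usize>, available: usize, greedy: bool) -> Vec<usize> {
    // An open-ended quantifier is bounded only by the input.
    let upper = max.map_or(available, |m| m.min(available));
    if upper < min {
        return Vec::new(); // not enough input to satisfy the minimum
    }
    let counts: Vec<usize> = (min..=upper).collect();
    if greedy {
        // Greedy: longest candidate first, shrinking on failure.
        counts.into_iter().rev().collect()
    } else {
        // Non-greedy: shortest candidate first, growing on failure.
        counts
    }
}
```

For a{2,4} against five consecutive a characters, the greedy order is 4, 3, 2 and the non-greedy order is 2, 3, 4, which is why a{2,4}? yields "aa" first.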
3. Character Classes
Standard Classes
| Class | Matches |
|---|---|
| \d | Digit [0-9] |
| \D | Non-digit |
| \w | Word character [a-zA-Z0-9_] (ASCII by default) |
| \W | Non-word character |
| \s | Whitespace [ \t\r\n\f\v] |
| \S | Non-whitespace |
Extended Classes
| Class | Matches |
|---|---|
| \l | Lowercase character |
| \L | Non-lowercase character |
| \u | Uppercase character |
| \U | Non-uppercase character |
| \x | Hexadecimal digit |
| \X | Non-hexadecimal digit |
| \o | Octal digit |
| \O | Non-octal digit |
| \h | Head of word character (start of a word) |
| \H | Non-head of word character |
| \p | Punctuation, including !"#$%&'()*+,-./:;<=>?@[\]^_{ |
| \P | Non-punctuation |
| \a | Alphanumeric [a-zA-Z0-9] |
| \A | Non-alphanumeric |
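The class tables follow a regular pattern: a lowercase letter names a set and the uppercase letter negates it. A minimal ASCII-only sketch of that rule follows; class_matches is a hypothetical helper, and \h and \p are left out because their exact membership is not fully specified here.

```rust
/// Hypothetical helper: tests `c` against a class letter (the part after `\`).
/// Returns None for letters this sketch does not model (such as h and p).
fn class_matches(class: char, c: char) -> Option<bool> {
    let positive = match class.to_ascii_lowercase() {
        'd' => c.is_ascii_digit(),
        'w' => c.is_ascii_alphanumeric() || c == '_',
        's' => matches!(c, ' ' | '\t' | '\r' | '\n' | '\x0C' | '\x0B'),
        'l' => c.is_ascii_lowercase(),
        'u' => c.is_ascii_uppercase(),
        'x' => c.is_ascii_hexdigit(),
        'o' => ('0'..='7').contains(&c),
        'a' => c.is_ascii_alphanumeric(),
        _ => return None,
    };
    // An uppercase class letter negates its lowercase counterpart.
    Some(if class.is_ascii_uppercase() { !positive } else { positive })
}
```

Note the one difference between \w and \a in this model: the underscore is a word character but not an alphanumeric one.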
Unicode Support
- Default: \w, \d, \s, \h match ASCII characters only.
- With u flag: These classes include Unicode characters (e.g., \w matches accented characters).
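The effect of the u flag on \w can be approximated with the standard library's character predicates. This is a sketch only: is_word_char is a hypothetical helper, and using char::is_alphanumeric for the Unicode case is an assumption about which characters the engine includes.

```rust
/// Hypothetical helper: `\w` membership with and without the `u` flag.
fn is_word_char(c: char, unicode: bool) -> bool {
    if unicode {
        // Assumption: Unicode mode accepts any alphabetic or numeric character.
        c.is_alphanumeric() || c == '_'
    } else {
        c.is_ascii_alphanumeric() || c == '_'
    }
}
```

In this model, é is rejected by default and accepted once Unicode mode is enabled, while ASCII letters, digits, and the underscore are accepted in both modes.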
Character Sets
Custom character sets and ranges (e.g., [a-z], [^0-9]) are supported.
Note on Escaping in Character Classes:
Inside a character class, escaping rules differ from the rest of the pattern. For example, [\]] matches a literal ], and [a\-z] matches a, -, or z (the escaped hyphen does not form a range).
4. Anchors and Boundaries
Anchors assert a position without matching characters (zero-width).
| Anchor | Meaning |
|---|---|
| ^ | Start of string (or start of line in multiline mode) |
| $ | End of string (or end of line in multiline mode) |
| \< | Start of word |
| \> | End of word |
| \b | Word boundary (matches at \< or \>) |
| \zs | Sets the start of the match (everything before is excluded from the result) |
| \ze | Sets the end of the match (everything after is excluded from the result) |
Position Anchors
These anchors match at a specific position in the buffer. They are zero-width assertions and do not consume characters.
| Anchor | Meaning | Example |
|---|---|---|
| \%nl | Matches anywhere on line n (1-indexed). Not implemented in the parser; clients must handle line-based matching. | \%5lfoo matches "foo" only if it appears on line 5. |
| \%nc | Matches at column n (1-indexed). | \%5cfoo matches "foo" starting at column 5. |
| \%# | Matches at the current cursor position. | \%#foo matches "foo" starting exactly under the cursor. |
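Since line-based matching is left to clients, a client needs a way to map a match offset to a line and column. The following is a minimal sketch, assuming offsets are byte offsets and columns count characters; line_col and on_line are hypothetical helpers.

```rust
/// Hypothetical helper: 1-indexed (line, column) of a byte offset.
/// Columns count characters here, which is an assumption.
fn line_col(text: &str, offset: usize) -> Option<(usize, usize)> {
    let prefix = text.get(..offset)?; // None if out of range or mid-character
    let line = prefix.matches('\n').count() + 1;
    let line_start = prefix.rfind('\n').map_or(0, |i| i + 1);
    let col = prefix[line_start..].chars().count() + 1;
    Some((line, col))
}

/// Checks whether a match starting at `offset` lies on line `n`,
/// the kind of filtering a client could apply for `\%nl`.
fn on_line(text: &str, offset: usize, n: usize) -> bool {
    line_col(text, offset).map_or(false, |(line, _)| line == n)
}
```

A client could run the pattern without the position anchor and keep only the matches for which on_line returns true.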
Word Boundaries Explained
- \<: Matches the position where a word starts (preceded by non-word, followed by word char).
- \>: Matches the position where a word ends (preceded by word char, followed by non-word).
- \b: Matches at either \< or \>.
Word boundaries \< and \> use the same character definition as \w ([a-zA-Z0-9_]). With the u flag, both adapt to Unicode.
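The three boundary definitions translate directly into checks on the characters around a position. This is a minimal ASCII-only sketch, assuming byte offsets and treating the start and end of the text as non-word; all function names are hypothetical.

```rust
/// ASCII word character, matching the default definition of `\w`.
fn is_word(c: char) -> bool {
    c.is_ascii_alphanumeric() || c == '_'
}

/// Chars immediately before and after byte offset `pos` (None at the edges).
fn neighbors(text: &str, pos: usize) -> Option<(Option<char>, Option<char>)> {
    let before = text.get(..pos)?.chars().next_back();
    let after = text.get(pos..)?.chars().next();
    Some((before, after))
}

/// `\<`: previous char is not a word char (or absent), next char is.
fn is_word_start(text: &str, pos: usize) -> bool {
    matches!(neighbors(text, pos), Some((b, Some(a))) if is_word(a) && !b.map_or(false, is_word))
}

/// `\>`: previous char is a word char, next char is not (or absent).
fn is_word_end(text: &str, pos: usize) -> bool {
    matches!(neighbors(text, pos), Some((Some(b), a)) if is_word(b) && !a.map_or(false, is_word))
}

/// `\b`: either kind of boundary.
fn is_word_boundary(text: &str, pos: usize) -> bool {
    is_word_start(text, pos) || is_word_end(text, pos)
}
```

In "hi there", positions 0 and 3 are word starts, positions 2 and 8 are word ends, and position 1 (between h and i) is not a boundary.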
5. Flags
Flags are appended after the pattern delimiter (e.g., pattern/flags).
| Flag | Name | Description |
|---|---|---|
| i | ignore-case | Case-insensitive matching (overrides smartcase). |
| c | case-sensitive | Case-sensitive matching (overrides smartcase). |
| m | multiline | ^ and $ match line boundaries (\n), not just the start/end of the entire buffer. |
| s | dotall | . matches newlines (including end-of-line). |
| x | verbose | Whitespace and # comments in the pattern are ignored. Literal spaces must be escaped (e.g., a backslash followed by the space, or [ ]). |
| g | global | Match all occurrences (used for find-all or replace operations). |
| u | unicode | Enables Unicode support for character classes (\w, \d, etc.). |
Verbose Mode Examples (x flag):
- /foo bar/x matches "foobar" (space is ignored).
- /foo\ bar/x matches "foo bar" (space is escaped).
- /foo[ ]bar/x matches "foo bar" (space in bracket).
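The verbose-mode rules can be pictured as a preprocessing pass over the pattern. The sketch below is one way such a pass might work, assuming a # comment runs to the end of the line; strip_verbose is a hypothetical helper and is not claimed to be how the crate handles the x flag.

```rust
/// Hypothetical helper: removes what verbose mode ignores from a pattern.
fn strip_verbose(pattern: &str) -> String {
    let mut out = String::new();
    let mut chars = pattern.chars();
    let mut in_class = false;
    while let Some(c) = chars.next() {
        match c {
            // Keep an escape and the character it protects (e.g. an escaped space).
            '\\' => {
                out.push(c);
                if let Some(next) = chars.next() {
                    out.push(next);
                }
            }
            '[' if !in_class => {
                in_class = true;
                out.push(c);
            }
            ']' if in_class => {
                in_class = false;
                out.push(c);
            }
            // Inside brackets everything is kept, including spaces and `#`.
            _ if in_class => out.push(c),
            // A comment runs to the end of the line.
            '#' => {
                for skipped in chars.by_ref() {
                    if skipped == '\n' {
                        break;
                    }
                }
            }
            // Unescaped whitespace outside brackets is dropped.
            _ if c.is_whitespace() => {}
            _ => out.push(c),
        }
    }
    out
}
```

Applied to the three examples above, the pass turns foo bar into foobar and leaves foo\ bar and foo[ ]bar unchanged.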
6. Escape Sequences
| Sequence | Matches |
|---|---|
| \n | Newline (LF) |
| \t | Tab |
| \r | Carriage return (CR) |
| \f | Form feed |
| \v | Vertical tab |
| \\ | Literal backslash |
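For reference, the table above maps to the following characters. This is a small sketch with a hypothetical decode_escape helper; note that Rust string literals have no \f or \v escapes, so those two are written as hex escapes.

```rust
/// Hypothetical helper: the character denoted by an escape letter
/// (the part after the backslash), or None if it is not in the table.
fn decode_escape(letter: char) -> Option<char> {
    match letter {
        'n' => Some('\n'),
        't' => Some('\t'),
        'r' => Some('\r'),
        'f' => Some('\x0C'), // form feed
        'v' => Some('\x0B'), // vertical tab
        '\\' => Some('\\'),
        _ => None,
    }
}
```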
7. Groups, Alternation, and Assertions
- Alternation: pattern1|pattern2 matches either pattern1 or pattern2.
- Grouping: (pattern) groups part of the regex and captures it.
- Named Capture: (?<name>pattern) captures the group with a specific name.
- Non-Capturing Group: (?:pattern) groups without capturing.
- Backreferences: \1 through \9 refer to captured groups 1-9. \0 refers to the entire match.
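Backreference numbers depend on which parentheses capture. The sketch below counts capturing groups in a pattern string: plain and named groups count, while any other group beginning with (? does not. It is an illustration with a hypothetical count_capture_groups helper, not the crate's parser, and it treats every parenthesis inside brackets as literal.

```rust
/// Hypothetical helper: counts capturing groups in a pattern string.
fn count_capture_groups(pattern: &str) -> usize {
    let chars: Vec<char> = pattern.chars().collect();
    let (mut count, mut i, mut in_class) = (0, 0, false);
    while i < chars.len() {
        match chars[i] {
            '\\' => i += 1, // skip the escaped character
            '[' if !in_class => in_class = true,
            ']' if in_class => in_class = false,
            '(' if !in_class => {
                let rest = &chars[i + 1..];
                // `(?<name>` captures; `(?<=` and `(?<!` are lookbehinds.
                let named = rest.starts_with(&['?', '<'])
                    && !matches!(rest.get(2).copied(), Some('=') | Some('!'));
                if rest.first().copied() != Some('?') || named {
                    count += 1;
                }
            }
            _ => {}
        }
        i += 1;
    }
    count
}
```

For (\w+)@(\w+) the count is 2, so \1 and \2 are the valid group backreferences for that pattern.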
Lookaround Assertions
Lookarounds assert that what follows or precedes the current position matches a pattern, without including it in the match result.
| Assertion | Type | Meaning |
|---|---|---|
| (?>=foo) | Positive Lookahead | Matches if followed by "foo". |
| (?>!foo) | Negative Lookahead | Matches if not followed by "foo". |
| (?<=foo) | Positive Lookbehind | Matches if preceded by "foo". |
| (?<!foo) | Negative Lookbehind | Matches if not preceded by "foo". |