Reggy
A friendly regular expression dialect for text analytics. Typical regex features are removed or adjusted to make natural language queries easier. Reggy can also search a stream with several patterns at once.
cargo add reggy
API Usage
Use the high-level Pattern struct for simple search.
let mut p = Pattern::new("dogs?").unwrap();
assert_eq!(
p.findall("cat dog dogs cats"),
vec![(4, 7), (8, 12)]
);
Use the Ast struct to transpile to normal regex syntax.[^1]
let ast = Ast::parse(r"do(gg.)?|(!CAT|CAR FAR)").unwrap();
assert_eq!(
ast.to_regex(),
r"(?mi:do(?:gg\.)?|(?-i:CAT|CAR FAR))"
);
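To illustrate the kind of rewriting the transpile step performs, here is a minimal standard-library sketch. It is not Reggy's implementation (which parses the pattern into an Ast first): `to_regex_sketch` and `push_literal` are hypothetical helpers, the set of escaped metacharacters is a guess, and rendering # as \d is an assumption. It only aims to reproduce the output shown above for that one example.

```rust
/// Hypothetical helper, not part of the reggy API: rewrites a Reggy-style
/// pattern into conventional regex syntax with a single pass over the chars.
fn to_regex_sketch(pattern: &str) -> String {
    // Multi-line, case-insensitive wrapper, as in the example output above.
    let mut out = String::from("(?mi:");
    let mut chars = pattern.chars().peekable();
    while let Some(c) = chars.next() {
        match c {
            // A backslash makes the next character a literal.
            '\\' => {
                if let Some(next) = chars.next() {
                    push_literal(&mut out, next);
                }
            }
            // "(!" opens a case-sensitive group, "(" a plain non-capturing one.
            '(' => {
                if chars.peek() == Some(&'!') {
                    chars.next();
                    out.push_str("(?-i:");
                } else {
                    out.push_str("(?:");
                }
            }
            // These operators mean the same thing in both dialects.
            ')' | '?' | '|' => out.push(c),
            // Assumption: a digit placeholder becomes \d.
            '#' => out.push_str(r"\d"),
            other => push_literal(&mut out, other),
        }
    }
    out.push(')');
    out
}

/// Appends `c`, escaping it if it is a regex metacharacter (assumed set).
fn push_literal(out: &mut String, c: char) {
    const META: &str = r"\.+*?()|[]{}^$#";
    if META.contains(c) {
        out.push('\\');
    }
    out.push(c);
}
```

Calling `to_regex_sketch(r"do(gg.)?|(!CAT|CAR FAR)")` yields the same string as the `ast.to_regex()` assertion above. The sketch does no validation, so empty sub-patterns and unbalanced parentheses pass through unchecked.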
Search a Stream
Use the Search struct to search a stream with several patterns at once.
let mut search = Search::compile(&[
r"$#?#?#.##",
r"(John|Jane) Doe",
]).unwrap();
Call Search::next to begin searching. It returns definitely-complete matches immediately.
let jane_match = Match::new(1, (0, 8));
assert_eq!(
search.next("Jane Doe paid John"),
vec![jane_match]
);
Call Search::next again to continue with the same search state. Note that "John Doe" matched across the next boundary, and spans are relative to the start of the stream.
let john_match = Match::new(1, (14, 22));
let money_match_1 = Match::new(0, (23, 29));
let money_match_2 = Match::new(0, (41, 48));
assert_eq!(
search.next(" Doe $45.66 instead of $499.00"),
vec![john_match, money_match_1, money_match_2]
);
Call Search::finish to collect any not-definitely-complete matches once the stream is closed.
assert_eq!(search.finish(), vec![]);
See more in the API docs.
Pattern Language
Reggy is case-insensitive by default. Spaces match any amount of whitespace (i.e. \s+). All the reserved characters mentioned below (\, (, ), ?, |, #, and !) may be escaped with a backslash for a literal match. Patterns are surrounded by implicit Unicode word boundaries (i.e. \b). Empty patterns or subpatterns are not permitted.
Examples
Make a character optional with ?
dogs? matches dog and dogs
Create two or more alternatives with |
dog|cat matches dog and cat
Create a sub-pattern with (...)
the qualit(y|ies) required matches the quality required and the qualities required
the only( one)? around matches the only around and the only one around
Create a case-sensitive sub-pattern with (!...)
United States of America|(!USA) matches USA, not usa
Match digits with #
#.## matches 3.14
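One way to check your reading of these operators is to expand a pattern into every literal string it can spell. The sketch below does that with a small recursive-descent walk using only the standard library. It is illustrative and not how Reggy works internally: `expand`, `alternation`, and `sequence` are hypothetical names, # is left as a placeholder character rather than expanded to digits, and case-insensitivity, whitespace handling, and error reporting are all ignored.

```rust
use std::iter::Peekable;
use std::str::Chars;

/// Hypothetical helper: lists every literal string a pattern can spell.
fn expand(pattern: &str) -> Vec<String> {
    let mut chars = pattern.chars().peekable();
    alternation(&mut chars)
}

/// alternation := sequence ('|' sequence)*
fn alternation(chars: &mut Peekable<Chars>) -> Vec<String> {
    let mut out = sequence(chars);
    while chars.peek() == Some(&'|') {
        chars.next();
        out.extend(sequence(chars));
    }
    out
}

/// sequence := (atom '?'?)*, where an atom is a group, an escape, or a char.
fn sequence(chars: &mut Peekable<Chars>) -> Vec<String> {
    let mut prefixes = vec![String::new()];
    while let Some(&c) = chars.peek() {
        if c == '|' || c == ')' {
            break; // the enclosing alternation or group handles these
        }
        chars.next();
        let mut atom = match c {
            '(' => {
                // "(!" only changes case sensitivity, which is ignored here.
                if chars.peek() == Some(&'!') {
                    chars.next();
                }
                let inner = alternation(chars);
                chars.next(); // consume the closing ')'
                inner
            }
            // An escaped character is taken literally.
            '\\' => vec![chars.next().map(String::from).unwrap_or_default()],
            other => vec![other.to_string()],
        };
        // A trailing '?' adds the option of skipping the atom entirely.
        if chars.peek() == Some(&'?') {
            chars.next();
            atom.insert(0, String::new());
        }
        // Append every variant of the atom to every prefix built so far.
        prefixes = prefixes
            .iter()
            .flat_map(|p| atom.iter().map(move |a| format!("{p}{a}")))
            .collect();
    }
    prefixes
}
```

For instance, `expand("the only( one)? around")` returns the two phrases listed above, in the same order. Because every operator in the dialect is bounded, the expansion is always finite.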
Unicode, Stream, and Multi-Pattern Semantics
Reggy operates on Unicode scalar values. When searching a stream, next step boundaries are treated as zero-width word boundaries.
Definitely-Complete Matches
Reggy follows greedy matching semantics. A pattern may match after one step of a stream, yet may match a longer form depending on the next step. For example, ab|abb will match s.next("ab"), but a subsequent call to s.next("b") would create a longer match, "abb", which should supersede the match "ab".
Search only returns matches once they are definitely complete and cannot be superseded by future next calls. Each pattern computes a maximum length L (this is why unbounded quantifiers are absent from Reggy). Once Reggy has streamed at most L bytes, excluding whitespace, past the start of a match without superseding it, that match is yielded.
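The hold-back rule can be sketched with plain literal alternatives and the standard library alone. This is a simplified model, not Reggy's Search: `LiteralStream` is a hypothetical type, it buffers the whole stream and rescans it on every call, it counts all bytes (whitespace included), and it ignores word boundaries and case. What it does show is a match being withheld until L bytes past its start have arrived or the stream is closed.

```rust
/// Hypothetical stand-in for a streaming searcher over literal alternatives.
struct LiteralStream {
    alternatives: Vec<String>,
    max_len: usize,   // L: the longest possible match, in bytes
    buffer: String,   // everything streamed so far (a sketch, so never trimmed)
    scan_from: usize, // byte offset just past the last yielded match
}

impl LiteralStream {
    fn new(alternatives: &[&str]) -> Self {
        assert!(
            alternatives.iter().all(|a| !a.is_empty()),
            "empty patterns are not permitted"
        );
        let max_len = alternatives.iter().map(|a| a.len()).max().unwrap_or(0);
        Self {
            alternatives: alternatives.iter().map(|a| a.to_string()).collect(),
            max_len,
            buffer: String::new(),
            scan_from: 0,
        }
    }

    /// Feeds one chunk and returns the matches that can no longer grow.
    fn next(&mut self, chunk: &str) -> Vec<(usize, usize)> {
        self.buffer.push_str(chunk);
        self.drain(false)
    }

    /// Closes the stream and returns whatever was still being held back.
    fn finish(&mut self) -> Vec<(usize, usize)> {
        self.drain(true)
    }

    fn drain(&mut self, closed: bool) -> Vec<(usize, usize)> {
        let mut out = Vec::new();
        while let Some((start, end)) = self.leftmost_longest() {
            // Fewer than L bytes past the start: a longer match, or an
            // earlier-starting one, could still appear, so keep waiting.
            if !closed && self.buffer.len() - start < self.max_len {
                break;
            }
            out.push((start, end));
            self.scan_from = end;
        }
        out
    }

    /// Leftmost match at or after `scan_from`, preferring the longest form.
    fn leftmost_longest(&self) -> Option<(usize, usize)> {
        self.buffer
            .char_indices()
            .map(|(i, _)| i)
            .filter(|&i| i >= self.scan_from)
            .find_map(|i| {
                self.alternatives
                    .iter()
                    .filter(|alt| self.buffer[i..].starts_with(alt.as_str()))
                    .map(|alt| i + alt.len())
                    .max()
                    .map(|end| (i, end))
            })
    }
}
```

With the alternatives ab and abb, feeding "ab" yields nothing, and a following "b" yields the span (0, 3). If the stream is closed right after "ab", finish returns (0, 2) instead.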
Implementation
The pattern language is parsed with lalrpop (grammar).
The search routines use a regex_automata::dfa::dense::DFA. Compared to other regex engines, the dense DFA is memory-intensive and slow to construct, but searches are fast. All of Reggy's features are supported by the DFA except Unicode word boundaries, which are handled by the unicode_segmentation crate.
[^1]: The resulting patterns are equivalent, except that Reggy treats any contiguous run of spaces as \s+ and surrounds patterns with implicit word boundaries.