7 releases

Uses new Rust 2021

new 0.2.2 Jan 21, 2022
0.2.1 Dec 27, 2021
0.2.0 Nov 2, 2021
0.1.3 Oct 31, 2021

#104 in Text processing


60 downloads per month
Used in stop-words

MIT/Apache

28KB
217 lines


Regex for Humans

The goal of this crate is simple: give everybody the power of regular expressions without having to learn the complicated syntax. It is inspired by ReadableRegex.jl. This crate is a wrapper around the core Rust regex library.

Example usage

If you want to match a date of the format 2021-10-30, you could use the following code to generate a regex:

use human_regex::{beginning, digit, exactly, text, end};
let regex_string = beginning()
    + exactly(4, digit())
    + text("-")
    + exactly(2, digit())
    + text("-")
    + exactly(2, digit())
    + end();
assert!(regex_string.to_regex().is_match("2014-01-01"));

We can do this another way with slightly less repetition though!

use human_regex::{beginning, digit, exactly, text, end};
let first_regex_string = text("-") + exactly(2, digit());
let second_regex_string = beginning()
    + exactly(4, digit())
    + exactly(2, first_regex_string)
    + end();
assert!(second_regex_string.to_regex().is_match("2014-01-01"));

The to_regex() method returns a standard Rust regex.
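To make the composition above more concrete, here is a minimal, self-contained sketch of how a builder of this kind could assemble a pattern string. The `Pattern` type and the function bodies are hypothetical stand-ins written for illustration; they are not the crate's actual implementation, and the exact pattern text that human_regex emits may differ.

```rust
use std::ops::Add;

/// Hypothetical stand-in for the crate's builder type: it only stores
/// the pattern text assembled so far.
#[derive(Debug, Clone, PartialEq)]
struct Pattern(String);

impl Add for Pattern {
    type Output = Pattern;

    // `a + b` concatenates the two pattern fragments.
    fn add(self, rhs: Pattern) -> Pattern {
        Pattern(self.0 + &rhs.0)
    }
}

fn beginning() -> Pattern {
    Pattern("^".to_string())
}

fn end() -> Pattern {
    Pattern("$".to_string())
}

fn digit() -> Pattern {
    Pattern(r"\d".to_string())
}

/// Matches `s` literally by backslash-escaping common metacharacters.
fn text(s: &str) -> Pattern {
    let mut out = String::new();
    for c in s.chars() {
        if r"\.+*?()|[]{}^$".contains(c) {
            out.push('\\');
        }
        out.push(c);
    }
    Pattern(out)
}

/// Repeats `p` exactly `n` times. The non-capturing group keeps the
/// quantifier attached to the whole fragment, not just its last atom.
fn exactly(n: u32, p: Pattern) -> Pattern {
    Pattern(format!("(?:{}){{{}}}", p.0, n))
}

/// Rebuilds the date example using the stand-in functions.
fn date_pattern() -> Pattern {
    let month_or_day = text("-") + exactly(2, digit());
    beginning() + exactly(4, digit()) + exactly(2, month_or_day) + end()
}
```

Under these assumptions, `date_pattern().0` is the string `^(?:\d){4}(?:-(?:\d){2}){2}$`, which could then be handed to a regex engine.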

Roadmap

The eventual goal of this crate is to support all the syntax in the core Rust regex library through a human-readable API. Here is where we currently stand:

Single Character

| Implemented? | Expression | Description |
| --- | --- | --- |
| `any()` | `.` | any character except new line (includes new line with `s` flag) |
| `digit()` | `\d` | digit (`\p{Nd}`) |
| `non_digit()` | `\D` | not digit |
| | `\pN` | One-letter name Unicode character class |
| | `\p{Greek}` | Unicode character class (general category or script) |
| | `\PN` | Negated one-letter name Unicode character class |
| | `\P{Greek}` | negated Unicode character class (general category or script) |

Character Classes

| Implemented? | Expression | Description |
| --- | --- | --- |
| `or(&['x', 'y', 'z'])` | `[xyz]` | A character class matching either x, y or z (union). |
| | `[^xyz]` | A character class matching any character except x, y and z. |
| | `[a-z]` | A character class matching any character in range a-z. |
| See below | `[[:alpha:]]` | ASCII character class (`[A-Za-z]`) |
| | `[[:^alpha:]]` | Negated ASCII character class (`[^A-Za-z]`) |
| `or()` | `[x[^xyz]]` | Nested/grouping character class (matching any character except y and z) |
| | `[a-y&&xyz]` | Intersection (matching x or y) |
| | `[0-9&&[^4]]` | Subtraction using intersection and negation (matching 0-9 except 4) |
| | `[0-9--4]` | Direct subtraction (matching 0-9 except 4) |
| | `[a-g~~b-h]` | Symmetric difference (matching a and h only) |
| | `[\[\]]` | Escaping in character classes (matching [ or ]) |
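As an illustration of the union and escaping rows, the following sketch builds a bracketed character class from a slice of characters. The helper name `char_class` is hypothetical and is not part of human_regex; it only shows one plausible way to turn `['x', 'y', 'z']` into `[xyz]` while escaping characters that are special inside a class.

```rust
/// Hypothetical helper: builds a character class such as `[xyz]`
/// from a slice of characters.
fn char_class(chars: &[char]) -> String {
    let mut out = String::from("[");
    for &c in chars {
        // `[`, `]`, `\`, `^` and `-` have special meaning inside a class.
        // `&` and `~` are escaped defensively because, when doubled,
        // they form the set operators `&&` and `~~`.
        if matches!(c, '[' | ']' | '\\' | '^' | '-' | '&' | '~') {
            out.push('\\');
        }
        out.push(c);
    }
    out.push(']');
    out
}
```

With this sketch, `char_class(&['x', 'y', 'z'])` yields `[xyz]` and `char_class(&['[', ']'])` yields `[\[\]]`, matching the first and last rows of the table.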

Perl Character Classes

| Implemented? | Expression | Description |
| --- | --- | --- |
| `digit()` | `\d` | digit (`\p{Nd}`) |
| `non_digit()` | `\D` | not digit |
| `whitespace()` | `\s` | whitespace (`\p{White_Space}`) |
| `non_whitespace()` | `\S` | not whitespace |
| `word()` | `\w` | word character (`\p{Alphabetic}` + `\p{M}` + `\d` + `\p{Pc}` + `\p{Join_Control}`) |
| `non_word()` | `\W` | not word character |

ASCII Character Classes

| Implemented? | Expression | Description |
| --- | --- | --- |
| `alphanumeric()` | `[[:alnum:]]` | alphanumeric (`[0-9A-Za-z]`) |
| `alphabetic()` | `[[:alpha:]]` | alphabetic (`[A-Za-z]`) |
| `ascii()` | `[[:ascii:]]` | ASCII (`[\x00-\x7F]`) |
| `blank()` | `[[:blank:]]` | blank (`[\t ]`) |
| `control()` | `[[:cntrl:]]` | control (`[\x00-\x1F\x7F]`) |
| `digit()` | `[[:digit:]]` | digits (`[0-9]`) |
| `graphical()` | `[[:graph:]]` | graphical (`[!-~]`) |
| `lowercase()` | `[[:lower:]]` | lower case (`[a-z]`) |
| `printable()` | `[[:print:]]` | printable (`[ -~]`) |
| `punctuation()` | `[[:punct:]]` | punctuation (`` [!-/:-@\[-`{-~] ``) |
| `whitespace()` | `[[:space:]]` | whitespace (`[\t\n\v\f\r ]`) |
| `uppercase()` | `[[:upper:]]` | upper case (`[A-Z]`) |
| `word()` | `[[:word:]]` | word characters (`[0-9A-Za-z_]`) |
| `hexdigit()` | `[[:xdigit:]]` | hex digit (`[0-9A-Fa-f]`) |

Repetitions

| Implemented? | Expression | Description |
| --- | --- | --- |
| `zero_or_more(x)` | `x*` | zero or more of x (greedy) |
| `one_or_more(x)` | `x+` | one or more of x (greedy) |
| `zero_or_one(x)` | `x?` | zero or one of x (greedy) |
| `zero_or_more(x).lazy()` | `x*?` | zero or more of x (ungreedy/lazy) |
| `one_or_more(x).lazy()` | `x+?` | one or more of x (ungreedy/lazy) |
| `zero_or_one(x).lazy()` | `x??` | zero or one of x (ungreedy/lazy) |
| `between(n, m, x)` | `x{n,m}` | at least n x and at most m x (greedy) |
| `at_least(n, x)` | `x{n,}` | at least n x (greedy) |
| `exactly(n, x)` | `x{n}` | exactly n x |
| `between(n, m, x).lazy()` | `x{n,m}?` | at least n x and at most m x (ungreedy/lazy) |
| `at_least(n, x).lazy()` | `x{n,}?` | at least n x (ungreedy/lazy) |
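The relationship between the greedy quantifiers and their lazy forms can be sketched as follows. This is an illustrative stand-in, not the crate's source: the `Pattern` type and the function bodies are assumptions, chosen so that each call produces the expression shown in the table (with an extra non-capturing group around `x`).

```rust
/// Hypothetical stand-in for a pattern fragment.
#[derive(Debug, Clone, PartialEq)]
struct Pattern(String);

impl Pattern {
    /// Turns the trailing quantifier lazy by appending `?`.
    /// Assumes `self` ends in a quantifier such as `*` or `{n,m}`.
    fn lazy(self) -> Pattern {
        Pattern(self.0 + "?")
    }
}

/// Wraps a fragment in a non-capturing group so a quantifier
/// applies to all of it.
fn group(p: &Pattern) -> String {
    format!("(?:{})", p.0)
}

fn zero_or_more(p: Pattern) -> Pattern {
    Pattern(group(&p) + "*")
}

fn one_or_more(p: Pattern) -> Pattern {
    Pattern(group(&p) + "+")
}

fn zero_or_one(p: Pattern) -> Pattern {
    Pattern(group(&p) + "?")
}

fn at_least(n: u32, p: Pattern) -> Pattern {
    Pattern(format!("{}{{{},}}", group(&p), n))
}

fn between(n: u32, m: u32, p: Pattern) -> Pattern {
    Pattern(format!("{}{{{},{}}}", group(&p), n, m))
}
```

In this sketch, `zero_or_one(x).lazy()` produces `(?:x)??` and `between(2, 3, x).lazy()` produces `(?:x){2,3}?`, so the lazy variant is always the greedy one with a trailing `?`.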

Composites

| Implemented? | Expression | Description |
| --- | --- | --- |
| `+` | `xy` | concatenation (x followed by y) |
| `or()` | `x\|y` | alternation (x or y, prefer x) |

Empty matches

| Implemented? | Expression | Description |
| --- | --- | --- |
| `beginning()` | `^` | the beginning of text (or start-of-line with multi-line mode) |
| `end()` | `$` | the end of text (or end-of-line with multi-line mode) |
| `beginning_of_text()` | `\A` | only the beginning of text (even with multi-line mode enabled) |
| `end_of_text()` | `\z` | only the end of text (even with multi-line mode enabled) |
| `word_boundary()` | `\b` | a Unicode word boundary (`\w` on one side and `\W`, `\A`, or `\z` on other) |
| `non_word_boundary()` | `\B` | not a Unicode word boundary |

Groupings

| Implemented? | Expression | Description |
| --- | --- | --- |
| `capture(exp)` | `(exp)` | numbered capture group (indexed by opening parenthesis) |
| `named_capture(exp, name)` | `(?P<name>exp)` | named (also numbered) capture group |
| Handled implicitly through functional composition | `(?:exp)` | non-capturing group |
| See below | `(?flags)` | set flags within current group |
| See below | `(?flags:exp)` | set flags for exp (non-capturing) |
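The grouping rows can be illustrated with plain string helpers. The functions in this sketch reuse the names from the table but operate on `&str` and return `String`; that signature is an assumption made for a self-contained example and is not the crate's real API. The `one_or_more` helper shows why a non-capturing group can be "handled implicitly": the wrapper adds it so the quantifier covers the whole argument.

```rust
/// Numbered capture group: `(exp)`.
fn capture(exp: &str) -> String {
    format!("({})", exp)
}

/// Named capture group: `(?P<name>exp)`.
fn named_capture(exp: &str, name: &str) -> String {
    format!("(?P<{}>{})", name, exp)
}

/// Quantifier that implicitly wraps its argument in `(?:...)`,
/// so `one_or_more("ab")` repeats "ab" rather than only "b".
fn one_or_more(exp: &str) -> String {
    format!("(?:{})+", exp)
}

/// Hypothetical usage: a year-month pattern with one named and
/// one numbered group.
fn year_month() -> String {
    let year = named_capture(r"\d{4}", "year");
    let month = capture(r"\d{2}");
    format!("^{}-{}$", year, month)
}
```

Here `year_month()` evaluates to `^(?P<year>\d{4})-(\d{2})$`, and `one_or_more("ab")` to `(?:ab)+`.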

Flags

| Implemented? | Expression | Description |
| --- | --- | --- |
| `case_insensitive(exp)` | `i` | case-insensitive: letters match both upper and lower case |
| `multi_line_mode(exp)` | `m` | multi-line mode: `^` and `$` match begin/end of line |
| `dot_matches_newline_too(exp)` | `s` | allow `.` to match `\n` |
| will not be implemented (note 1) | `U` | swap the meaning of `x*` and `x*?` |
| `disable_unicode(exp)` | `u` | Unicode support (enabled by default) |
| will not be implemented (note 2) | `x` | ignore whitespace and allow line comments (starting with `#`) |

  1. With the declarative nature of this library, use of this flag would just obfuscate meaning.
  2. When using human_regex, comments should be added in source code rather than in the regex string.
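One way to picture the flag functions is as wrappers that emit the `(?flags:exp)` form listed under Groupings. The sketch below uses the function names from the table but with a simplified `&str -> String` signature, which is an assumption for illustration only; the crate's real types and output may differ.

```rust
/// Hypothetical helper: scopes `flag` to `exp` with a
/// non-capturing flag group, `(?flag:exp)`.
fn with_flag(flag: &str, exp: &str) -> String {
    format!("(?{}:{})", flag, exp)
}

fn case_insensitive(exp: &str) -> String {
    with_flag("i", exp)
}

fn multi_line_mode(exp: &str) -> String {
    with_flag("m", exp)
}

fn dot_matches_newline_too(exp: &str) -> String {
    with_flag("s", exp)
}

/// Unicode support is on by default, so disabling it negates
/// the flag: `(?-u:exp)`.
fn disable_unicode(exp: &str) -> String {
    with_flag("-u", exp)
}
```

Because each wrapper scopes its flag to the expression it receives, the flag cannot leak into neighbouring fragments when patterns are concatenated.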

Dependencies

~1–1.3MB
~38K SLoC
