#regex #hashtag #emoji

hashtag-regex

A simple regex matching hashtags according to the Unicode spec: http://unicode.org/reports/tr31/#hashtag_identifiers

2 releases

0.1.1 Nov 4, 2022
0.1.0 Nov 4, 2022

MIT license

16KB
220 lines

🎉 Simple Regex for Parsing Complicated Hashtags 🎉

This crate exports a single string constant that can safely be passed to regex::Regex::new(). It is heavily inspired by the JavaScript package of the same name, hashtag-regex. The regex should match any valid hashtag. If you're convinced that this is not the case, please file a bug. :)

use regex::Regex;

let hashtag_re = Regex::new(&hashtag_regex::HASHTAG_RE_STRING).unwrap();
let text = "Hello #🌍, wassup? Check out this #hashtag magic!";
let all_captures: Vec<regex::Captures> = hashtag_re.captures_iter(text).collect();
// [
//     Captures({0: Some(" #🌍"), 1: Some(" "), "hashtag": Some("#🌍"), "hash": Some("#"), "tag": Some("🌍")}),
//     Captures({0: Some(" #hashtag"), 1: Some(" "), "hashtag": Some("#hashtag"), "hash": Some("#"), "tag": Some("hashtag")})
// ]

This crate came about after I tried building my own regex to match all hashtags, crucially including those containing non-ASCII characters and especially emojis. hashtag-regex does exactly that, but not in my language of choice 😊, so I set out to do the same and found emojic, which seemed to provide what I needed. I tried building a naive regex out of all the emojis from that crate and quickly ran into problems. A conversation on the emojic issue tracker, where @Cryptjar was super friendly and pointed out some of the pitfalls of this approach, helped me make progress.

Still, I spent quite some time not understanding why my regexes failed on one case or another, until I sat down to RTFM on Unicode hashtags. Then I read a bit on how Rust handles Unicode and how the regex engine handles character classes, et voilà: the resulting regexes turned out not to be that complicated. In fact, they are taken nearly verbatim from the Unicode definitions. But I figured that someone else might want to do the same thing without spending so much time reading up on it. So here we are. 🙂
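
As a small illustration of why hand-rolled matching gets tricky, here is a sketch that uses only the Rust standard library (neither this crate nor regex). It shows how a string looks at the byte level and at the char (Unicode scalar value) level. The helper names are hypothetical, and the alphanumeric check is an assumed stand-in for a naive approach, not a reconstruction of my original attempt.

```rust
/// Number of UTF-8 bytes in `s` (this is what `str::len` reports).
fn byte_len(s: &str) -> usize {
    s.len()
}

/// Number of Unicode scalar values (`char`s) in `s`.
fn scalar_count(s: &str) -> usize {
    s.chars().count()
}

/// The code points of `s`, formatted as U+XXXX.
fn code_points(s: &str) -> Vec<String> {
    s.chars().map(|c| format!("U+{:04X}", c as u32)).collect()
}

/// Hypothetical naive rule: a character belongs to a tag only if it is
/// alphanumeric. Emoji are symbols, so this rule rejects them.
fn naive_is_tag_char(c: char) -> bool {
    c.is_alphanumeric()
}
```

For "#🌍", byte_len gives 5 and scalar_count gives 2, while the keycap sequence "#\u{FE0F}\u{20E3}" is a single visible symbol made of three chars. naive_is_tag_char accepts 'é' but rejects '🌍', which is roughly the kind of gap the Unicode hashtag definition is meant to close.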

Credit goes to Mathias Bynens for doing this in JavaScript, to @Cryptjar for the help in getting me started, and to @BurntSushi and the other contributors to the regex engine and the rest of Rust's Unicode story for making this so simple.

Dependencies

~10KB