3 releases (breaking)
Version | Released |
---|---|
0.9.0 | Mar 3, 2019 |
0.8.0 | Jan 2, 2019 |
0.7.0 | Feb 7, 2018 |
#340 in Internationalization (i18n)
420,850 downloads per month
Used in 784 crates (8 directly)
110KB, 1.5K SLoC
UNIC — Unicode Text Segmentation Algorithms
This UNIC component implements algorithms from Unicode® Standard Annex #29 - Unicode Text Segmentation, used for detecting the boundaries of text elements such as user-perceived characters (a.k.a. Grapheme Clusters), Words, and Sentences.
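To see why boundary detection is needed at all, it may help to look at how Rust's standard library views text: `str::chars` iterates Unicode scalar values, not user-perceived characters. The following is a minimal sketch using only `std`; it does not call any UNIC API, and the helper name `scalar_count` is hypothetical.

```rust
/// Counts Unicode scalar values (not grapheme clusters) in `s`.
/// Hypothetical helper for illustration only; not part of unic-segment.
fn scalar_count(s: &str) -> usize {
    s.chars().count()
}

fn main() {
    // "a" followed by U+0310 COMBINING CANDRABINDU is displayed as a single
    // user-perceived character, yet it is two scalar values and three bytes.
    let s = "a\u{310}";
    println!("scalars = {}, bytes = {}", scalar_count(s), s.len());
}
```

A grapheme-cluster iterator is expected to report this string as one element, which is the gap this component fills.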
Notes
Initial code for this component is based on unicode-segmentation.
lib.rs:
A component of unic: Unicode and Internationalization Crates for Rust.
This UNIC component implements algorithms from Unicode® Standard Annex #29 - Unicode Text Segmentation, used for detecting the boundaries of text elements such as user-perceived characters (a.k.a. Grapheme Clusters), Words, and Sentences (the last of these is not implemented yet).
Examples
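use unic_segment::{GraphemeIndices, Graphemes, WordBoundIndices, WordBounds, Words};

// Graphemes: each base letter stays together with its combining marks.
assert_eq!(
    Graphemes::new("a\u{310}e\u{301}o\u{308}\u{332}").collect::<Vec<&str>>(),
    &["a\u{310}", "e\u{301}", "o\u{308}\u{332}"]
);

// "\r\n" and each pair of regional indicators (a flag) form one cluster.
assert_eq!(
    Graphemes::new("a\r\nb🇺🇳🇮🇨").collect::<Vec<&str>>(),
    &["a", "\r\n", "b", "🇺🇳", "🇮🇨"]
);

// GraphemeIndices: each cluster is paired with its starting byte offset.
// Escapes are used so the offsets do not depend on how the source text is normalized.
assert_eq!(
    GraphemeIndices::new("a\u{310}e\u{301}o\u{308}\u{332}\r\n").collect::<Vec<(usize, &str)>>(),
    &[(0, "a\u{310}"), (3, "e\u{301}"), (6, "o\u{308}\u{332}"), (11, "\r\n")]
);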
// Words: only the segments accepted by the filter function are yielded.
fn has_alphanumeric(s: &&str) -> bool {
    s.chars().any(|ch| ch.is_alphanumeric())
}

assert_eq!(
    Words::new(
        "The quick (\"brown\") fox can't jump 32.3 feet, right?",
        has_alphanumeric,
    ).collect::<Vec<&str>>(),
    &["The", "quick", "brown", "fox", "can't", "jump", "32.3", "feet", "right"]
);
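// WordBounds: every segment is yielded, including whitespace and punctuation.
// The input has two spaces before "fox"; they are reported as separate segments.
assert_eq!(
    WordBounds::new("The quick (\"brown\")  fox").collect::<Vec<&str>>(),
    &["The", " ", "quick", " ", "(", "\"", "brown", "\"", ")", " ", " ", "fox"]
);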
// WordBoundIndices: each segment is paired with its starting byte offset.
assert_eq!(
    WordBoundIndices::new("Brr, it's 29.3°F!").collect::<Vec<(usize, &str)>>(),
    &[
        (0, "Brr"),
        (3, ","),
        (4, " "),
        (5, "it's"),
        (9, " "),
        (10, "29.3"),
        (14, "°"),
        (16, "F"),
        (17, "!")
    ]
);
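The indices in the `GraphemeIndices` and `WordBoundIndices` examples appear to be UTF-8 byte offsets into the input rather than character counts, which would explain why "°" at offset 14 is followed by "F" at offset 16. The sketch below cross-checks those two offsets using only `std`; the helper name `byte_offset_of` is hypothetical and not part of unic-segment.

```rust
/// Returns the UTF-8 byte offset of the first occurrence of `needle` in `haystack`.
/// Hypothetical helper for illustration only; it simply wraps `str::find`.
fn byte_offset_of(haystack: &str, needle: &str) -> Option<usize> {
    haystack.find(needle)
}

fn main() {
    let s = "Brr, it's 29.3°F!";
    // '°' (U+00B0) takes two bytes in UTF-8, so the segment after it starts
    // two bytes later, not one.
    println!("offset of degree sign: {:?}", byte_offset_of(s, "°"));
    println!("offset of F:           {:?}", byte_offset_of(s, "F"));
    println!("bytes in degree sign:  {}", '°'.len_utf8());
}
```

Because the offsets are byte positions, they can be used directly to slice the original `&str`, for example `&s[14..16]` for the degree sign.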