3 releases (breaking)

0.9.0 Mar 3, 2019
0.8.0 Jan 2, 2019
0.7.0 Feb 7, 2018

#503 in Internationalization (i18n)

Download history: weekly downloads from 2024-12-15 to 2025-03-30, ranging from 51,150 to 266,858 per week

687,056 downloads per month
Used in 890 crates (8 directly)

MIT/Apache

110KB
1.5K SLoC

UNIC — Unicode Text Segmentation Algorithms

A component of unic: Unicode and Internationalization Crates for Rust.

This UNIC component implements algorithms from Unicode® Standard Annex #29 - Unicode Text Segmentation, used for detecting the boundaries of text elements such as user-perceived characters (a.k.a. Grapheme Clusters), Words, and Sentences (the last is not implemented yet).
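To make the notion of a user-perceived character concrete, here is a minimal sketch that uses only the Rust standard library and does not call any UNIC API. The helper names `scalar_count` and `byte_len` are hypothetical, chosen for illustration. It shows that one visible character can span several `char` values and several bytes, which is the gap that grapheme cluster segmentation is meant to close.

```rust
/// Number of Unicode scalar values (`char`s) in `s`.
fn scalar_count(s: &str) -> usize {
    s.chars().count()
}

/// Number of UTF-8 bytes in `s`.
fn byte_len(s: &str) -> usize {
    s.len()
}

fn main() {
    // "e" followed by U+0301 COMBINING ACUTE ACCENT is displayed as a single
    // character, yet it is made of two scalar values.
    let e_acute = "e\u{301}";
    println!("{} chars, {} bytes", scalar_count(e_acute), byte_len(e_acute));

    // A flag emoji is a pair of regional indicator symbols (here U and N).
    let flag = "\u{1F1FA}\u{1F1F3}";
    println!("{} chars, {} bytes", scalar_count(flag), byte_len(flag));
}
```

Counting `char`s or bytes therefore over-counts what a reader sees; iterating over grapheme clusters, as in the examples below, is the usual remedy.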

Examples

use unic_segment::{GraphemeIndices, Graphemes, WordBoundIndices, WordBounds, Words};

assert_eq!(
    Graphemes::new("a\u{310}e\u{301}o\u{308}\u{332}").collect::<Vec<&str>>(),
    &["a\u{310}", "e\u{301}", "o\u{308}\u{332}"]
);

assert_eq!(
    Graphemes::new("a\r\nb🇺🇳🇮🇨").collect::<Vec<&str>>(),
    &["a", "\r\n", "b", "🇺🇳", "🇮🇨"]
);

assert_eq!(
    GraphemeIndices::new("a̐éö̲\r\n").collect::<Vec<(usize, &str)>>(),
    &[(0, "a\u{310}"), (3, "e\u{301}"), (6, "o\u{308}\u{332}"), (11, "\r\n")]
);

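// Predicate passed to `Words::new`: keeps only segments that contain at least
// one alphanumeric character, so whitespace and punctuation are dropped.
fn has_alphanumeric(s: &&str) -> bool {
    s.chars().any(|ch| ch.is_alphanumeric())
}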

assert_eq!(
    Words::new(
        "The quick (\"brown\") fox can't jump 32.3 feet, right?",
        has_alphanumeric,
    ).collect::<Vec<&str>>(),
    &["The", "quick", "brown", "fox", "can't", "jump", "32.3", "feet", "right"]
);

assert_eq!(
    WordBounds::new("The quick (\"brown\")  fox").collect::<Vec<&str>>(),
    &["The", " ", "quick", " ", "(", "\"", "brown", "\"", ")", " ", " ", "fox"]
);

assert_eq!(
    WordBoundIndices::new("Brr, it's 29.3°F!").collect::<Vec<(usize, &str)>>(),
    &[
        (0, "Brr"),
        (3, ","),
        (4, " "),
        (5, "it's"),
        (9, " "),
        (10, "29.3"),
        (14, "°"),
        (16, "F"),
        (17, "!")
    ]
);
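The indices in the `GraphemeIndices` and `WordBoundIndices` examples above are UTF-8 byte offsets into the input, not character counts. The following sketch uses only the standard library and does not call UNIC; `start_offsets` is a hypothetical helper introduced for illustration. It recomputes the offsets from the segment lists shown above by summing byte lengths, which explains why `"F"` starts at 16 rather than 15: the degree sign occupies two bytes.

```rust
/// Given the consecutive segments of a string, pair each segment with the
/// byte offset at which it starts.
fn start_offsets<'a>(segments: &[&'a str]) -> Vec<(usize, &'a str)> {
    let mut offset = 0;
    segments
        .iter()
        .map(|&seg| {
            let start = offset;
            offset += seg.len(); // `str::len` counts UTF-8 bytes, not chars
            (start, seg)
        })
        .collect()
}

fn main() {
    // Segments taken from the `WordBoundIndices` example; "\u{b0}" is the
    // degree sign.
    let words = ["Brr", ",", " ", "it's", " ", "29.3", "\u{b0}", "F", "!"];
    for (start, seg) in start_offsets(&words) {
        println!("{}: {:?}", start, seg);
    }
}
```

Because the offsets are byte positions that fall on `char` boundaries, they can be used directly to slice the original `&str`.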

Notes

Initial code for this component is based on unicode-segmentation.
