#text #unicode #grapheme #word #boundary

unic-segment

UNIC — Unicode Text Segmentation Algorithms

3 releases (breaking)

✓ Uses Rust 2018 edition

0.9.0 Mar 3, 2019
0.8.0 Jan 2, 2019
0.7.0 Feb 7, 2018

#22 in Internationalization (i18n)


5,322 downloads per month
Used in 44 crates (3 directly)

MIT/Apache

106KB
1.5K SLoC

UNIC — Unicode Text Segmentation Algorithms


This UNIC component implements algorithms from Unicode® Standard Annex #29 - Unicode Text Segmentation, which detect the boundaries of text elements such as user-perceived characters (a.k.a. Grapheme Clusters), Words, and Sentences.

Notes

Initial code for this component is based on unicode-segmentation.


lib.rs:

UNIC — Unicode Text Segmentation Algorithms

A component of unic: Unicode and Internationalization Crates for Rust.

This UNIC component implements algorithms from Unicode® Standard Annex #29 - Unicode Text Segmentation, which detect the boundaries of text elements such as user-perceived characters (a.k.a. Grapheme Clusters), Words, and Sentences (sentence segmentation is not implemented yet).

Examples

use unic_segment::{GraphemeIndices, Graphemes, WordBoundIndices, WordBounds, Words};
assert_eq!(
    Graphemes::new("a\u{310}e\u{301}o\u{308}\u{332}").collect::<Vec<&str>>(),
    &["a\u{310}", "e\u{301}", "o\u{308}\u{332}"]
);

assert_eq!(
    Graphemes::new("a\r\nb🇺🇳🇮🇨").collect::<Vec<&str>>(),
    &["a", "\r\n", "b", "🇺🇳", "🇮🇨"]
);

assert_eq!(
    GraphemeIndices::new("a\u{310}e\u{301}o\u{308}\u{332}\r\n").collect::<Vec<(usize, &str)>>(),
    &[(0, "a\u{310}"), (3, "e\u{301}"), (6, "o\u{308}\u{332}"), (11, "\r\n")]
);

fn has_alphanumeric(s: &&str) -> bool {
    s.chars().any(|ch| ch.is_alphanumeric())
}

assert_eq!(
    Words::new(
        "The quick (\"brown\") fox can't jump 32.3 feet, right?",
        has_alphanumeric,
    ).collect::<Vec<&str>>(),
    &["The", "quick", "brown", "fox", "can't", "jump", "32.3", "feet", "right"]
);

assert_eq!(
    WordBounds::new("The quick (\"brown\")  fox").collect::<Vec<&str>>(),
    &["The", " ", "quick", " ", "(", "\"", "brown", "\"", ")", " ", " ", "fox"]
);

assert_eq!(
    WordBoundIndices::new("Brr, it's 29.3°F!").collect::<Vec<(usize, &str)>>(),
    &[
        (0, "Brr"),
        (3, ","),
        (4, " "),
        (5, "it's"),
        (9, " "),
        (10, "29.3"),
        (14, "°"),
        (16, "F"),
        (17, "!")
    ]
);
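
The indices in the `GraphemeIndices` and `WordBoundIndices` examples above are UTF-8 byte offsets into the input, not character counts. The following is a minimal sketch that uses only the standard library to reproduce those offsets from the segments themselves. The helper names `byte_offsets` and `check_documented_offsets` are hypothetical and are not part of `unic_segment`.

```rust
/// Hypothetical helper (not part of unic_segment): returns the starting
/// byte offset of each segment when the segments are laid end to end.
pub fn byte_offsets(segments: &[&str]) -> Vec<usize> {
    let mut offsets = Vec::with_capacity(segments.len());
    let mut pos = 0;
    for seg in segments {
        offsets.push(pos);
        // str::len counts UTF-8 bytes, so combining marks add 2 bytes each here.
        pos += seg.len();
    }
    offsets
}

/// Recomputes the offsets shown in the examples from hard-coded segments.
pub fn check_documented_offsets() {
    // Grapheme clusters from the GraphemeIndices example, written with escapes.
    let graphemes = ["a\u{310}", "e\u{301}", "o\u{308}\u{332}", "\r\n"];
    assert_eq!(byte_offsets(&graphemes), vec![0, 3, 6, 11]);

    // Word-bound segments from the WordBoundIndices example.
    let words = ["Brr", ",", " ", "it's", " ", "29.3", "\u{b0}", "F", "!"];
    assert_eq!(byte_offsets(&words), vec![0, 3, 4, 5, 9, 10, 14, 16, 17]);

    // Each offset is a valid slice boundary in the original string.
    let text = "Brr, it's 29.3\u{b0}F!";
    assert_eq!(&text[14..16], "\u{b0}");
}
```

Because the offsets are byte positions, they can be used directly to slice the original `&str`, as the last assertion does for the degree sign, which occupies bytes 14 and 15.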

Dependencies