#cldr #icu #unicode #localization #i18n #segmenter

icu_segmenter

Unicode line breaking and text segmentation algorithms for text boundaries analysis

3 releases (unstable)

new 1.0.0-alpha1 Aug 5, 2022
0.6.0 May 17, 2022
0.0.1 Apr 29, 2021

#112 in Internationalization (i18n)

94 downloads per month
Used in 5 crates (2 directly)

Unicode-DFS-2016

3.5MB
14K SLoC

icu_segmenter

[Experimental] Segment strings by lines, graphemes, words, and sentences.

This module is published as its own crate (icu_segmenter) and as part of the icu crate. See the latter for more details on the ICU4X project.

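This module contains segmenter implementations for the following rules: line break (Unicode Standard Annex #14, with optional CSS overrides), and grapheme cluster break, word break, and sentence break (Unicode Standard Annex #29).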

Examples

Line Break

Segment a string with default options:

use icu_segmenter::LineBreakSegmenter;

let provider = icu_testdata::get_provider();
let segmenter = LineBreakSegmenter::try_new(&provider).expect("Data exists");

let breakpoints: Vec<usize> = segmenter.segment_str("Hello World").collect();
assert_eq!(&breakpoints, &[6, 11]);
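
The breakpoints are byte offsets into the input, and in the example above the line segmenter does not report offset 0. A minimal sketch, assuming that behaviour, of turning such offsets into string slices. `split_at_breakpoints` is a hypothetical helper for illustration, not part of icu_segmenter, and it uses only the standard library:

```rust
/// Hypothetical helper (not part of icu_segmenter): slices `text` at the
/// given byte offsets. Assumes the offsets are sorted, in range, and fall on
/// UTF-8 character boundaries; a leading 0 is tolerated but not required.
fn split_at_breakpoints<'a>(text: &'a str, breakpoints: &[usize]) -> Vec<&'a str> {
    let mut pieces = Vec::new();
    let mut start = 0;
    for &end in breakpoints {
        // Skip offsets that would produce an empty slice (for example a leading 0).
        if end > start {
            pieces.push(&text[start..end]);
            start = end;
        }
    }
    // Keep any trailing text after the last reported offset.
    if start < text.len() {
        pieces.push(&text[start..]);
    }
    pieces
}
```

Called as split_at_breakpoints("Hello World", &[6, 11]), this yields the two pieces "Hello " and "World". The trailing space stays with the first piece because the break opportunity comes after it.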

Segment a string with CSS option overrides:

use icu_segmenter::{LineBreakOptions, LineBreakRule, LineBreakSegmenter, WordBreakRule};

let mut options = LineBreakOptions::default();
options.line_break_rule = LineBreakRule::Strict;
options.word_break_rule = WordBreakRule::BreakAll;
options.ja_zh = false;
let provider = icu_testdata::get_provider();
let segmenter =
    LineBreakSegmenter::try_new_with_options(&provider, options).expect("Data exists");

let breakpoints: Vec<usize> = segmenter.segment_str("Hello World").collect();
// WordBreakRule::BreakAll allows a break between letters within a word.
assert_eq!(&breakpoints, &[1, 2, 3, 4, 6, 7, 8, 9, 10, 11]);

Segment a Latin1 byte string:

use icu_segmenter::LineBreakSegmenter;

let provider = icu_testdata::get_provider();
let segmenter = LineBreakSegmenter::try_new(&provider).expect("Data exists");

let breakpoints: Vec<usize> = segmenter.segment_latin1(b"Hello World").collect();
assert_eq!(&breakpoints, &[6, 11]);
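
segment_latin1 takes raw bytes rather than a &str. A short sketch of what Latin1 input means, assuming the usual ISO-8859-1 interpretation in which each byte stands for the code point of the same value. `latin1_to_string` is a hypothetical helper, not part of icu_segmenter:

```rust
/// Hypothetical helper (not part of icu_segmenter): decodes Latin1 bytes
/// into a UTF-8 `String`, mapping each byte to the code point of equal value.
fn latin1_to_string(bytes: &[u8]) -> String {
    bytes.iter().map(|&b| char::from(b)).collect()
}
```

For pure ASCII input such as b"Hello World" the Latin1 and UTF-8 encodings are identical, so offsets agree. For b"caf\xE9" the Latin1 input is 4 bytes long while the decoded string is 5 bytes in UTF-8, so under this assumption offsets from segment_latin1 index the byte slice and not the decoded String.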

Grapheme Cluster Break

Segment a string:

use icu_segmenter::GraphemeClusterBreakSegmenter;
let provider = icu_testdata::get_provider();
let segmenter = GraphemeClusterBreakSegmenter::try_new(&provider).expect("Data exists");

let breakpoints: Vec<usize> = segmenter.segment_str("Hello 🗺").collect();
// World Map (U+1F5FA) is encoded in four bytes in UTF-8.
assert_eq!(&breakpoints, &[0, 1, 2, 3, 4, 5, 6, 10]);
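
The jump from 6 to 10 follows from the offsets being UTF-8 byte positions. The sketch below reproduces the same offsets with the standard library alone. `char_boundaries` is a hypothetical helper, and it matches the segmenter output here only because every grapheme cluster in this input is a single code point:

```rust
/// Hypothetical helper (not part of icu_segmenter): returns the byte offset
/// at which each `char` starts, followed by the total byte length.
fn char_boundaries(text: &str) -> Vec<usize> {
    let mut offsets: Vec<usize> = text.char_indices().map(|(i, _)| i).collect();
    offsets.push(text.len());
    offsets
}
```

For "Hello \u{1F5FA}" this gives 0, 1, 2, 3, 4, 5, 6, 10. For "e\u{301}" (e followed by a combining acute accent) it gives 0, 1, 3, whereas the UAX #29 rules keep the combining mark with its base, so a grapheme segmenter would be expected to report only 0 and 3. That difference is the reason to use a grapheme segmenter instead of iterating over chars.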

Segment a Latin1 byte string:

use icu_segmenter::GraphemeClusterBreakSegmenter;
let provider = icu_testdata::get_provider();
let segmenter = GraphemeClusterBreakSegmenter::try_new(&provider).expect("Data exists");

let breakpoints: Vec<usize> = segmenter.segment_latin1(b"Hello World").collect();
assert_eq!(&breakpoints, &[0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11]);

Word Break

Segment a string:

use icu_segmenter::WordBreakSegmenter;
let provider = icu_testdata::get_provider();
let segmenter = WordBreakSegmenter::try_new(&provider).expect("Data exists");

let breakpoints: Vec<usize> = segmenter.segment_str("Hello World").collect();
assert_eq!(&breakpoints, &[0, 5, 6, 11]);
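
Unlike the line break example, the word segmenter output above includes offset 0, and the space between the two words is a segment of its own (5 to 6). A minimal sketch of extracting just the words from such offsets. `non_space_segments` is a hypothetical helper, not part of icu_segmenter:

```rust
/// Hypothetical helper (not part of icu_segmenter): pairs adjacent byte
/// offsets into segments and drops the segments that are only whitespace.
/// Assumes the offsets are sorted and fall on UTF-8 character boundaries.
fn non_space_segments<'a>(text: &'a str, breakpoints: &[usize]) -> Vec<&'a str> {
    breakpoints
        .windows(2)
        .map(|pair| &text[pair[0]..pair[1]])
        .filter(|segment| !segment.trim().is_empty())
        .collect()
}
```

With "Hello World" and the offsets 0, 5, 6, 11 this returns "Hello" and "World". Punctuation segments would still be kept by this filter; dropping them needs an additional check.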

Segment a Latin1 byte string:

use icu_segmenter::WordBreakSegmenter;
let provider = icu_testdata::get_provider();
let segmenter = WordBreakSegmenter::try_new(&provider).expect("Data exists");

let breakpoints: Vec<usize> = segmenter.segment_latin1(b"Hello World").collect();
assert_eq!(&breakpoints, &[0, 5, 6, 11]);

Sentence Break

Segment a string:

use icu_segmenter::SentenceBreakSegmenter;
let provider = icu_testdata::get_provider();
let segmenter = SentenceBreakSegmenter::try_new(&provider).expect("Data exists");

let breakpoints: Vec<usize> = segmenter.segment_str("Hello World").collect();
assert_eq!(&breakpoints, &[0, 11]);

Segment a Latin1 byte string:

use icu_segmenter::SentenceBreakSegmenter;
let provider = icu_testdata::get_provider();
let segmenter = SentenceBreakSegmenter::try_new(&provider).expect("Data exists");

let breakpoints: Vec<usize> = segmenter.segment_latin1(b"Hello World").collect();
assert_eq!(&breakpoints, &[0, 11]);

More Information

For more information on development, authorship, and contributing, please visit the ICU4X home page.

Dependencies