icu_segmenter

Unicode line breaking and text segmentation algorithms for text boundary analysis

5 releases (3 breaking)

1.0.0-alpha1 Aug 5, 2022
0.8.0 Jan 26, 2023
0.7.0 Sep 27, 2022
0.6.0 May 17, 2022
0.0.1 Apr 29, 2021

1,038 downloads per month
Used in 9 crates (4 directly)

Unicode-DFS-2016

3.5MB
18K SLoC

icu_segmenter

[Experimental] Segment strings by lines, graphemes, words, and sentences.

This module is published as its own crate (icu_segmenter) and as part of the icu crate. See the latter for more details on the ICU4X project.

This module contains segmenter implementations for the following rules: line break, grapheme cluster break, word break, and sentence break.

Examples

Line Break

Segment a string with default options:

use icu_segmenter::LineBreakSegmenter;

let provider = icu_testdata::get_provider();
let segmenter = LineBreakSegmenter::try_new(&provider).expect("Data exists");

let breakpoints: Vec<usize> = segmenter.segment_str("Hello World").collect();
assert_eq!(&breakpoints, &[6, 11]);
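
The breakpoints returned above are byte offsets into the input, so they can be used to slice the string into its line-break segments. The following is a minimal sketch using only the standard library; the helper name `split_at_breakpoints` is hypothetical and not part of icu_segmenter, and the offsets are assumed to be ascending and to fall on UTF-8 character boundaries.

```rust
/// Splits `text` at the given byte offsets and returns the pieces in order.
/// Hypothetical helper, not part of icu_segmenter. Assumes the offsets are
/// ascending and fall on UTF-8 character boundaries.
fn split_at_breakpoints<'a>(text: &'a str, breakpoints: &[usize]) -> Vec<&'a str> {
    let mut pieces = Vec::with_capacity(breakpoints.len());
    // The start of the string is an implicit boundary.
    let mut start = 0;
    for &end in breakpoints {
        // Skip empty pieces, e.g. when the list begins with an explicit 0.
        if end > start {
            pieces.push(&text[start..end]);
        }
        start = end;
    }
    pieces
}
```

With the offsets from the example above, `split_at_breakpoints("Hello World", &[6, 11])` yields `["Hello ", "World"]`: the space stays attached to the first segment because the break opportunity comes after it.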

Segment a string with CSS option overrides:

use icu_segmenter::{LineBreakOptions, LineBreakRule, LineBreakSegmenter, WordBreakRule};

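let mut options = LineBreakOptions::default();
// Corresponds to the CSS `line-break` property.
options.line_break_rule = LineBreakRule::Strict;
// Corresponds to the CSS `word-break` property.
options.word_break_rule = WordBreakRule::BreakAll;
// Do not treat the content as Japanese or Chinese text.
options.ja_zh = false;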
let provider = icu_testdata::get_provider();
let segmenter =
    LineBreakSegmenter::try_new_with_options(&provider, options).expect("Data exists");

let breakpoints: Vec<usize> = segmenter.segment_str("Hello World").collect();
assert_eq!(&breakpoints, &[1, 2, 3, 4, 6, 7, 8, 9, 10, 11]);

Segment a Latin1 byte string:

use icu_segmenter::LineBreakSegmenter;

let provider = icu_testdata::get_provider();
let segmenter = LineBreakSegmenter::try_new(&provider).expect("Data exists");

let breakpoints: Vec<usize> = segmenter.segment_latin1(b"Hello World").collect();
assert_eq!(&breakpoints, &[6, 11]);
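
In Latin-1 every character is one byte, so the offsets above count both bytes and characters. Once the input contains bytes at or above 0x80, those offsets no longer match offsets into the same text decoded as a Rust `String`. The sketch below, using only the standard library, shows one way to decode the bytes and map an offset across; both helper names are hypothetical and not part of icu_segmenter.

```rust
/// Decodes Latin-1 bytes into a `String`. Each byte maps to the Unicode
/// code point with the same numeric value (U+0000 to U+00FF).
/// Hypothetical helper, not part of icu_segmenter.
fn latin1_to_string(bytes: &[u8]) -> String {
    bytes.iter().map(|&b| char::from(b)).collect()
}

/// Converts a Latin-1 byte offset into the matching UTF-8 byte offset in
/// the decoded string. Panics if `offset` is greater than `bytes.len()`.
/// Hypothetical helper, not part of icu_segmenter.
fn latin1_offset_to_utf8(bytes: &[u8], offset: usize) -> usize {
    bytes[..offset]
        .iter()
        // Bytes below 0x80 encode as one UTF-8 byte, all others as two.
        .map(|&b| if b < 0x80 { 1usize } else { 2 })
        .sum()
}
```

For pure ASCII input such as `b"Hello World"` the offsets are unchanged. For `b"caf\xe9 au lait"`, Latin-1 offset 5 maps to UTF-8 offset 6, because `é` takes two bytes in UTF-8.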

Grapheme Cluster Break

Segment a string:

use icu_segmenter::GraphemeClusterBreakSegmenter;
let provider = icu_testdata::get_provider();
let segmenter = GraphemeClusterBreakSegmenter::try_new(&provider).expect("Data exists");

let breakpoints: Vec<usize> = segmenter.segment_str("Hello 🗺").collect();
// World Map (U+1F5FA) is encoded in four bytes in UTF-8.
assert_eq!(&breakpoints, &[0, 1, 2, 3, 4, 5, 6, 10]);
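
The jump from 6 to 10 in the result above reflects that breakpoints are UTF-8 byte offsets, not character counts. The standard-library sketch below lists the byte offset where each `char` begins; the helper name `char_boundaries` is hypothetical. It matches the grapheme breakpoints for this particular string only because every grapheme cluster here is a single code point. In general a cluster can span several `char`s, which is the case the segmenter exists to handle.

```rust
/// Returns the UTF-8 byte offset at which each `char` of `text` starts,
/// followed by the end-of-string offset.
/// Hypothetical helper for illustration; this is NOT grapheme segmentation.
fn char_boundaries(text: &str) -> Vec<usize> {
    let mut offsets: Vec<usize> = text.char_indices().map(|(index, _)| index).collect();
    // The end of the string is also a boundary.
    offsets.push(text.len());
    offsets
}
```

For `"Hello 🗺"` this returns `[0, 1, 2, 3, 4, 5, 6, 10]`, since the six ASCII characters take one byte each and U+1F5FA takes four.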

Segment a Latin1 byte string:

use icu_segmenter::GraphemeClusterBreakSegmenter;
let provider = icu_testdata::get_provider();
let segmenter = GraphemeClusterBreakSegmenter::try_new(&provider).expect("Data exists");

let breakpoints: Vec<usize> = segmenter.segment_latin1(b"Hello World").collect();
assert_eq!(&breakpoints, &[0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11]);

Word Break

Segment a string:

use icu_segmenter::WordBreakSegmenter;
let provider = icu_testdata::get_provider();
let segmenter = WordBreakSegmenter::try_new(&provider).expect("Data exists");

let breakpoints: Vec<usize> = segmenter.segment_str("Hello World").collect();
assert_eq!(&breakpoints, &[0, 5, 6, 11]);
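
Unlike the line break example, this result includes the start offset 0, so each pair of consecutive breakpoints delimits one segment, and the space between the words is a segment of its own. A minimal standard-library sketch for extracting only the words follows; the helper name `non_whitespace_segments` is hypothetical, and dropping whitespace-only segments is a simple heuristic chosen for illustration, not the library's own word classification.

```rust
/// Returns the segments delimited by consecutive breakpoints, dropping
/// segments that consist only of whitespace.
/// Hypothetical helper, not part of icu_segmenter. Assumes the offsets are
/// ascending, include the start offset, and fall on UTF-8 boundaries.
fn non_whitespace_segments<'a>(text: &'a str, breakpoints: &[usize]) -> Vec<&'a str> {
    breakpoints
        // Each adjacent pair [start, end] bounds one segment.
        .windows(2)
        .map(|pair| &text[pair[0]..pair[1]])
        .filter(|segment| !segment.trim().is_empty())
        .collect()
}
```

With the offsets from the example above, `non_whitespace_segments("Hello World", &[0, 5, 6, 11])` yields `["Hello", "World"]`.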

Segment a Latin1 byte string:

use icu_segmenter::WordBreakSegmenter;
let provider = icu_testdata::get_provider();
let segmenter = WordBreakSegmenter::try_new(&provider).expect("Data exists");

let breakpoints: Vec<usize> = segmenter.segment_latin1(b"Hello World").collect();
assert_eq!(&breakpoints, &[0, 5, 6, 11]);

Sentence Break

Segment a string:

use icu_segmenter::SentenceBreakSegmenter;
let provider = icu_testdata::get_provider();
let segmenter = SentenceBreakSegmenter::try_new(&provider).expect("Data exists");

let breakpoints: Vec<usize> = segmenter.segment_str("Hello World").collect();
assert_eq!(&breakpoints, &[0, 11]);

Segment a Latin1 byte string:

use icu_segmenter::SentenceBreakSegmenter;
let provider = icu_testdata::get_provider();
let segmenter = SentenceBreakSegmenter::try_new(&provider).expect("Data exists");

let breakpoints: Vec<usize> = segmenter.segment_latin1(b"Hello World").collect();
assert_eq!(&breakpoints, &[0, 11]);

More Information

For more information on development, authorship, contributing, etc., please visit the ICU4X home page.

Dependencies