5 releases (3 breaking)
Version | Released |
---|---|
1.0.0-alpha1 | Aug 5, 2022 |
0.8.0 | Jan 26, 2023 |
0.7.0 | Sep 27, 2022 |
0.6.0 | May 17, 2022 |
0.0.1 | Apr 29, 2021 |
icu_segmenter 
[Experimental] Segment strings by lines, graphemes, words, and sentences.
This module is published as its own crate (icu_segmenter) and as part of the icu crate. See the latter for more details on the ICU4X project.
This module contains segmenter implementations for the following rules:
- Line breaker that is compatible with Unicode Standard Annex #14 and CSS properties.
- Grapheme cluster breaker, word breaker, and sentence breaker that are compatible with Unicode Standard Annex #29.
Examples
Line Break
Segment a string with default options:
use icu_segmenter::LineBreakSegmenter;
let provider = icu_testdata::get_provider();
let segmenter = LineBreakSegmenter::try_new(&provider).expect("Data exists");
let breakpoints: Vec<usize> = segmenter.segment_str("Hello World").collect();
assert_eq!(&breakpoints, &[6, 11]);
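The breakpoints above are byte offsets into the input string, so they can be used to slice it into line-break opportunities. The following is a minimal sketch using only the standard library; split_at_breakpoints is a hypothetical helper, not part of icu_segmenter, and the offsets are hard-coded from the example above rather than computed.

```rust
/// Hypothetical helper (not part of icu_segmenter): splits `text` at the given
/// byte offsets. The offsets are assumed to be sorted, in bounds, and on UTF-8
/// character boundaries; slicing panics otherwise.
fn split_at_breakpoints<'a>(text: &'a str, breakpoints: &[usize]) -> Vec<&'a str> {
    let mut segments = Vec::new();
    let mut start = 0;
    for &end in breakpoints {
        // Skip a leading 0 or a repeated offset so no empty segment is produced.
        if end > start {
            segments.push(&text[start..end]);
            start = end;
        }
    }
    segments
}

fn main() {
    // Offsets copied from the default-options example above.
    let segments = split_at_breakpoints("Hello World", &[6, 11]);
    assert_eq!(segments, ["Hello ", "World"]);
}
```

Text after the last offset is dropped by this sketch, so it relies on the final breakpoint being the end of the string, as it is in the example.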
Segment a string with CSS option overrides:
use icu_segmenter::{LineBreakOptions, LineBreakRule, LineBreakSegmenter, WordBreakRule};
let mut options = LineBreakOptions::default();
options.line_break_rule = LineBreakRule::Strict;
options.word_break_rule = WordBreakRule::BreakAll;
options.ja_zh = false;
let provider = icu_testdata::get_provider();
let segmenter =
    LineBreakSegmenter::try_new_with_options(&provider, options).expect("Data exists");
let breakpoints: Vec<usize> = segmenter.segment_str("Hello World").collect();
assert_eq!(&breakpoints, &[1, 2, 3, 4, 6, 7, 8, 9, 10, 11]);
Segment a Latin1 byte string:
use icu_segmenter::LineBreakSegmenter;
let provider = icu_testdata::get_provider();
let segmenter = LineBreakSegmenter::try_new(&provider).expect("Data exists");
let breakpoints: Vec<usize> = segmenter.segment_latin1(b"Hello World").collect();
assert_eq!(&breakpoints, &[6, 11]);
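For Latin1 input, each byte is exactly one code point (U+0000 to U+00FF), so offsets into a Latin1 byte string do not generally line up with offsets into the same text encoded as UTF-8. The sketch below illustrates the difference with the standard library only; latin1_to_string is a hypothetical helper and the sample text is an assumption chosen for illustration.

```rust
/// Hypothetical helper: decodes Latin1 bytes into a UTF-8 `String`.
/// Casting `u8` to `char` maps each byte to the code point with the same value,
/// which is exactly the Latin1 (ISO-8859-1) mapping.
fn latin1_to_string(bytes: &[u8]) -> String {
    bytes.iter().map(|&b| b as char).collect()
}

fn main() {
    // "café" in Latin1: the final letter is the single byte 0xE9.
    let latin1 = b"caf\xE9";
    let utf8 = latin1_to_string(latin1);

    assert_eq!(utf8, "caf\u{E9}");
    // Four bytes in Latin1, but five in UTF-8, because U+00E9 takes two bytes.
    assert_eq!(latin1.len(), 4);
    assert_eq!(utf8.len(), 5);
}
```

For pure ASCII input such as b"Hello World" the two encodings coincide, which is why the Latin1 and string examples here report the same offsets.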
Grapheme Cluster Break
Segment a string:
use icu_segmenter::GraphemeClusterBreakSegmenter;
let provider = icu_testdata::get_provider();
let segmenter = GraphemeClusterBreakSegmenter::try_new(&provider).expect("Data exists");
let breakpoints: Vec<usize> = segmenter.segment_str("Hello 🗺").collect();
// World Map (U+1F5FA) is encoded in four bytes in UTF-8.
assert_eq!(&breakpoints, &[0, 1, 2, 3, 4, 5, 6, 10]);
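The jump from 6 to 10 in the result above follows from the offsets being UTF-8 byte positions rather than character counts. A small standard-library sketch of that arithmetic follows; char_end_offsets is a hypothetical helper that works per code point, so it only matches grapheme cluster boundaries when every cluster is a single code point, as in this string.

```rust
/// Hypothetical helper: returns the byte offset at which each `char` ends.
/// This is per code point, not per grapheme cluster.
fn char_end_offsets(text: &str) -> Vec<usize> {
    text.char_indices()
        .map(|(start, c)| start + c.len_utf8())
        .collect()
}

fn main() {
    // "Hello " followed by World Map (U+1F5FA).
    let text = "Hello \u{1F5FA}";

    assert_eq!('\u{1F5FA}'.len_utf8(), 4);
    assert_eq!(text.len(), 10); // bytes
    assert_eq!(text.chars().count(), 7); // code points
    assert_eq!(char_end_offsets(text), [1, 2, 3, 4, 5, 6, 10]);
}
```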
Segment a Latin1 byte string:
use icu_segmenter::GraphemeClusterBreakSegmenter;
let provider = icu_testdata::get_provider();
let segmenter = GraphemeClusterBreakSegmenter::try_new(&provider).expect("Data exists");
let breakpoints: Vec<usize> = segmenter.segment_latin1(b"Hello World").collect();
assert_eq!(&breakpoints, &[0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11]);
Word Break
Segment a string:
use icu_segmenter::WordBreakSegmenter;
let provider = icu_testdata::get_provider();
let segmenter = WordBreakSegmenter::try_new(&provider).expect("Data exists");
let breakpoints: Vec<usize> = segmenter.segment_str("Hello World").collect();
assert_eq!(&breakpoints, &[0, 5, 6, 11]);
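Because the word breakpoints include the start offset 0, each adjacent pair of offsets delimits one segment, and the space between the two words is a segment of its own. The sketch below, using only the standard library, pairs up the offsets and drops whitespace-only segments; words is a hypothetical helper and the offsets are hard-coded from the example above.

```rust
/// Hypothetical helper: turns word breakpoints into the non-whitespace segments
/// they delimit. The offsets are assumed to be sorted, in bounds, and on UTF-8
/// character boundaries.
fn words<'a>(text: &'a str, breakpoints: &[usize]) -> Vec<&'a str> {
    breakpoints
        .windows(2) // adjacent (start, end) pairs
        .map(|pair| &text[pair[0]..pair[1]])
        .filter(|segment| !segment.trim().is_empty())
        .collect()
}

fn main() {
    // Offsets copied from the word break example above.
    let found = words("Hello World", &[0, 5, 6, 11]);
    assert_eq!(found, ["Hello", "World"]);
}
```

Filtering on whitespace is a simplification: punctuation segments would still be returned as if they were words.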
Segment a Latin1 byte string:
use icu_segmenter::WordBreakSegmenter;
let provider = icu_testdata::get_provider();
let segmenter = WordBreakSegmenter::try_new(&provider).expect("Data exists");
let breakpoints: Vec<usize> = segmenter.segment_latin1(b"Hello World").collect();
assert_eq!(&breakpoints, &[0, 5, 6, 11]);
Sentence Break
Segment a string:
use icu_segmenter::SentenceBreakSegmenter;
let provider = icu_testdata::get_provider();
let segmenter = SentenceBreakSegmenter::try_new(&provider).expect("Data exists");
let breakpoints: Vec<usize> = segmenter.segment_str("Hello World").collect();
assert_eq!(&breakpoints, &[0, 11]);
Segment a Latin1 byte string:
use icu_segmenter::SentenceBreakSegmenter;
let provider = icu_testdata::get_provider();
let segmenter = SentenceBreakSegmenter::try_new(&provider).expect("Data exists");
let breakpoints: Vec<usize> = segmenter.segment_latin1(b"Hello World").collect();
assert_eq!(&breakpoints, &[0, 11]);
More Information
For more information on development, authorship, contributing, etc., please visit the ICU4X home page.