#byte-offset #range #byte-range #text #char #no-std #double-ended

no-std char-ranges

Iterate chars and their start and end byte positions

3 releases

0.1.2 Apr 1, 2024
0.1.1 Jun 5, 2023
0.1.0 Jun 4, 2023

#356 in Text processing

314 downloads per month
Used in 5 crates (via text-scanner)

MIT license

31KB
448 lines

char-ranges

Similar to the standard library's .char_indices(), but instead of producing only the start byte position, this library implements .char_ranges(), which produces both the start and end byte positions.

Note that using .char_indices() and mapping each returned index i to i..(i + 1) is not guaranteed to produce a valid range, because a UTF-8 encoded character can be anywhere from 1 to 4 bytes long.

Char Bytes Range
'O' 1 0..1
'Ø' 2 0..2
'∈' 3 0..3
'🌏' 4 0..4
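
The pitfall can be reproduced with the standard library alone. The following is a minimal sketch, not this crate's implementation; naive_ranges and std_char_ranges are hypothetical helper names introduced here only for illustration.

```rust
use std::ops::Range;

/// Naive approach: assumes every char is exactly 1 byte wide.
/// (Hypothetical helper, for illustration only.)
fn naive_ranges(text: &str) -> Vec<(Range<usize>, char)> {
    text.char_indices().map(|(i, c)| (i..i + 1, c)).collect()
}

/// Correct approach: extends each start index by the char's
/// actual UTF-8 width, as reported by `char::len_utf8`.
/// (Hypothetical helper, for illustration only.)
fn std_char_ranges(text: &str) -> Vec<(Range<usize>, char)> {
    text.char_indices()
        .map(|(i, c)| (i..i + c.len_utf8(), c))
        .collect()
}
```

For the text "Ø🌏", the naive version yields 0..1 for 'Ø', which does not end on a char boundary, so text.get(0..1) returns None and text[0..1] would panic. The len_utf8 version yields 0..2 and 2..6, both of which slice cleanly.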

The input is assumed to be encoded in UTF-8.

The implementation specializes last(), nth(), next_back(), and nth_back(), so that the lengths of intermediate characters are not needlessly calculated.
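
To illustrate why such a shortcut is possible, here is a small standard-library sketch that finds the range of the last char by decoding backwards from the end of the string, without measuring any of the chars before it. The crate's internals are not shown in this document, so this is only an assumption about the general idea, and last_char_range is a hypothetical name.

```rust
use std::ops::Range;

/// Returns the byte range and value of the last char in `text`,
/// or `None` if `text` is empty.
/// (Hypothetical helper; not part of the char-ranges API.)
fn last_char_range(text: &str) -> Option<(Range<usize>, char)> {
    // `CharIndices` is double-ended, so the last char can be decoded
    // directly from the back. Its range ends where the string ends.
    let (start, c) = text.char_indices().next_back()?;
    Some((start..text.len(), c))
}
```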

Example

use char_ranges::CharRangesExt;

let text = "Hello 🗻∈🌏";

let mut chars = text.char_ranges();
assert_eq!(chars.as_str(), "Hello 🗻∈🌏");

assert_eq!(chars.next(), Some((0..1, 'H'))); // These chars are 1 byte
assert_eq!(chars.next(), Some((1..2, 'e')));
assert_eq!(chars.next(), Some((2..3, 'l')));
assert_eq!(chars.next(), Some((3..4, 'l')));
assert_eq!(chars.next(), Some((4..5, 'o')));
assert_eq!(chars.next(), Some((5..6, ' ')));

// Get the remaining substring
assert_eq!(chars.as_str(), "🗻∈🌏");

assert_eq!(chars.next(), Some((6..10, '🗻'))); // This char is 4 bytes
assert_eq!(chars.next(), Some((10..13, '∈'))); // This char is 3 bytes
assert_eq!(chars.next(), Some((13..17, '🌏'))); // This char is 4 bytes
assert_eq!(chars.next(), None);

DoubleEndedIterator

CharRanges also implements DoubleEndedIterator, making it possible to iterate backwards.

use char_ranges::CharRangesExt;

let text = "ABCDE";

let mut chars = text.char_ranges();
assert_eq!(chars.as_str(), "ABCDE");

assert_eq!(chars.next(), Some((0..1, 'A')));
assert_eq!(chars.next_back(), Some((4..5, 'E')));
assert_eq!(chars.as_str(), "BCD");

assert_eq!(chars.next_back(), Some((3..4, 'D')));
assert_eq!(chars.next(), Some((1..2, 'B')));
assert_eq!(chars.as_str(), "C");

assert_eq!(chars.next(), Some((2..3, 'C')));
assert_eq!(chars.as_str(), "");

assert_eq!(chars.next(), None);

Offset Ranges

If the input text is a substring of some original text, and the produced ranges should be relative to the original text rather than to the substring, then use .char_ranges_offset(offset) or .char_ranges().offset(offset) instead of .char_ranges().

use char_ranges::CharRangesExt;

let text = "Hello 👋 World 🌏";

let start = 11; // Start index of 'W'
let text = &text[start..]; // "World 🌏"

let mut chars = text.char_ranges_offset(start);
// or
// let mut chars = text.char_ranges().offset(start);

assert_eq!(chars.next(), Some((11..12, 'W'))); // These chars are 1 byte
assert_eq!(chars.next(), Some((12..13, 'o')));
assert_eq!(chars.next(), Some((13..14, 'r')));

assert_eq!(chars.next_back(), Some((17..21, '🌏'))); // This char is 4 bytes
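
The point of offsetting is that each produced range can index straight into the original text. The sketch below shows this using only the standard library; it is not the crate's implementation, and offset_ranges is a hypothetical helper name.

```rust
use std::ops::Range;

/// Computes char ranges for `sub`, shifted by `offset` so that they
/// refer to positions in the text that `sub` was sliced from.
/// (Hypothetical helper, for illustration only.)
fn offset_ranges(sub: &str, offset: usize) -> Vec<(Range<usize>, char)> {
    sub.char_indices()
        .map(|(i, c)| {
            let start = offset + i;
            (start..start + c.len_utf8(), c)
        })
        .collect()
}
```

Given original = "Hello 👋 World 🌏" and start = 11, calling offset_ranges(&original[start..], start) yields ranges beginning at 11..12 for 'W', and slicing original with any of the returned ranges gives back the corresponding char.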

No runtime deps