#string #index #chars

no-std char_index

A crate for efficient charwise indexing into a string

3 releases

0.1.2 Nov 8, 2023
0.1.1 Aug 30, 2023
0.1.0 Aug 28, 2023

#1785 in Data structures

43 downloads per month

MPL-2.0 license

20KB
308 lines

char_index

A crate for ~O(1) charwise indexing into a string, without normally using as much memory as a Vec<char>.

Benchmarks

1000 "e"'s with 3 extra non ascii chars, shuffled:

Implementation Memory Use Index 200th codepoint
Vec<char> 4,012B (+ 24B direct) 0.6ns
IndexedChars 2,009B (+ 64B direct) 4ns
String 1,006B (+ 24B direct) 126ns

(data collected using benches/char_index.rs)

no_std

This crate is fully no_std, however it does rely on alloc.
A std feature may be added at a later date, but it is currently unknown what that would include.

License

This crate is licensed under MPL-2.0, this is a weak copyleft license intended to keep any modifications to the core library open source, without affecting other requirements greatly. This is not legal advice.


lib.rs:

char_index

A crate that provides a tradeoff of space efficiency and apparent O(1) charwise indexing.

To get started, create a new IndexedChars or OwnedIndexedChars instance.

How it Works

IndexedChars works by allocating a Vec<u8> under the hood that stores char offsets from its own index in the Vec to the location of the char in the backing string. This takes advantage of the fact that utf8's minimum size is at least 1 byte per character.

This has a tradeoff where after 255 bytes worth of string data that is not a single byte or, the sum of all the characters that exceed one bytes lengths minus the first byte, has been added to the string, it can no longer store data, as it would overflow u8. This is where the rollovers structure takes effect, it is a Vec<usize> whose sole purpose is to store indexes where a rollover has occurred, this can then be trivially binary searched to find the current index we are working with, and then apply the length of the list at that point multiplied by u8::MAX to find the true offset of the character.

In this way, we achieve what behaves as an O(1) char lookup (technically worst case O(log n)) for most strings, while saving memory over Vec<char> (sans the case where the string is only made up of 4 byte characters, which acts as the worst case for time complexity too).

Additionally, as a niche optimization, if the string contains only ascii (all offsets 0); it will simply not allocate any extra memory, and gain perfect O(1) lookup.

No runtime deps