5 releases
0.1.4 | Oct 26, 2024 |
---|---|
0.1.3 | Aug 12, 2024 |
0.1.2 | Nov 8, 2023 |
0.1.1 | Aug 30, 2023 |
0.1.0 | Aug 28, 2023 |
#1492 in Data structures
244 downloads per month
Used in 3 crates
(via synoptic)
21KB
327 lines
char_index
A crate for ~O(1) charwise indexing into a string, without normally using as much memory as a Vec<char>
.
Benchmarks
1000 "e"'s with 3 extra non ascii chars, shuffled:
Implementation | Memory Use | Index 200th codepoint |
---|---|---|
Vec<char> |
4,012B (+ 24B direct) | 0.6ns |
IndexedChars |
2,009B (+ 64B direct) | 4ns |
String |
1,006B (+ 24B direct) | 126ns |
(data collected using benches/char_index.rs
)
no_std
This crate is fully no_std
, however it does rely on alloc.
A std feature may be added at a later date, but it is currently unknown what that would include.
License
This crate is licensed under MPL-2.0, this is a weak copyleft license intended to keep any modifications to the core library open source, without affecting other requirements greatly. This is not legal advice.
lib.rs
:
char_index
A crate that provides a tradeoff of space efficiency and apparent O(1) charwise indexing.
To get started, create a new IndexedChars
or OwnedIndexedChars
instance.
How it Works
IndexedChars
works by allocating a Vec<u8>
under the hood that stores char
offsets from its own index in the Vec to the location of the char in the backing string.
This takes advantage of the fact that utf8's minimum size is at least 1 byte per character.
This has a tradeoff where after 255 bytes worth of string data that is not a single byte or,
the sum of all the characters that exceed one bytes lengths minus the first byte, has been added
to the string, it can no longer store data, as it would overflow u8.
This is where the rollovers structure takes effect, it is a Vec<usize>
whose sole purpose is to
store indexes where a rollover has occurred, this can then be trivially binary searched to find the
current index we are working with, and then apply the length of the list at that point multiplied by u8::MAX
to find the true offset of the character.
In this way, we achieve what behaves as an O(1) char lookup (technically worst case O(log n)) for most strings, while saving memory over Vec<char>
(sans the case where the string is only made up of 4 byte characters, which acts as the worst case for time complexity too).
Additionally, as a niche optimization, if the string contains only ascii (all offsets 0); it will simply not allocate any extra memory, and gain perfect O(1) lookup.