1 stable release: 1.0.0 (Apr 18, 2025)
Enumerated Latin
Enumerated Latin is a crate that maps strings made of the 26 letters `a` to `z` or `A` to `Z` (case insensitive) to a continuous space of integers by treating the text like a base-26 encoded number plus an end marker.
Example:
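```rust
use enumerated_latin::EnumeratedLatinEncode;
use enumerated_latin::EnumeratedLatinDecode;

// Encoding is case insensitive, so "Example" and "example" give the same value.
let encoded: u64 = "Example".enumerated_latin_encode().unwrap();
assert_eq!(encoded, 9540966270);

// Decoding cannot recover the original case; here lowercase is requested.
let decoded_again = encoded.enumerated_latin_decode_lowercase().unwrap();
assert_eq!(decoded_again, "example".to_string());
```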
Intended use
The intended use is to generate numeric identifiers for short pieces of text while still allowing comparisons against ranges in fixed-length scenarios.
This arises, for example, when working with ISO codes for languages, scripts, countries, etc. Preserving the order within the same length helps with efficiently checking against private-use and similar ranges.
The intended area of use is the backend of applications, where the difference between a string and a number actually matters.
For frontends, it is recommended to prefer readability over performance whenever possible.
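To illustrate such a range check, here is a minimal standalone sketch. It deliberately does not use the crate: `encode_sketch` and `is_private_use_language` are hypothetical helpers that re-implement the scheme from "How the encoding works" for `u32`, so the crate's actual behaviour may differ in details such as error handling. The range `qaa` to `qtz` is the block that ISO 639-2 reserves for local use.

```rust
/// Hypothetical re-implementation of the encoding, for illustration only.
/// Returns None for anything other than ASCII letters or on overflow.
fn encode_sketch(text: &str) -> Option<u32> {
    let mut value: u32 = 1; // the leading marker `b`
    for c in text.chars() {
        if !c.is_ascii_alphabetic() {
            return None;
        }
        // Case insensitive: 'a'/'A' -> 0 ... 'z'/'Z' -> 25.
        let digit = u32::from(c.to_ascii_lowercase() as u8 - b'a');
        value = value.checked_mul(26)?.checked_add(digit)?;
    }
    Some(value)
}

/// True if `code` lies in the ISO 639-2 local-use range qaa..=qtz.
fn is_private_use_language(code: &str) -> bool {
    let (Some(first), Some(last)) = (encode_sketch("qaa"), encode_sketch("qtz")) else {
        return false;
    };
    // A single integer range check replaces a string comparison.
    encode_sketch(code).is_some_and(|v| (first..=last).contains(&v))
}

fn main() {
    for code in ["qaa", "QTZ", "deu", "qua", "qt", "qaaa"] {
        println!("{code}: {}", is_private_use_language(code));
    }
}
```

Because values for different lengths never overlap, two-letter and four-letter inputs such as `qt` or `qaaa` cannot fall into the three-letter range by accident.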
How the encoding works
In short: the string is prefixed with a `b` and then parsed as a most-significant-first (same order as everyday numbers) base-26 number, where `a` maps to `0` and `z` to `25`.
Example: `az` would be encoded as `baz`: (26^2)*1 + (26^1)*0 + (26^0)*25 = 701
```rust
use enumerated_latin::EnumeratedLatinEncode;

// "az" is read as the base-26 number "baz": 676 + 0 + 25.
assert_eq!("az".enumerated_latin_encode(), Ok(701_u16));
```
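The arithmetic can also be spelled out by hand. The following is a minimal sketch of the scheme as described here, not the crate's implementation; the function name `encode_sketch` and the use of `Option` for errors are assumptions made for this illustration.

```rust
/// Standalone sketch of the described scheme (hypothetical helper, not the crate's API).
/// Returns None for characters outside a-z/A-Z or if the value overflows u64.
fn encode_sketch(text: &str) -> Option<u64> {
    // Start with the marker digit `b`, which has the value 1.
    let mut value: u64 = 1;
    for c in text.chars() {
        if !c.is_ascii_alphabetic() {
            return None;
        }
        // Case insensitive: 'a'/'A' -> 0 ... 'z'/'Z' -> 25.
        let digit = u64::from(c.to_ascii_lowercase() as u8 - b'a');
        // Most significant first: shift the previous digits up, then add the new one.
        value = value.checked_mul(26)?.checked_add(digit)?;
    }
    Some(value)
}

fn main() {
    assert_eq!(encode_sketch("az"), Some(701));
    assert_eq!(encode_sketch("Example"), Some(9_540_966_270));
    assert_eq!(encode_sketch("a1"), None);
}
```

Both values agree with the crate examples shown in this document.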
The `b` at the start is needed because, with `a` mapping to zero, leading `a`s act like leading `0`s in everyday base-10 numbers: there is no way to tell from the numeric value how many of them were present. The leading `b` ensures that one can always deduce the original length from the numeric value.
The everyday base-10 equivalent to prepending the `b` would be prepending a `1`, i.e. `000` to `1000` and `00` to `100`.
This results in the following facts about the encoding:
- An empty string encodes to `1`
- The first valid non-empty string is `a` with a value of `26`
- Within the same length, the encoded strings sort alphabetically
- Longer string means bigger number
- There is a gap in the encoding space between different length strings
- Assuming a length `l`, the first value is `26^l` and the last one is `((26^l)*2)-1`.
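These facts can be checked with a small standalone sketch. The helpers `range_for_len` and `decode_sketch` are hypothetical and written only for this illustration; they follow the description in this section and are not taken from the crate.

```rust
/// First and last encoded value for strings of length `len` (hypothetical helper).
fn range_for_len(len: u32) -> Option<(u64, u64)> {
    let first = 26u64.checked_pow(len)?; // `b` followed by `len` times `a`
    let last = first.checked_mul(2)?.checked_sub(1)?; // `b` followed by `len` times `z`
    Some((first, last))
}

/// Decodes a value back to lowercase text; None if it lies in a gap (hypothetical helper).
fn decode_sketch(mut value: u64) -> Option<String> {
    let mut letters = Vec::new();
    // Peel off base-26 digits, least significant first.
    while value >= 26 {
        letters.push((b'a' + (value % 26) as u8) as char);
        value /= 26;
    }
    // What remains must be the marker `b` (1); anything else is not a valid encoding.
    if value != 1 {
        return None;
    }
    Some(letters.into_iter().rev().collect())
}

fn main() {
    assert_eq!(range_for_len(0), Some((1, 1))); // only the empty string
    assert_eq!(range_for_len(1), Some((26, 51))); // a ..= z
    assert_eq!(range_for_len(2), Some((676, 1351))); // aa ..= zz
    assert_eq!(decode_sketch(51).as_deref(), Some("z"));
    assert_eq!(decode_sketch(52), None); // gap between length 1 and length 2
    assert_eq!(decode_sketch(676).as_deref(), Some("aa"));
}
```

The number of digits peeled off before reaching the marker is the original length, which is how the length can be deduced from the value alone.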
Encoding targets
Encoding each letter takes roughly 5 bits of information (log2(26) is about 4.7), plus one bit for the marker. You can use this to roughly estimate which data type you will need.
Valid encoding target types are:
Type | Supported length |
---|---|
`u8` | 1 |
`i16` | 2 |
`u16` | 3 |
`i32` | 6 |
`u32` | 6 |
`i64` | 13 |
`u64` | 13 |
`i128` | 26 |
`u128` | 26 |
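For a rough cross-check of the estimate, the number of bits needed for a given length can be computed from the largest value of that length, `((26^l)*2)-1`. The helper `bits_needed` below is a hypothetical sketch of that arithmetic only; the table above remains the reference for what the crate actually supports.

```rust
/// Bits required to hold the largest encoded value for strings of length `len`.
/// Hypothetical helper; returns None if the intermediate value overflows u128.
fn bits_needed(len: u32) -> Option<u32> {
    let last = 26u128.checked_pow(len)?.checked_mul(2)?.checked_sub(1)?;
    // Position of the highest set bit, counted from 1.
    Some(128 - last.leading_zeros())
}

fn main() {
    for len in [1, 2, 3, 6, 13] {
        println!("length {len}: {:?} bits", bits_needed(len));
    }
}
```

Remember that signed types have one value bit fewer than their width, which is why `i16` holds shorter strings than `u16` in the table.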
Licensing
`enumerated_latin` is licensed as `LGPL-3.0-only` and is REUSE 3.3 compliant.
When contributing, add yourself as a copyright holder to the files you modified.