1 stable release

1.0.0	Apr 18, 2025

LGPL-3.0-only

Enumerated Latin

Enumerated Latin is a crate that maps strings made of the 26 letters a to z or A to Z (case-insensitive) to integers by treating the text as a base-26 number with a leading marker digit. Strings of the same length map to one contiguous range of integers.

Example:

use enumerated_latin::EnumeratedLatinEncode;
use enumerated_latin::EnumeratedLatinDecode;

let encoded: u64 = "Example".enumerated_latin_encode().unwrap();

assert_eq!(encoded, 9540966270);

let decoded_again = encoded.enumerated_latin_decode_lowercase().unwrap();

assert_eq!(decoded_again, "example".to_string());

Intended use

The intended use is to generate numeric identifiers for short pieces of text while still allowing comparisons against ranges when all strings have the same fixed length.

This arises, for example, when working with ISO codes for languages, scripts, countries, etc. Preserving the order within the same length helps with efficiently checking against private-use and similar ranges.
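As an illustration of such a range check, the sketch below tests whether a language code falls into the qaa to qtz block that ISO 639-2 reserves for local use. It is self-contained and therefore uses a hypothetical helper, `encode_code`, that follows the scheme described under "How the encoding works"; `is_local_use_language` is likewise an invented name. In real code the crate's `enumerated_latin_encode` would presumably take the helper's place.

```rust
/// Hypothetical stand-in for the crate's encoder (not the crate's code):
/// a leading marker digit 1, then base-26 digits with a = 0 .. z = 25.
fn encode_code(code: &str) -> Option<u32> {
    let mut value: u32 = 1;
    for byte in code.bytes() {
        let lower = byte.to_ascii_lowercase();
        if !lower.is_ascii_lowercase() {
            return None; // anything outside a-z / A-Z is rejected
        }
        value = value.checked_mul(26)?.checked_add(u32::from(lower - b'a'))?;
    }
    Some(value)
}

/// True if `code` lies in the qaa..=qtz range (reserved for local use in ISO 639-2).
fn is_local_use_language(code: &str) -> bool {
    match (encode_code("qaa"), encode_code("qtz"), encode_code(code)) {
        // A single integer range comparison replaces a string comparison.
        (Some(low), Some(high), Some(value)) => (low..=high).contains(&value),
        _ => false,
    }
}
```

No separate length check is needed in this sketch: codes of any other length encode to values outside the interval used by three-letter codes, so `is_local_use_language("qa")` is false while `is_local_use_language("QAB")` is true.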

The intended area of use is the backend of applications, where the difference between a string and a number actually matters.

For frontends it is recommended to prefer readability over performance whenever possible.

How the encoding works

In short: the string is prefixed with a b and then parsed as a base-26 number, most significant digit first (the same order as everyday numbers), where a maps to 0 and z to 25.

Example: az would be encoded as baz: (26^2)*1 + (26^1)*0 + (26^0)*25 = 701

use enumerated_latin::EnumeratedLatinEncode;

assert_eq!("az".enumerated_latin_encode(), Ok(701_u16));
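To make the arithmetic concrete, here is a minimal sketch of the scheme as described above. `encode_sketch` is a hypothetical, self-contained re-implementation for illustration only; it is not the crate's code, and the crate's actual error handling may differ.

```rust
/// Illustrative re-implementation of the described scheme.
/// `encode_sketch` is a hypothetical helper, not part of the crate's API.
fn encode_sketch(text: &str) -> Option<u64> {
    // The implicit leading `b` contributes the most significant digit, 1.
    let mut value: u64 = 1;
    for byte in text.bytes() {
        // Case-insensitive: fold to lowercase, then accept only a-z.
        let lower = byte.to_ascii_lowercase();
        if !lower.is_ascii_lowercase() {
            return None;
        }
        let digit = u64::from(lower - b'a'); // a = 0 .. z = 25
        // Shift by one base-26 place and add the new digit,
        // returning None if the text is too long for a u64.
        value = value.checked_mul(26)?.checked_add(digit)?;
    }
    Some(value)
}
```

With this sketch, `encode_sketch("az")` yields `Some(701)` and `encode_sketch("Example")` yields `Some(9540966270)`, matching the values shown in the examples above.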

The b at the start is needed because a maps to zero: leading as act like leading 0s in everyday base-10 numbers, so there is no way to tell from the numeric value how many of them were present. The leading b ensures that the original length can always be deduced from the numeric value.

The everyday base-10 equivalent of prepending the b is prepending a 1, i.e. 000 becomes 1000 and 00 becomes 100.
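The role of the marker is easiest to see when decoding. The sketch below peels off base-26 digits until only the marker digit 1 is left, which is exactly how the original length is recovered. `decode_sketch` is a hypothetical helper written for this illustration, not the crate's implementation.

```rust
/// Illustrative decoder for the described scheme (lowercase output).
/// `decode_sketch` is a hypothetical helper, not part of the crate's API.
fn decode_sketch(mut value: u64) -> Option<String> {
    let mut letters = Vec::new();
    // Peel off the least significant base-26 digit until only the
    // marker remains; leading `a`s (digit 0) survive because the
    // marker sits above them.
    while value > 1 {
        letters.push(b'a' + (value % 26) as u8);
        value /= 26;
    }
    // A valid encoding always ends on the marker digit 1. Zero, or a
    // most significant digit other than 1, leaves 0 here instead.
    if value != 1 {
        return None;
    }
    letters.reverse(); // digits were collected least significant first
    String::from_utf8(letters).ok()
}
```

For example, `decode_sketch(26)` gives "a" and `decode_sketch(676)` gives "aa": without the marker both strings would collapse to 0. Values such as 52 have no marker in the top position and are rejected.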

This results in the following facts about the encoding:

An empty string encodes to 1
The first valid non-empty string is a, with a value of 26
Within the same length, the encoded strings sort alphabetically
A longer string always encodes to a bigger number
There is a gap in the encoding space between strings of different lengths
For a length l, the first value is 26^l and the last one is (26^l)*2 - 1
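The facts above can be checked with a few lines of arithmetic. The following sketch computes the value range for a given length and, in the other direction, the length a value belongs to. Both `length_range` and `encoded_length` are hypothetical helpers invented for this illustration.

```rust
/// First and last encoded value for strings of length `len`:
/// 26^len ..= 2 * 26^len - 1. Returns None if the range overflows u64.
fn length_range(len: u32) -> Option<(u64, u64)> {
    let first = 26u64.checked_pow(len)?;
    let last = first.checked_mul(2)? - 1;
    Some((first, last))
}

/// Length of the string that `value` encodes, or None if `value` is 0,
/// lies in a gap between two lengths, or is beyond the u64 ranges.
fn encoded_length(value: u64) -> Option<u32> {
    let mut len = 0;
    loop {
        let (first, last) = length_range(len)?;
        if value < first {
            return None; // below this range but above the previous one
        }
        if value <= last {
            return Some(len);
        }
        len += 1;
    }
}
```

Length 1 covers 26 to 51 and length 2 covers 676 to 1351, so the values 52 to 675 are never produced; `encoded_length(701)` is `Some(2)` while `encoded_length(52)` is `None`.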

Encoding targets

Encoding each letter takes roughly 5 bits of information (log2(26) is about 4.7), plus one bit for the leading marker. You can use this to roughly estimate which datatype you'll need.

Valid encoding target types are:

Type	Maximum supported length
`u8`	1
`i16`	2
`u16`	3
`i32`	6
`u32`	6
`i64`	13
`u64`	13
`i128`	26
`u128`	26
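The estimate can also be made exact with a short calculation: a length l fits a type if the largest value of that length, 2 * 26^l - 1, does not exceed the type's maximum. The sketch below does this with a hypothetical helper, `max_length_for`. It is a purely arithmetic bound and the table above remains the reference for what the crate accepts; for `u128` the arithmetic bound comes out at 27, one more than the listed 26.

```rust
/// Largest string length whose biggest encoded value, 2 * 26^len - 1,
/// is still <= `max_value`. Hypothetical helper for illustration only.
fn max_length_for(max_value: u128) -> u32 {
    let mut len = 0;
    let mut first: u128 = 1; // 26^len, the first value of the current length
    loop {
        // First value of the next length; stop if even that overflows u128.
        let next_first = match first.checked_mul(26) {
            Some(v) => v,
            None => return len,
        };
        // Last value of the next length: 2 * 26^(len + 1) - 1.
        let next_last = match next_first.checked_mul(2) {
            Some(v) => v - 1,
            None => return len,
        };
        if next_last > max_value {
            return len;
        }
        first = next_first;
        len += 1;
    }
}
```

For instance, `max_length_for(u16::MAX as u128)` is 3 and `max_length_for(i16::MAX as u128)` is 2, because the largest three-letter value, 35151, exceeds `i16::MAX` but not `u16::MAX`.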

Licensing

enumerated_latin is licensed under LGPL-3.0-only and is REUSE 3.3 compliant.

When contributing, add yourself as a copyright holder to the files you modified.