#unsigned-integer #64-bit #byte #variables #length #numbers

no-std varu64

A variable length encoding for unsigned 64 bit integers

12 releases (6 breaking)

0.7.0 Oct 19, 2021
0.6.2 Nov 4, 2020
0.6.1 Oct 31, 2018
0.5.0 Oct 16, 2018
0.1.0 Mar 13, 2018

#1648 in Encoding

Download history 44/week @ 2023-11-27 41/week @ 2023-12-04 20/week @ 2023-12-11 48/week @ 2023-12-18 23/week @ 2023-12-25 44/week @ 2024-01-08 58/week @ 2024-01-15 78/week @ 2024-01-22 15/week @ 2024-01-29 21/week @ 2024-02-05 22/week @ 2024-02-12 22/week @ 2024-02-19 54/week @ 2024-02-26 52/week @ 2024-03-04 43/week @ 2024-03-11

175 downloads per month
Used in 7 crates (5 directly)

AGPL-3.0

34KB
693 lines

VarU64

A simple variable-length encoding for unsigned 64 bit integers.

Specification

To decode a varu64, look at the first byte. If its value is below 248, the value itself is the encoded number. Else, the first byte determines the further length of the encoding:

first byte number of additional bytes
248 1
249 2
250 3
251 4
252 5
253 6
254 7
255 8

Following the first byte are length many bytes. These bytes are the big-endian representation of the encoded number.

Of all possible representations for a number that this scheme admits, the shortest one is its unique, valid encoding. Decoders must indicate an error if a value uses an encoding that is longer than necessary.

Remarks/Properties

Whether the first byte signifies a length can be checked efficiently by testing whether the first 5 bits are set to 1. In that case, the length itself is 1 plus the value of the last three bits. In other words, additional_bytes = (first_byte & 0xf8 == 0xf8) ? (first_byte & 7) + 1 : 0.

The length of an encoded value can be determined by solely looking at the first byte.

The canonical encoding means that if there are two or more additional bytes, then there may be no leading zero bytes. If there is just one additional byte, then the additional byte must be >= 248. That means 0-247 is encoded in one byte, but 248-255 is encoded in two bytes.

Due to the canonicity requirement of only allowing the shortest possible encoding, there is a bijection between unsigned 64 bit integers and encodings.

The cost for the simplicity and canonicity of this format are a (somewhat) large number of unused byte strings. On the plus side, these can be used as extension points.

Related work: This has been inspired by the issues in the multiformats varint repository, in particular issues #8 and #12.

License

The specification (this file) is licensed as CC BY-SA 4.0, the code in this repository is licensed under AGPL-3.0

Dependencies

~1.5MB
~35K SLoC