3 stable releases

1.2.1	Jun 5, 2024
1.2.0	Jun 1, 2024
1.1.0	May 26, 2024
1.0.2	~~May 21, 2024~~
1.0.0	~~May 9, 2024~~

#864 in Text processing

MIT license

150KB
2.5K SLoC

CESU-8 Encoder & Decoder

Converts between normal UTF-8 and CESU-8 encodings.

CESU-8 encodes characters outside the Basic Multilingual Plane as two UTF-16 surrogate characters, which are then re-encoded as 3-byte UTF-8 characters. This means that 4-byte UTF-8 sequences become 6-byte CESU-8 sequences.
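The re-encoding described above can be sketched with the standard library alone. This is a minimal illustration, not this crate's implementation: `encode_cesu8_char` is a hypothetical helper name, and the sketch handles one `char` at a time.

```rust
/// Hypothetical helper (not part of this crate's API): append the CESU-8
/// encoding of a single `char` to `out`.
fn encode_cesu8_char(c: char, out: &mut Vec<u8>) {
    let mut units = [0u16; 2];
    let encoded = c.encode_utf16(&mut units);
    if encoded.len() == 1 {
        // Inside the Basic Multilingual Plane, CESU-8 matches UTF-8.
        let mut buf = [0u8; 4];
        out.extend_from_slice(c.encode_utf8(&mut buf).as_bytes());
    } else {
        // Outside the BMP: write each UTF-16 surrogate as a 3-byte sequence.
        for &u in encoded.iter() {
            out.push(0xE0 | ((u >> 12) as u8));
            out.push(0x80 | (((u >> 6) & 0x3F) as u8));
            out.push(0x80 | ((u & 0x3F) as u8));
        }
    }
}
```

For example, U+1F600 is `F0 9F 98 80` in UTF-8 (4 bytes) and comes out of this helper as `ED A0 BD ED B8 80` (6 bytes).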

We also support Java's Modified UTF-8 encoding, a variant of CESU-8 that encodes the null character (\0) as a two-byte sequence.

`lib.rs`:

A library implementing the CESU-8 compatibility encoding scheme. This is a non-standard variant of UTF-8 that is used internally by some systems that need to represent UTF-16 data as 8-bit characters.

The use of this encoding is discouraged by the Unicode Consortium. It's OK for working with existing APIs, but it should not be used for data transmission or storage.

Java and U+0000

Java uses the CESU-8 encoding as described above, but with one difference: the null character U+0000 is represented as an overlong UTF-8 sequence C0 80. This is supported by JavaStr and JavaString.
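Because the byte 00 only ever encodes U+0000 in CESU-8, and the byte C0 never appears in valid CESU-8, the Java difference can be illustrated as a plain byte-level rewrite. The sketch below assumes well-formed input and uses hypothetical helper names (`nul_to_java`, `java_to_nul`); it does not show how `JavaStr` or `JavaString` work internally.

```rust
/// Hypothetical helper: rewrite every NUL byte in CESU-8 data as the
/// overlong pair C0 80 used by Java.
fn nul_to_java(cesu8: &[u8]) -> Vec<u8> {
    let mut out = Vec::with_capacity(cesu8.len());
    for &b in cesu8 {
        if b == 0x00 {
            out.extend_from_slice(&[0xC0, 0x80]);
        } else {
            out.push(b);
        }
    }
    out
}

/// Hypothetical helper: the reverse direction, turning each C0 80 pair
/// back into a single NUL byte.
fn java_to_nul(java: &[u8]) -> Vec<u8> {
    let mut out = Vec::with_capacity(java.len());
    let mut i = 0;
    while i < java.len() {
        if java[i] == 0xC0 && java.get(i + 1) == Some(&0x80) {
            out.push(0x00);
            i += 2;
        } else {
            out.push(java[i]);
            i += 1;
        }
    }
    out
}
```

A practical consequence is that the Java form never contains a 00 byte, so it can be passed through APIs that treat NUL as a terminator.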

Surrogate pairs and UTF-8

The UTF-16 encoding uses "surrogate pairs" to represent Unicode code points in the range from U+10000 to U+10FFFF. Each surrogate in a pair is a 16-bit number in the range 0xD800 to 0xDFFF.

CESU-8 encodes these surrogate pairs as a 6-byte sequence consisting of two sets of three bytes.
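To make the layout concrete, here is a minimal sketch that decodes one such 6-byte sequence back into a `char` using only the standard library. `decode_cesu8_pair` is a hypothetical name chosen for illustration and is not taken from this crate.

```rust
/// Hypothetical helper (not part of this crate's API): decode one 6-byte
/// CESU-8 surrogate pair into the code point it represents.
fn decode_cesu8_pair(bytes: &[u8; 6]) -> Option<char> {
    // Decode one 3-byte group into a 16-bit value.
    fn unit(b: &[u8]) -> Option<u16> {
        // Surrogates always start with ED, followed by two continuation bytes.
        if b[0] != 0xED || (b[1] & 0xC0) != 0x80 || (b[2] & 0xC0) != 0x80 {
            return None;
        }
        Some(0xD000 | (((b[1] as u16) & 0x3F) << 6) | ((b[2] as u16) & 0x3F))
    }

    let high = unit(&bytes[0..3])?;
    let low = unit(&bytes[3..6])?;

    // The first unit must be a high surrogate, the second a low surrogate.
    if !(0xD800..=0xDBFF).contains(&high) || !(0xDC00..=0xDFFF).contains(&low) {
        return None;
    }

    // Standard UTF-16 surrogate arithmetic.
    let cp = 0x10000 + (((high as u32) - 0xD800) << 10) + ((low as u32) - 0xDC00);
    char::from_u32(cp)
}
```

Given `ED A0 BD ED B8 80`, the two groups decode to the surrogates 0xD83D and 0xDE00, which combine into U+1F600.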

Crate features

alloc - Enables all allocation-related features. This allows use of Cesu8String and JavaString, which offer a similar API to the standard library's String.

No runtime deps
