3 stable releases
1.2.1 | Jun 5, 2024 |
---|---|
1.2.0 | Jun 1, 2024 |
1.1.0 | May 26, 2024 |
1.0.2 | |
1.0.0 | |
#902 in Text processing
397 downloads per month
150KB, 2.5K SLoC
CESU-8 Encoder & Decoder
Converts between normal UTF-8 and CESU-8 encodings.
CESU-8 encodes characters outside the Basic Multilingual Plane as two UTF-16 surrogate characters, which are then re-encoded as 3-byte UTF-8 characters. This means that 4-byte UTF-8 sequences become 6-byte CESU-8 sequences.
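The re-encoding described above can be sketched with the standard library alone. This is a minimal illustration, not this crate's API: the function name `to_cesu8` is hypothetical, and the sketch always allocates a new buffer.

```rust
/// Encode a UTF-8 string as CESU-8.
/// Hypothetical helper for illustration only; not part of this crate's API.
fn to_cesu8(input: &str) -> Vec<u8> {
    let mut out = Vec::with_capacity(input.len());
    let mut utf8 = [0u8; 4];
    let mut units = [0u16; 2];
    for ch in input.chars() {
        if (ch as u32) < 0x10000 {
            // Characters inside the Basic Multilingual Plane are encoded
            // identically in UTF-8 and CESU-8.
            out.extend_from_slice(ch.encode_utf8(&mut utf8).as_bytes());
        } else {
            // Supplementary characters: split into a UTF-16 surrogate pair,
            // then write each surrogate as a 3-byte sequence.
            for &unit in ch.encode_utf16(&mut units).iter() {
                out.push(0xE0 | (unit >> 12) as u8);
                out.push(0x80 | ((unit >> 6) & 0x3F) as u8);
                out.push(0x80 | (unit & 0x3F) as u8);
            }
        }
    }
    out
}
```

For example, U+1F600 is the 4-byte UTF-8 sequence `F0 9F 98 80`, and this sketch turns it into the 6-byte sequence `ED A0 BD ED B8 80`.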
We also support Java's Modified UTF-8 encoding, a variant of CESU-8 that encodes `\0` as a two-byte sequence.
lib.rs:
A library implementing the CESU-8 compatibility encoding scheme. This is a non-standard variant of UTF-8 that is used internally by some systems that need to represent UTF-16 data as 8-bit characters.
The use of this encoding is discouraged by the Unicode Consortium. It's OK for working with existing APIs, but it should not be used for data transmission or storage.
Java and U+0000
Java uses the CESU-8 encoding as described above, but with one difference: the null character U+0000 is represented as the overlong UTF-8 sequence `C0 80`. This is supported by `JavaStr` and `JavaString`.
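As a rough sketch of that one difference, the conversion between plain CESU-8 bytes and Java's variant only has to rewrite the null character. The function names below are hypothetical and are not how `JavaStr` or `JavaString` are implemented; the sketch assumes its input is already valid CESU-8 (or valid Modified UTF-8 for the reverse direction).

```rust
/// Rewrite CESU-8 bytes into Java's variant.
/// Hypothetical helper; assumes `bytes` is valid CESU-8.
fn cesu8_to_java(bytes: &[u8]) -> Vec<u8> {
    let mut out = Vec::with_capacity(bytes.len());
    for &b in bytes {
        if b == 0x00 {
            // In valid CESU-8 a 0x00 byte can only be U+0000 itself,
            // because continuation bytes are always >= 0x80.
            out.extend_from_slice(&[0xC0, 0x80]);
        } else {
            out.push(b);
        }
    }
    out
}

/// Rewrite Java's variant back into plain CESU-8.
/// Hypothetical helper; assumes `bytes` is valid Modified UTF-8.
fn java_to_cesu8(bytes: &[u8]) -> Vec<u8> {
    let mut out = Vec::with_capacity(bytes.len());
    let mut i = 0;
    while i < bytes.len() {
        if bytes[i] == 0xC0 && bytes.get(i + 1) == Some(&0x80) {
            // The overlong pair C0 80 stands for U+0000.
            out.push(0x00);
            i += 2;
        } else {
            out.push(bytes[i]);
            i += 1;
        }
    }
    out
}
```

One consequence of this design is that Java-encoded strings never contain a `00` byte, so they can be passed through APIs that treat it as a terminator.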
Surrogate pairs and UTF-8
The UTF-16 encoding uses "surrogate pairs" to represent Unicode code points in the range from U+10000 to U+10FFFF. Each surrogate is a 16-bit number in the range 0xD800 to 0xDFFF.
CESU-8 encodes these surrogate pairs as a 6-byte sequence consisting of two sets of three bytes.
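To make the 6-byte layout concrete, here is a minimal sketch that decodes one such sequence back into a code point. The function name `decode_surrogate_pair` is hypothetical and is not taken from this crate. It relies on the fact that high surrogates fall in 0xD800 to 0xDBFF and low surrogates in 0xDC00 to 0xDFFF.

```rust
/// Decode one 6-byte CESU-8 surrogate pair into a `char`.
/// Hypothetical helper for illustration; returns `None` on malformed input.
fn decode_surrogate_pair(bytes: [u8; 6]) -> Option<char> {
    // Recover a 16-bit UTF-16 unit from one 3-byte sequence.
    fn unit(b: &[u8]) -> Option<u16> {
        // Every surrogate starts with lead byte 0xED, followed by two
        // continuation bytes of the form 10xxxxxx.
        if b[0] != 0xED || b[1] & 0xC0 != 0x80 || b[2] & 0xC0 != 0x80 {
            return None;
        }
        Some(((b[0] as u16 & 0x0F) << 12) | ((b[1] as u16 & 0x3F) << 6) | (b[2] as u16 & 0x3F))
    }

    let hi = unit(&bytes[..3])?;
    let lo = unit(&bytes[3..])?;

    // The first unit must be a high surrogate, the second a low surrogate.
    if !(0xD800..=0xDBFF).contains(&hi) || !(0xDC00..=0xDFFF).contains(&lo) {
        return None;
    }

    // Combine the two 10-bit payloads and add the 0x10000 offset.
    let cp = 0x10000 + (((hi as u32 - 0xD800) << 10) | (lo as u32 - 0xDC00));
    char::from_u32(cp)
}
```

For instance, the bytes `ED A0 BD ED B8 80` hold the surrogates 0xD83D and 0xDE00, which combine to U+1F600.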
Crate features
Alloc - Enables all allocation-related features. This allows usage of `Cesu8String` and `JavaString`, which offer a similar API to the standard library's `String`.