4 releases

0.2.4	Apr 9, 2023
0.2.3	Apr 8, 2023
0.2.1	Apr 3, 2023
0.1.1	Nov 25, 2022

#825 in Encoding

Apache-2.0/MIT

230KB
3.5K SLoC

CESU-8 encoder/decoder for Rust

Convert between ordinary UTF-8 and CESU-8 (Wikipedia, Unicode Consortium) encodings.

CESU-8 encodes characters outside the Basic Multilingual Plane as two UTF-16 surrogate chacaters, which are then further re-encoded as invalid, 3-byte UTF-8 characters. This means that 4-byte UTF-8 sequences become 6-byte CESU-8 sequences.

Note that CESU-8 is only intended for internal use within tightly-coupled systems, and not for data interchange.

This encoding is sometimes needed when working with Java, Oracle or MySQL, and when trying to store emoji, hieroglyphs, or other characters on the Supplementary Multilingual Plane or the Supplementary Ideographic Plane.

We also support Java's Modified UTF-8 encoding, which additionally encodes \0 using a multi-byte UTF-8 sequence.

Commandline Utilities

Installing: cargo install cesu8str --features=build-binary

cesu8str

encode/decode between MUTF8/CESU8 and UTF8

$ cesu8str --help
Converts files or standard IO streams between standard UTF8 and CESU8, or the JVM's modified UTF-8.
Note that the default Windows' console does not support non-UTF8 sequences - attempting to type/print them will result in 
an error.

This tool will immediately exit upon finding an invalid character sequence.

EXIT CODES:
0 - if completed normally
1 - if an IO error has occured
2 - if an encoding error has occured (invalid/incomplete character sequences/etc)


Usage: cesu8str [OPTIONS] [INPUT] [OUTPUT]

Arguments:
  [INPUT]   The input file. Defaults to stdin if '-' or not set
  [OUTPUT]  The output file. Defaults to stdout if '-' or not set

Options:
  -j, --java     Toggles the use of the JVM's modified UTF8. In effect, it encodes nul-bytes as 0xC0,0x80 while nul-bytes 
are left alone in normal mode
  -d, --decode   Decodes CESU8 text into standard UTF8. By default, this tool encodes UTF8 to CESU8
  -h, --help     Print help information
  -V, --version  Print version information

count_codepoints - count unicode code point lengths for UTF8/16 and friends

More of a utility tool for verifying functionality during development.

$ count_codepoints --help
USAGE: count_codepoints [--json] [file] [KB_chunk_size=4]

Reads a file and emits the number of occurences differing lengths of UTF-8 codepoints on stdout.

If `file` is not supplied, stdin will be read.

Exit Codes:
0       Success
1       Command-line error
2       IO Error

Example output:

$ count_codepoints ./test_files/random.txt      
len4: 1001
len3: 1013
len2: 966
ascii: 1095
nul: 0
total (chars/bytes): 4075/10070

$ count_codepoints ./test_files/random.cesu8.txt
len4: 0
len3: 3015
len2: 966
ascii: 1095
nul: 0
total (chars/bytes): 5076/1207

License

Some of this code is adapted from Rust's src/libcore/str.rs file. This code is covered by LICENSE-RUST.txt and copyright by The Rust Project Developers and individual Rust contributors, as described in that file.

Most code in this project was forked, and adapted from, Eric Kidd's cesu8 crate. That adapted code is distributed under the same terms as the original project at time of fork.

The new code in this project is distributed under the same terms.

Dependencies

~0–255KB