

Converts text between encodings the easy and efficient way

2 releases

0.1.1 Oct 24, 2021
0.1.0 Oct 24, 2021

Used in aconv




This is a transcoding library. Transcoding here means converting text from one encoding to another.

There are two excellent crates, chardetng and encoding_rs. chardetng was created for encoding detection, and encoding_rs can be used for transcoding. This library aims to make transcoding easy and efficient by combining these two crates.

Note: Supported encodings are the ones defined in the Encoding Standard.

Note: UTF-16 files must have a BOM to be detected as UTF-16.
This is because chardetng, on which this library depends, does not support UTF-16, so this library adds only BOM sniffing to detect it.
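
As an illustration of what BOM sniffing for UTF-16 involves, here is a minimal std-only sketch. The names `Utf16Bom` and `sniff_utf16_bom` are hypothetical and are not part of this library's API; the library's own implementation may differ.

```rust
/// Byte order signalled by a UTF-16 byte order mark (hypothetical type).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Utf16Bom {
    LittleEndian,
    BigEndian,
}

/// Returns the UTF-16 byte order if `bytes` starts with a UTF-16 BOM,
/// or `None` if no UTF-16 BOM is present.
pub fn sniff_utf16_bom(bytes: &[u8]) -> Option<Utf16Bom> {
    match bytes {
        // U+FEFF serialized as little-endian UTF-16.
        [0xFF, 0xFE, ..] => Some(Utf16Bom::LittleEndian),
        // U+FEFF serialized as big-endian UTF-16.
        [0xFE, 0xFF, ..] => Some(Utf16Bom::BigEndian),
        // No BOM: UTF-16 cannot be recognized this way.
        _ => None,
    }
}
```

With this sketch, UTF-16 text without a BOM (for example the bytes of "A" and "B" as `41 00 42 00`) yields `None`, which matches the limitation described in the note. encoding_rs also offers `Encoding::for_bom`, which recognizes the UTF-8 BOM in addition to the two UTF-16 BOMs.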


See the API documentation for details.

How encoding detection works

Since text is internally just a byte sequence, there is no way to detect the correct encoding with 100% accuracy.
So the encoding has to be guessed.
The flow is roughly as follows.

  1. Sniff for a BOM to detect UTF-16.
    If a BOM is found, skip guessing the encoding.
  2. Guess the encoding using chardetng.
  3. Decode the text using encoding_rs.
  4. Check the decoded text for non-text characters, which are described below.
    If the number of non-text characters does not exceed the threshold, output the decoded text.
    Otherwise, emit an error message and output the input text as is.
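
Step 4 of the flow above can be sketched as a small std-only function. This is an illustration under stated assumptions, not the library's code: the function name `check_decoded` is hypothetical, the non-text predicate is passed in by the caller, and the threshold is assumed to be a plain character count because the text above does not say how it is defined.

```rust
/// Keeps `decoded` if it contains at most `threshold` non-text characters.
/// Otherwise returns an error message; the caller is then expected to
/// report it and write the original input bytes unchanged.
///
/// Assumption: `threshold` is an absolute count, not a ratio.
pub fn check_decoded<F>(decoded: &str, is_non_text: F, threshold: usize) -> Result<&str, String>
where
    F: Fn(char) -> bool,
{
    // Count the characters that the caller-supplied predicate flags.
    let count = decoded.chars().filter(|&c| is_non_text(c)).count();
    if count <= threshold {
        Ok(decoded)
    } else {
        Err(format!(
            "{} non-text characters exceed the threshold of {}",
            count, threshold
        ))
    }
}
```

A caller would run steps 1 to 3 first, pass the decoded string here, and on `Err` print the message and emit the input as is.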

Non-text characters

Characters treated as non-text in this library are the same ones the file command treats as non-text, plus the REPLACEMENT CHARACTER.
Namely, U+0000 to U+0006, U+000E to U+001A, U+001C to U+001F, U+007F, and U+FFFD are treated as non-text characters.
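
The ranges listed above translate directly into a predicate. The following is a minimal sketch; the function name `is_non_text` is hypothetical and may not match the library's internals.

```rust
/// Returns true for the code points listed above as non-text:
/// most C0 control characters, DEL, and the REPLACEMENT CHARACTER.
pub fn is_non_text(c: char) -> bool {
    matches!(
        c,
        '\u{0000}'..='\u{0006}'
            | '\u{000E}'..='\u{001A}'
            | '\u{001C}'..='\u{001F}'
            | '\u{007F}'
            | '\u{FFFD}'
    )
}
```

Note what the ranges leave out: U+0007 to U+000D (bell, backspace, tab, line feed, vertical tab, form feed, carriage return) and U+001B (escape) still count as text.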


Licensed under either of

at your option.


Unless you explicitly state otherwise, any contribution intentionally submitted for inclusion in the work by you, as defined in the Apache-2.0 license, shall be dual licensed as above, without any additional terms or conditions.

