#unicode #convert #unicode-text #encoding #data #utf-16 #cli

bin+lib unicode_converter

A library and a CLI tool to convert data between various Unicode encodings

3 releases

0.1.2 Aug 10, 2023
0.1.1 Aug 10, 2023
0.1.0 Mar 28, 2022

#1798 in Text processing

BSD-3-Clause

58KB
1K SLoC

Unicode converter

This repository contains both a library and a CLI tool to convert data between various Unicode encodings.

The supported encodings are:

  • UTF-8
  • CESU-8
  • UTF-16
  • UTF-32
  • UTF-1

CLI tool

The CLI tool is meant to be a demonstration of the library but it can be used on its own if needed. It is made in a single file, str/main.rs.

Usage

A tool to convert Unicode text files between multiple Unicode encodings. The available encodings are
UTF-8, UTF-1, CESU-8, UTF-16, and UTF-32. By default, the data is assumed to be little-endian, but for encodings
with multi-byte words such as UTF-16 or UTF-32, you can add the `_be` suffix to indicate that you
want to work with big-endian data

USAGE:
    unicode_converter [OPTIONS] --input-file <INPUT_FILE> --decoding-input <DECODING_INPUT> --encoding-output <ENCODING_OUTPUT>

OPTIONS:
    -d, --decoding-input <DECODING_INPUT>
            Input file encoding

    -e, --encoding-output <ENCODING_OUTPUT>
            Output file encoding

    -h, --help
            Print help information

    -i, --input-file <INPUT_FILE>
            Input file used as input. You can use `-` if you mean `/dev/stdin`

    -o, --output-file <OUTPUT_FILE>
            Output file [default: /dev/stdout]

Compilation

To compile it, simply run cargo build as it is the only executable crate in this repository.

Library

All the code in src/ except for src/main.rs makes the Unicode encoding converting library.

Behavior

The various Unicode encodings are all made with their own type implementing the UnicodeEncoding trait. Running cargo doc will give you complete information but the intended way of using the library is the following:

  • Read data from a file or a slice of bytes. For example, too read UTF-16 data from a file, do let content = Utf16::from_file("filename.txt", false).unwrap();. Note the false used to indicate that the encoding is little-endian.
  • Then, convert it to an other encoding. For example, to convert to UTF-8: let converted = content.convert_to::<Utf8>();.
  • Finally, you can write the converted data to a new file. converted.to_file("new_file.txt", false);. As UTF-8 is only on one byte, the boolean argument to take care of the endianess is ignored.

Dependencies

~3MB
~63K SLoC