3 stable releases

1.1.1 Apr 3, 2023
1.0.1 Oct 19, 2019

GPL-3.0-only

utf8-norm, validate and normalize UTF-8 Unicode data

ABOUT

Version 1.1.0, licensed GPLv3. (C) 2019 Leonora Tindall <nora@nora.codes>

Fast command-line Unicode normalization, supporting stream-safety transformations as well as NFC, NFD, NFKD, and NFKC. Exits with failure if the incoming stream is not valid UTF-8.
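
As a rough illustration of the validate-then-normalize behavior described above, here is a minimal sketch using Python's standard unicodedata module. It is not the tool's own implementation; the function name validate_and_normalize and the sample string are hypothetical.

```python
import unicodedata

FORMS = ("NFC", "NFD", "NFKC", "NFKD")


def validate_and_normalize(raw: bytes, form: str = "NFC") -> str:
    """Decode raw as strict UTF-8, then apply one normalization form."""
    if form not in FORMS:
        raise ValueError(f"unknown normalization form: {form}")
    # bytes.decode raises UnicodeDecodeError on invalid UTF-8; a command-line
    # tool would report that as a failing exit status.
    text = raw.decode("utf-8")
    return unicodedata.normalize(form, text)


# U+FB01 (the "fi" ligature) followed by "anc" and precomposed U+00E9.
sample = "\ufb01anc\u00e9".encode("utf-8")
for form in FORMS:
    result = validate_and_normalize(sample, form)
    print(form, [f"U+{ord(c):04X}" for c in result])
```

The canonical forms (NFC, NFD) leave the ligature alone and only compose or decompose the accented letter, while the compatibility forms (NFKC, NFKD) also replace the ligature with the two letters "f" and "i".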

USAGE

Usage: utf8-norm [--nfc | --nfd | --nfkc | --nfkd] [--stream-safe] [--crlf] [--buffered] [<infile> [<outfile>]]

<infile> (default stdin) - file from which to read bytes.
<outfile> (default stdout) - file to which to write normalized Unicode.
-w, --crlf  - write CRLF (Windows) instead of LF only (Unix) at the end of lines.
-d, --nfd   - write NFD (canonical decomposition).
-D, --nfkd  - write NFKD (compatibility decomposition).
-c, --nfc   - write NFC (canonical composition computed from NFD). This is the default.
-C, --nfkc  - write NFKC (canonical composition computed from NFKD).
-s, --stream-safe   - write stream-safe text (inserts Combining Grapheme Joiners, UAX15-D4).
-b, --buffered  - read the entire input file into memory before operating on it.
-V, --version - output version information and exit.
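
The stream-safe option is the least self-explanatory of the flags above, so here is a sketch of the Stream-Safe Text Process as UAX #15 describes it: a Combining Grapheme Joiner (U+034F) is inserted whenever a run of non-starters would exceed 30 code points. This is an illustration in Python, not the tool's own code; the names stream_safe, CGJ, and MAX_NON_STARTERS are hypothetical.

```python
import unicodedata

CGJ = "\u034f"          # COMBINING GRAPHEME JOINER
MAX_NON_STARTERS = 30   # limit given in UAX #15


def stream_safe(text: str) -> str:
    """Insert CGJ so no run of non-starters exceeds MAX_NON_STARTERS."""
    out = []
    count = 0  # non-starters seen since the last starter
    for ch in text:
        # Counting is done on the NFKD decomposition of each code point.
        classes = [unicodedata.combining(c)
                   for c in unicodedata.normalize("NFKD", ch)]
        leading = 0
        for cc in classes:
            if cc == 0:
                break
            leading += 1
        if count + leading > MAX_NON_STARTERS:
            out.append(CGJ)
            count = 0
        out.append(ch)
        if leading == len(classes):
            # No starter in the decomposition: the run keeps growing.
            count += len(classes)
        else:
            # Otherwise only the trailing non-starters carry over.
            trailing = 0
            for cc in reversed(classes):
                if cc == 0:
                    break
                trailing += 1
            count = trailing
    return "".join(out)


long_run = "a" + "\u0301" * 35   # one base letter, 35 combining acute accents
safe = stream_safe(long_run)
print(safe.count(CGJ))           # 1: a joiner precedes the 31st accent
```

Ordinary text, which rarely stacks more than a few combining marks on one base character, passes through unchanged.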

utf8-norm operates linewise on the input unless --buffered is specified.

The --buffered option is primarily useful for reading from and writing to the same file. It reads bytes from the input until end of file and only then begins processing lines of the input.
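
The difference between linewise and buffered operation can be sketched as follows. This is an assumption-laden illustration in Python rather than the tool's actual code: the function normalize_stream and its parameters are hypothetical, and it always terminates the final line, which the real tool may or may not do.

```python
import io
import unicodedata


def normalize_stream(infile, outfile, form="NFC", buffered=False, crlf=False):
    """Copy text from infile to outfile, normalizing one line at a time."""
    newline = "\r\n" if crlf else "\n"
    # Buffered: list() drains the input to end of file before any write.
    # Linewise: lines are pulled lazily, so output can start immediately.
    lines = list(infile) if buffered else infile
    for line in lines:
        body = line.rstrip("\r\n")  # drop the existing line ending
        outfile.write(unicodedata.normalize(form, body) + newline)


# In-memory streams stand in for real files here.
src = io.StringIO("cafe\u0301\nnai\u0308ve\n")
dst = io.StringIO()
normalize_stream(src, dst, form="NFC", buffered=True, crlf=True)
print(repr(dst.getvalue()))
```

In buffered mode nothing is written until the whole input has been consumed, which is the property that matters when the input and output name the same file.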

EXAMPLES

Write the contents of input.txt, compatibly decomposed, with CRLF line endings, to output.txt:

utf8-norm --nfkd --crlf input.txt output.txt

Normalize file.md, in the canonical composition, buffering the file in memory to avoid overwriting it with zeros:

utf8-norm --buffered file.md file.md

Emit the output of my_program to stdout, in the canonical composition, linewise:

my_program | utf8-norm

Buffer the entire output of my_program in memory, and emit it to my_program.out in the canonical composition after receiving end-of-file:

my_program | utf8-norm --buffered - my_program.out

ABOUT

utf8-norm was created at Rust Belt Rust 2019 in Dayton, OH. Thanks to @j41manning for her excellent talk regarding Unicode handling in Rust.

Install natively with cargo install utf8-norm, or from your distribution's package manager.