#emoji #Unicode #ASCII #transliteration #unidecode

deunicode

Convert Unicode strings to pure ASCII by intelligently transliterating them. Supports Emoji and Chinese.

5 releases (1 stable)

1.0.0 Dec 22, 2018
0.4.3 Sep 13, 2018
0.4.2 Sep 7, 2018
0.4.1 Sep 7, 2018
0.4.0 May 5, 2018

#38 in Text processing

8,277 downloads per month
Used in 96 crates (5 directly)

BSD-3-Clause

110KB
113 lines

The deunicode library transliterates Unicode strings such as "Æneid" into pure ASCII ones such as "AEneid".

It started as a Rust port of the Text::Unidecode Perl module, and was extended to support emoji.

This is a fork of the unidecode crate. This fork uses a compact representation of Unicode data to minimize memory overhead and executable size.

Examples

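extern crate deunicode;
use deunicode::deunicode;

fn main() {
    // Latin ligatures and accented letters
    assert_eq!(deunicode("Æneid"), "AEneid");
    assert_eq!(deunicode("étude"), "etude");
    // Han characters are read as Mandarin
    assert_eq!(deunicode("北亰"), "Bei Jing");
    // Canadian Aboriginal syllabics
    assert_eq!(deunicode("ᔕᓇᓇ"), "shanana");
    // Mixed hiragana and Han
    assert_eq!(deunicode("げんまい茶"), "genmaiCha");
    // Emoji and symbols become their names
    assert_eq!(deunicode("🦄☣"), "unicorn face biohazard");
}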

Guarantees and Warnings

Here are some guarantees you have when calling deunicode():

  • The String returned will be valid ASCII; the code point of every char in the string will be between 0 and 127, inclusive.
  • Every ASCII character (0x00 - 0x7F) is mapped to itself.
  • All non-ASCII Unicode characters will translate to printable ASCII characters (\n or characters in the range 0x20 - 0x7E).
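
The third guarantee can be checked mechanically. The sketch below is a standalone illustration, not part of deunicode: is_newline_or_printable_ascii is a hypothetical helper name, and the sample strings are literals rather than live calls into the crate. It could be applied to the result of deunicode() in a test suite.

```rust
/// Hypothetical helper (not part of deunicode): returns true when every byte
/// is either '\n' or printable ASCII in the range 0x20..=0x7E, which is the
/// documented output range for transliterated non-ASCII characters.
fn is_newline_or_printable_ascii(s: &str) -> bool {
    s.bytes().all(|b| b == b'\n' || (0x20..=0x7E).contains(&b))
}

fn main() {
    // Outputs quoted from the Examples section.
    assert!(is_newline_or_printable_ascii("AEneid"));
    assert!(is_newline_or_printable_ascii("Bei Jing"));
    // Newlines are inside the guaranteed range.
    assert!(is_newline_or_printable_ascii("line one\nline two"));
    // Untransliterated text is outside it.
    assert!(!is_newline_or_printable_ascii("étude"));
    // ASCII control characters map to themselves, so input that already
    // contains them (here a tab) can still fail this stricter check.
    assert!(!is_newline_or_printable_ascii("tab\tseparated"));
}
```

Because ASCII input is passed through unchanged, the check only holds unconditionally for output produced from non-ASCII characters; strip control characters from the input first if the stricter property is needed.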

There are, however, some things you should keep in mind:

  • As stated, some transliterations do produce \n characters.
  • Some Unicode characters transliterate to an empty string, either on purpose or because deunicode does not know about the character.
  • Some Unicode characters are unknown and transliterate to "[?]" (or a custom placeholder, or None if you use a chars iterator).
  • Many Unicode characters transliterate to multi-character strings. For example, "北" is transliterated as "Bei".
  • Han characters used in multiple languages are mapped to Mandarin, and will be mostly illegible to Japanese readers.
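
Callers that need single-line, non-empty output (for example, for a display name or a slug) may want to post-process the result with these warnings in mind. The following is a minimal sketch under stated assumptions: tidy_transliteration is a hypothetical helper, not part of deunicode, and it assumes the default "[?]" placeholder is in use.

```rust
/// Hypothetical post-processing helper (not part of deunicode).
/// Removes the default "[?]" placeholder, collapses newlines and runs of
/// spaces into single spaces, and reports an empty result as None.
fn tidy_transliteration(raw: &str) -> Option<String> {
    // Caveat: this also removes a literal "[?]" that was present in the input.
    let without_placeholder = raw.replace("[?]", "");
    // split_whitespace treats '\n' as whitespace, so newlines disappear here.
    let tidy = without_placeholder
        .split_whitespace()
        .collect::<Vec<_>>()
        .join(" ");
    if tidy.is_empty() {
        None
    } else {
        Some(tidy)
    }
}

fn main() {
    // Multi-character output with stray whitespace is normalized.
    assert_eq!(tidy_transliteration("Bei  Jing\n"), Some("Bei Jing".to_string()));
    // Unknown characters rendered as "[?]" are dropped.
    assert_eq!(tidy_transliteration("foo[?] bar"), Some("foo bar".to_string()));
    // Input that transliterated to nothing useful is reported as None.
    assert_eq!(tidy_transliteration("[?]"), None);
    assert_eq!(tidy_transliteration(""), None);
}
```

Returning an Option forces the caller to decide on a fallback when nothing survives transliteration, rather than silently storing an empty string.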

Unicode data

For a detailed explanation of the rationale behind the original dataset, refer to the article written by Burke in 2001.

No runtime deps