11 releases (7 stable)
| Version | Released |
|---|---|
| 1.3.1 (new) | Jul 24, 2021 |
| 1.2.1 | Apr 12, 2021 |
| 1.2.0 | Mar 24, 2021 |
| 1.1.1 | Apr 23, 2020 |
| 0.4.0 | May 5, 2018 |
#19 in Text processing
129,948 downloads per month
Used in 132 crates (14 directly)
The deunicode library transliterates Unicode strings such as "Æneid" into pure
ASCII ones such as "AEneid".
It started as a Rust port of the
Text::Unidecode Perl module, and was extended to support emoji.
This is a fork of the unidecode crate. This fork uses a compact representation of Unicode data to minimize memory overhead and executable size (about 70K codepoints mapped to 240K ASCII characters, using 450KB of memory, 160KB gzipped).
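The exact data layout is not described here. As a rough illustration of why a packed table saves space, one common approach is to concatenate every replacement string into a single blob and keep a sorted index of offsets into it. The names `BLOB`, `INDEX`, and `packed_lookup` below are hypothetical, the three entries are toy data, and none of this is claimed to match the crate's internals.

```rust
/// All replacement strings concatenated into one blob (toy data).
const BLOB: &str = "AEeBei";

/// Sorted (code point, offset into BLOB, length) entries.
/// Hypothetical layout, for illustration only.
const INDEX: &[(u32, u16, u8)] = &[
    (0x00C6, 0, 2), // 'Æ' -> "AE"
    (0x00E9, 2, 1), // 'é' -> "e"
    (0x5317, 3, 3), // '北' -> "Bei"
];

/// Binary-search the index and slice the replacement out of the blob.
fn packed_lookup(ch: char) -> Option<&'static str> {
    let i = INDEX.binary_search_by_key(&(ch as u32), |e| e.0).ok()?;
    let (_, off, len) = INDEX[i];
    let start = off as usize;
    Some(&BLOB[start..start + len as usize])
}

fn main() {
    assert_eq!(packed_lookup('Æ'), Some("AE"));
    assert_eq!(packed_lookup('北'), Some("Bei"));
    assert_eq!(packed_lookup('x'), None); // not in the toy index
}
```

Storing one blob plus small fixed-size index entries avoids a separate heap allocation or pointer-sized slot for every replacement string.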
    extern crate deunicode;
    use deunicode::deunicode;

    assert_eq!(deunicode("Æneid"), "AEneid");
    assert_eq!(deunicode("étude"), "etude");
    assert_eq!(deunicode("北亰"), "Bei Jing");
    assert_eq!(deunicode("ᔕᓇᓇ"), "shanana");
    assert_eq!(deunicode("げんまい茶"), "genmaiCha");
    assert_eq!(deunicode("🦄☣"), "unicorn biohazard");
Here are some guarantees you have when calling `deunicode()`:
- The `String` returned will be valid ASCII; the decimal representation of every `char` in the string will be between 0 and 127, inclusive.
- Every ASCII character (0x00 - 0x7F) is mapped to itself.
- All Unicode characters will translate to printable ASCII characters (`\n` or characters in the range 0x20 - 0x7E).
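If you want to check these guarantees defensively in your own code, a minimal sketch using only the standard library could look like this. `is_valid_ascii` and `is_printable_ascii` are hypothetical helper names, not part of deunicode. Because ASCII input is passed through unchanged, control characters already present in the input (such as a tab) will still be present in the output, so the printable check is only meaningful for text derived from non-ASCII characters.

```rust
/// Hypothetical helper: true if every byte of `s` is in 0..=127.
fn is_valid_ascii(s: &str) -> bool {
    s.is_ascii()
}

/// Hypothetical helper: true if every byte is `\n` or in 0x20..=0x7E.
fn is_printable_ascii(s: &str) -> bool {
    s.bytes().all(|b| b == b'\n' || (0x20..=0x7E).contains(&b))
}

fn main() {
    // Outputs taken from the transliteration examples in this README.
    for out in ["AEneid", "Bei Jing", "unicorn biohazard"] {
        assert!(is_valid_ascii(out));
        assert!(is_printable_ascii(out));
    }
    // A tab is valid ASCII but falls outside the printable range.
    assert!(is_valid_ascii("a\tb"));
    assert!(!is_printable_ascii("a\tb"));
    // Untransliterated input fails the ASCII check.
    assert!(!is_valid_ascii("Æneid"));
}
```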
There are, however, some things you should keep in mind:
- Some transliterations do produce `\n` characters.
- Some Unicode characters transliterate to an empty string, either on purpose or because `deunicode` does not know about the character.
- Some Unicode characters are unknown and transliterate to `"[?]"` (or a custom placeholder, or `None` if you use a chars iterator).
- Many Unicode characters transliterate to multi-character strings. For example, "北" is transliterated as "Bei".
- Transliteration is context-free and not sophisticated enough to produce proper Chinese or Japanese. Han characters used in multiple languages are mapped to a single Mandarin pronunciation, and will be mostly illegible to Japanese readers. Transliteration can't handle cases where a single character has multiple possible pronunciations.
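The caveats above can be sketched with a toy, standard-library-only model of per-character transliteration. `toy_lookup` and `toy_transliterate` are hypothetical names, and the four table entries are illustrative stand-ins rather than the crate's real data. The sketch shows a multi-character expansion, a deliberately empty replacement, and an unknown character falling back to a caller-supplied placeholder.

```rust
/// Toy per-character table. `None` means "unknown to this table".
/// Entries are illustrative only, not the crate's real data.
fn toy_lookup(ch: char) -> Option<&'static str> {
    match ch {
        'Æ' => Some("AE"),      // one character becomes two
        'é' => Some("e"),
        '北' => Some("Bei"),    // one character becomes three
        '\u{0301}' => Some(""), // combining accent: empty on purpose
        _ => None,
    }
}

/// Context-free transliteration: each character is looked up on its own,
/// so neighbouring characters never influence the result.
fn toy_transliterate(input: &str, placeholder: &str) -> String {
    let mut out = String::with_capacity(input.len());
    for ch in input.chars() {
        if ch.is_ascii() {
            out.push(ch); // ASCII maps to itself
        } else {
            out.push_str(toy_lookup(ch).unwrap_or(placeholder));
        }
    }
    out
}

fn main() {
    assert_eq!(toy_transliterate("Æneid", "[?]"), "AEneid");
    // 'e' followed by a combining acute accent: the accent vanishes.
    assert_eq!(toy_transliterate("e\u{0301}tude", "[?]"), "etude");
    // A private-use code point is unknown, so the placeholder is used.
    assert_eq!(toy_transliterate("\u{E000}", "[?]"), "[?]");
    assert_eq!(toy_transliterate("\u{E000}", "?"), "?");
}
```

Because the output can be longer or shorter than the input, avoid assuming that character positions in the result line up with positions in the original string.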
For a detailed explanation of the rationale behind the original dataset, refer to this article written by Burke in 2001.