deunicode
The deunicode library transliterates Unicode strings such as "Æneid" into pure ASCII ones such as "AEneid". It includes support for emoji and is compatible with no_std Rust environments.
This is a maintained alternative to the unidecode crate, which started as a Rust port of the Text::Unidecode Perl module.
Deunicode is quite fast, and uses a compact representation of Unicode data to minimize memory overhead and executable size (about 70K codepoints mapped to 240K ASCII characters, using 450KB of memory, 160KB gzipped).
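The text above does not describe the compact representation itself. As a rough illustration of the general idea only, the sketch below stores all replacement strings back to back in one blob and keeps a small index of offsets and lengths into it. The names BLOB, INDEX and lookup, the layout, and the four entries are assumptions made for this example; they are not deunicode's actual data format.

```rust
/// All replacement strings stored back to back in one blob.
/// Hypothetical layout for illustration; not deunicode's real data.
const BLOB: &str = "AEaeBei Jing ";

/// Sorted by code point so lookups can binary-search.
/// Each entry is (code point, start offset in BLOB, length in bytes).
const INDEX: &[(char, u16, u8)] = &[
    ('Æ', 0, 2),  // "AE"
    ('æ', 2, 2),  // "ae"
    ('亰', 8, 5), // "Jing "
    ('北', 4, 4), // "Bei "
];

/// Looks up the replacement for one character, or None if it is not indexed.
fn lookup(c: char) -> Option<&'static str> {
    let i = INDEX.binary_search_by_key(&c, |&(ch, _, _)| ch).ok()?;
    let (_, start, len) = INDEX[i];
    let start = start as usize;
    Some(&BLOB[start..start + len as usize])
}

fn main() {
    assert_eq!(lookup('Æ'), Some("AE"));
    assert_eq!(lookup('北'), Some("Bei "));
    assert_eq!(lookup('☃'), None);
}
```

A layout like this avoids one heap allocation and one pointer per mapping, which is the kind of saving that matters when tens of thousands of codepoints are covered.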
Examples
use deunicode::deunicode;
assert_eq!(deunicode("Æneid"), "AEneid");
assert_eq!(deunicode("étude"), "etude");
assert_eq!(deunicode("北亰"), "Bei Jing");
assert_eq!(deunicode("ᔕᓇᓇ"), "shanana");
assert_eq!(deunicode("げんまい茶"), "genmaiCha");
assert_eq!(deunicode("🦄☣"), "unicorn biohazard");
Guarantees and Warnings
Here are some guarantees you have when calling deunicode():
- The String returned will be valid ASCII; the decimal representation of every char in the string will be between 0 and 127, inclusive.
- Every ASCII character (0x00 - 0x7F) is mapped to itself.
- All Unicode characters will translate to printable ASCII characters (\n or characters in the range 0x20 - 0x7E).
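The third guarantee can be checked mechanically. The following is a minimal sketch using only the standard library; is_printable_ascii is a hypothetical helper name and is not part of deunicode. It could be applied to the output of deunicode() in a test, provided the input contains no ASCII control characters other than \n (those map to themselves under the second guarantee).

```rust
/// Returns true if every byte is '\n' or in the printable range 0x20..=0x7E.
/// Hypothetical helper for illustration; not part of the deunicode API.
fn is_printable_ascii(s: &str) -> bool {
    s.bytes().all(|b| b == b'\n' || (0x20..=0x7E).contains(&b))
}

fn main() {
    // Strings like the expected outputs in the Examples section pass the check.
    assert!(is_printable_ascii("Bei Jing"));
    assert!(is_printable_ascii("unicorn biohazard\n"));
    // Non-ASCII text and control characters other than '\n' fail it.
    assert!(!is_printable_ascii("étude"));
    assert!(!is_printable_ascii("tab\there"));
}
```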
There are, however, some things you should keep in mind:
- Some transliterations do produce \n characters.
- Some Unicode characters transliterate to an empty string, either on purpose or because deunicode does not know about the character.
- Some Unicode characters are unknown and transliterate to "[?]" (or a custom placeholder, or None if you use a chars iterator).
- Many Unicode characters transliterate to multi-character strings. For example, "北" is transliterated as "Bei".
- The transliteration is context-free, and not sophisticated enough to produce proper Chinese or Japanese. Han characters used in multiple languages are mapped to a single Mandarin pronunciation, and will be mostly illegible to Japanese readers. Transliteration can't handle cases where a single character has multiple possible pronunciations.
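The behaviours listed above follow from mapping each character independently through a lookup table. The sketch below is a toy model of that idea, using only the standard library. The functions toy_lookup and toy_translit and the handful of table entries are invented for this illustration; they are not deunicode's implementation or data.

```rust
/// Toy table: returns None for characters it does not know.
/// The entries are hand-picked for illustration, not deunicode's real data.
fn toy_lookup(c: char) -> Option<&'static str> {
    match c {
        'Æ' => Some("AE"),      // one character becomes two
        'é' => Some("e"),
        '北' => Some("Bei"),    // one character becomes a whole syllable
        '\u{200B}' => Some(""), // zero-width space: deliberately dropped
        _ => None,              // unknown to this toy table
    }
}

/// Context-free transliteration: each character is looked up on its own,
/// with no knowledge of its neighbours or of the language of the text.
fn toy_translit(input: &str, placeholder: &str) -> String {
    let mut out = String::with_capacity(input.len());
    for c in input.chars() {
        if c.is_ascii() {
            out.push(c); // ASCII maps to itself
        } else {
            out.push_str(toy_lookup(c).unwrap_or(placeholder));
        }
    }
    out
}

fn main() {
    assert_eq!(toy_translit("Æneid", "[?]"), "AEneid");
    assert_eq!(toy_translit("北", "[?]"), "Bei");
    assert_eq!(toy_translit("a\u{200B}b", "[?]"), "ab");
    assert_eq!(toy_translit("x☃", "[?]"), "x[?]");
}
```

Because the lookup sees one character at a time, it has no way to choose between alternative readings of the same character, which is the limitation the last point describes.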
Unicode data
- Text::Unidecode by Sean M. Burke
- Unicodey by Cal Henderson
- gh emoji
- any_ascii
For a detailed explanation on the rationale behind the original dataset, refer to this article written by Burke in 2001.