#unicode #skeleton #confusable #unicode-text #unicode-characters #text

unicode_skeleton

This crate detects Unicode strings that look nearly identical once rendered but do not compare as equal. It defines "confusable" and "skeleton" based on Unicode Technical Standard #39 (UTS #39).

2 releases

Uses old Rust 2015

0.1.1 Oct 8, 2017
0.1.0 Oct 8, 2017

#1814 in Text processing


Used in 2 crates

MIT/Apache

160KB
105 lines

Unicode character "confusable" detection and "skeleton" computation, as specified by Unicode Technical Standard #39 (UTS #39). These functions are for working with strings that appear nearly identical once rendered but do not compare as equal.

Documentation

extern crate unicode_skeleton;

use unicode_skeleton::{UnicodeSkeleton, confusable};

fn main() {
    assert_eq!("𝔭𝒶ỿ𝕡𝕒ℓ".skeleton_chars().collect::<String>(), "paypal");
    assert!(confusable("ℝ𝓊𝓈𝓉", "Rust"));
}

crates.io

Add the following to your Cargo.toml to use it:

[dependencies]
unicode_skeleton = "0.1.0"

lib.rs:

Transforms a Unicode string by replacing unusual characters with similar-looking common characters, as specified by Unicode Technical Standard #39 (UTS #39). For example, "ℝ𝓊𝓈𝓉" will be transformed to "Rust". This simplified string is called the "skeleton".

use unicode_skeleton::UnicodeSkeleton;

assert_eq!("ℝ𝓊𝓈𝓉".skeleton_chars().collect::<String>(), "Rust");

Strings are considered "confusable" if they have the same skeleton. For example, "ℝ𝓊𝓈𝓉" and "Rust" are confusable.

use unicode_skeleton::confusable;

assert!(confusable("ℝ𝓊𝓈𝓉", "Rust"));
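
To make the relationship between the two definitions concrete, here is a minimal standalone sketch of the idea, not the crate's implementation. It assumes a hypothetical four-entry lookup table in place of the full Unicode confusables data, and it skips the normalization steps that the real skeleton algorithm applies. The names toy_table, toy_skeleton and toy_confusable are invented for this illustration and are not part of unicode_skeleton.

```rust
use std::collections::HashMap;

/// Hypothetical stand-in for the confusables data: maps a look-alike
/// character to its common prototype. Only four entries, for illustration.
fn toy_table() -> HashMap<char, &'static str> {
    let mut table = HashMap::new();
    table.insert('\u{211D}', "R"); // ℝ DOUBLE-STRUCK CAPITAL R
    table.insert('\u{1D4CA}', "u"); // 𝓊 MATHEMATICAL SCRIPT SMALL U
    table.insert('\u{1D4C8}', "s"); // 𝓈 MATHEMATICAL SCRIPT SMALL S
    table.insert('\u{1D4C9}', "t"); // 𝓉 MATHEMATICAL SCRIPT SMALL T
    table
}

/// Replace every character found in the table with its prototype and pass
/// all other characters through unchanged. Normalization is omitted here.
fn toy_skeleton(input: &str, table: &HashMap<char, &'static str>) -> String {
    let mut out = String::new();
    for c in input.chars() {
        match table.get(&c) {
            Some(prototype) => out.push_str(prototype),
            None => out.push(c),
        }
    }
    out
}

/// Two strings are treated as confusable when their skeletons are equal.
fn toy_confusable(a: &str, b: &str, table: &HashMap<char, &'static str>) -> bool {
    toy_skeleton(a, table) == toy_skeleton(b, table)
}

fn main() {
    let table = toy_table();
    let fancy = "\u{211D}\u{1D4CA}\u{1D4C8}\u{1D4C9}"; // "ℝ𝓊𝓈𝓉"

    assert_eq!(toy_skeleton(fancy, &table), "Rust");
    assert!(toy_confusable(fancy, "Rust", &table));
    // Different skeletons, so not confusable.
    assert!(!toy_confusable("Rust", "Dust", &table));
}
```

The point of the sketch is that confusability reduces to an equality check on skeletons, so a skeleton can be computed once and stored or compared like any other string.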

The translation to skeletons is based on Unicode Security Mechanisms for UTS #39, version 10.0.0.

Dependencies

~1MB
~34K SLoC