11 releases

0.1.10 May 23, 2022
0.1.9 May 23, 2022
0.1.6 Sep 19, 2021
0.1.5 Aug 15, 2021

#960 in Rust patterns


Used in runestr-pancjkv

MIT/Apache

130KB
3K SLoC

rune, RuneStr and RuneString

The user-perceived character type rune, along with its related types and data structures.


Example

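use runestr::RuneString;

fn main() {
    // Three user-perceived characters: "A" + combining acute tone mark,
    // "か" + combining voiced sound mark, and the ideograph U+9508.
    let runestr = RuneString::from_str_lossy("\u{0041}\u{0341}\u{304B}\u{3099}\u{9508}");
    assert_eq!(3, runestr.runes().count());
}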

License

Licensed under either of Apache License, Version 2.0 or MIT license at your option.
Unless you explicitly state otherwise, any contribution intentionally submitted for inclusion in this crate by you, as defined in the Apache-2.0 license, shall be dual licensed as above, without any additional terms or conditions.

lib.rs:

Types and data structures for user-perceived characters.

The rune type represents a user-perceived character. It roughly corresponds to a Unicode grapheme cluster, but with some nice properties. Runes that consist of two or more chars are automatically registered on a per-thread basis. This also means that runes are neither Send nor Sync.
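To illustrate why a per-thread registry rules out Send and Sync, here is a minimal sketch using only the standard library. It is not the crate's implementation: the `Rune` struct, the `intern` function, the `REGISTERED_BASE` constant, and the choice to number registered clusters from 0x110000 are all hypothetical.

```rust
use std::cell::RefCell;
use std::collections::HashMap;
use std::marker::PhantomData;

// Hypothetical: first value above the Unicode scalar range, used here to
// number clusters that do not fit in a single char.
const REGISTERED_BASE: u32 = 0x11_0000;

// Hypothetical stand-in for a rune: a single char value, or an index into
// the current thread's table of multi-char clusters.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
struct Rune {
    value: u32,
    // The raw-pointer marker makes the type neither Send nor Sync, because
    // a registered value only has meaning on the thread that created it.
    _thread_bound: PhantomData<*const ()>,
}

#[derive(Default)]
struct Registry {
    by_text: HashMap<String, u32>,
    by_index: Vec<String>,
}

thread_local! {
    static REGISTRY: RefCell<Registry> = RefCell::new(Registry::default());
}

// Returns the rune for one already-segmented, non-empty cluster.
fn intern(cluster: &str) -> Rune {
    assert!(!cluster.is_empty(), "a cluster has at least one char");
    let mut chars = cluster.chars();
    let value = match (chars.next(), chars.next()) {
        // A single char needs no registration.
        (Some(c), None) => c as u32,
        // Two or more chars: look up or register in this thread's table.
        _ => REGISTRY.with(|cell| {
            let mut reg = cell.borrow_mut();
            if let Some(&known) = reg.by_text.get(cluster) {
                return known;
            }
            let fresh = REGISTERED_BASE + reg.by_index.len() as u32;
            reg.by_index.push(cluster.to_owned());
            reg.by_text.insert(cluster.to_owned(), fresh);
            fresh
        }),
    };
    Rune { value, _thread_bound: PhantomData }
}

fn main() {
    let a = intern("A");
    let ga = intern("\u{304B}\u{3099}");
    println!("{:#x} {:#x}", a.value, ga.value);
}
```

Because each thread owns its own table in this sketch, the same numeric value could name different clusters on different threads, which is why such a value must not cross a thread boundary.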

The RuneStr type, also called a "rune string slice", is a primitive rune-based string type. It is usually seen in its borrowed form, &RuneStr.

Rune string slices are encoded in a special encoding called FSS-UTF, which is a superset of the UTF-8 encoding. This allows all runes to be encoded.
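For readers unfamiliar with FSS-UTF, the sketch below encodes a value with the original FSS-UTF byte patterns, which extend the UTF-8 scheme to sequences of up to six bytes and 31 bits. The function name `encode_fss_utf` is hypothetical, and which values the crate actually assigns to multi-char runes is not stated here; the sketch only shows how a value outside the Unicode scalar range still gets a well-formed byte sequence.

```rust
// Encodes a value of up to 31 bits using the FSS-UTF byte patterns.
// For values that are valid Unicode scalars the output equals UTF-8.
fn encode_fss_utf(value: u32, out: &mut Vec<u8>) {
    assert!(value <= 0x7FFF_FFFF, "FSS-UTF covers 31 bits");
    // Number of continuation bytes and the marker bits of the lead byte.
    let (extra, lead_mark): (u32, u8) = match value {
        0..=0x7F => {
            out.push(value as u8);
            return;
        }
        0x80..=0x7FF => (1, 0xC0),
        0x800..=0xFFFF => (2, 0xE0),
        0x1_0000..=0x1F_FFFF => (3, 0xF0),
        0x20_0000..=0x3FF_FFFF => (4, 0xF8),
        _ => (5, 0xFC),
    };
    // Lead byte: marker bits plus the highest payload bits.
    out.push(lead_mark | (value >> (6 * extra)) as u8);
    // Continuation bytes: 10xxxxxx, six payload bits each, high to low.
    for i in (0..extra).rev() {
        out.push(0x80 | ((value >> (6 * i)) & 0x3F) as u8);
    }
}

fn main() {
    let mut bytes = Vec::new();
    // 0x110000 is one past the last Unicode scalar value, so UTF-8 cannot
    // represent it, but the FSS-UTF patterns can.
    encode_fss_utf(0x11_0000, &mut bytes);
    println!("{:02X?}", bytes);
}
```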

The RuneString type is a growable rune-based string type.

Rune definition

Our rune definition is based on the extended grapheme cluster defined in UAX-29. On top of this, we convert all CJK Compatibility Ideographs to their equivalent IVS form, and then convert the text to NFC form. We also apply a few specific "lossy conversion" rules when necessary. The rules are defined below, and their goal is to make each rune "standalone": when two runes are put next to each other, they won't automatically merge into one larger rune.

Rules for lossy conversion within a rune

  • An orphan abstract character CR (U+000D) is converted into a CR LF sequence.
  • If a Hangul syllable doesn't contain CHOSEONG or JUNGSEONG jamos, the corresponding filler (U+115F, U+1160) is automatically added.
  • An orphan Regional Indicator (U+1F1E6..U+1F1FF) abstract character automatically gets another copy of itself appended, so that it is no longer an orphan.
  • An Extended Pictographic sequence that ends with the abstract character ZWJ (U+200D), optionally preceded by a sequence of continuing characters, gets an extra ZWJ (U+200D) appended to prevent it from merging with the next rune into a larger Extended Pictographic sequence.
  • If no base character is provided, a space (U+0020) character is inserted.
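Two of the rules above, the orphan CR rule and the orphan Regional Indicator rule, can be sketched with the standard library alone. This is an illustration, not the crate's code: the function names are hypothetical, and the sketch works on plain strings rather than on segmented runes. The remaining rules need Unicode property data and are not shown.

```rust
// Orphan CR rule: a CR that is not followed by LF becomes CR LF.
fn fix_orphan_cr(text: &str) -> String {
    let mut out = String::with_capacity(text.len());
    let mut chars = text.chars().peekable();
    while let Some(c) = chars.next() {
        out.push(c);
        if c == '\r' && chars.peek() != Some(&'\n') {
            out.push('\n');
        }
    }
    out
}

fn is_regional_indicator(c: char) -> bool {
    ('\u{1F1E6}'..='\u{1F1FF}').contains(&c)
}

// Orphan Regional Indicator rule: indicators pair up from left to right,
// and an unpaired one gets a copy of itself appended.
fn fix_orphan_regional_indicator(text: &str) -> String {
    let mut out = String::with_capacity(text.len());
    // Holds the first half of a pair that has not been completed yet.
    let mut pending: Option<char> = None;
    for c in text.chars() {
        if is_regional_indicator(c) {
            out.push(c);
            pending = if pending.is_some() { None } else { Some(c) };
        } else {
            // The run of indicators ended; complete an unpaired one.
            if let Some(ri) = pending.take() {
                out.push(ri);
            }
            out.push(c);
        }
    }
    if let Some(ri) = pending {
        out.push(ri);
    }
    out
}

fn main() {
    println!("{:?}", fix_orphan_cr("a\rb"));
    println!("{:?}", fix_orphan_regional_indicator("\u{1F1EF}"));
}
```

In both cases the output can be concatenated with any following text without the fixed-up unit absorbing its neighbor, which is the "standalone" property the rules aim for.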

Dependencies

~1.5MB
~40K SLoC