11 releases
Version | Released |
---|---|
0.1.10 | May 23, 2022 |
0.1.9 | May 23, 2022 |
0.1.6 | Sep 19, 2021 |
0.1.5 | Aug 15, 2021 |
#960 in Rust patterns
Used in runestr-pancjkv
130KB, 3K SLoC
rune, RuneStr and RuneString
The user-perceived character type rune and its related types and data structures.
Example
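use runestr::RuneString;

fn main() {
    // Five chars forming three runes: "A" + combining tone mark,
    // "か" + combining voiced sound mark, and a single ideograph.
    let runestr = RuneString::from_str_lossy("\u{0041}\u{0341}\u{304B}\u{3099}\u{9508}");
    assert_eq!(3, runestr.runes().count());
}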
License
Licensed under either of Apache License, Version 2.0 or MIT license at your option.

Unless you explicitly state otherwise, any contribution intentionally submitted for inclusion in this crate by you, as defined in the Apache-2.0 license, shall be dual licensed as above, without any additional terms or conditions.
lib.rs:
Types and data structures related to user-perceived characters.
The rune type represents a user-perceived character. It roughly corresponds to a Unicode grapheme cluster, but with some nice properties. Runes that consist of two or more chars are automatically registered on a per-thread basis. This also means that runes are neither Send nor Sync.
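To make the per-thread registration concrete, here is a minimal sketch of a thread-local interner written with the standard library only. It is not the crate's implementation: the names Registry, intern and resolve are hypothetical, and the assumption is simply that a multi-char rune is stored once per thread and referred to by a small id.

```rust
use std::cell::RefCell;
use std::collections::HashMap;

// Hypothetical per-thread registry; the crate's real data structure may differ.
#[derive(Default)]
struct Registry {
    ids: HashMap<String, u32>,
    clusters: Vec<String>,
}

thread_local! {
    // Every thread gets its own, initially empty, registry.
    static REGISTRY: RefCell<Registry> = RefCell::new(Registry::default());
}

/// Returns the id of a multi-char cluster, registering it on first use.
fn intern(cluster: &str) -> u32 {
    REGISTRY.with(|r| {
        let mut r = r.borrow_mut();
        if let Some(&id) = r.ids.get(cluster) {
            return id;
        }
        let id = r.clusters.len() as u32;
        r.ids.insert(cluster.to_owned(), id);
        r.clusters.push(cluster.to_owned());
        id
    })
}

/// Looks an id up in the registry of the calling thread.
fn resolve(id: u32) -> Option<String> {
    REGISTRY.with(|r| r.borrow().clusters.get(id as usize).cloned())
}
```

Under this assumption an id is only meaningful on the thread that registered it, which is why a handle wrapping such an id should be neither Send nor Sync.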
The RuneStr type, also called a "rune string slice", is a primitive rune-based string type.
It is usually seen in its borrowed form, &RuneStr.
Rune string slices are encoded in a special encoding called FSS-UTF, which is a superset of the UTF-8 encoding.
This allows all runes to be encoded.
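As an illustration of why a superset of UTF-8 leaves room for extra values, the sketch below encodes a 31-bit value with the original 1-to-6-byte FSS-UTF layout. This assumes the crate's "FSS-UTF" refers to that layout; the function name encode_fss_utf is hypothetical, and how the crate maps runes to values is not shown here.

```rust
/// Encodes a value of up to 31 bits using the 1-to-6-byte FSS-UTF layout.
/// For Unicode scalar values the output matches UTF-8.
/// Surrogate values are not rejected in this sketch.
fn encode_fss_utf(value: u32) -> Option<Vec<u8>> {
    let len = match value {
        0..=0x7F => return Some(vec![value as u8]),
        0x80..=0x7FF => 2,
        0x800..=0xFFFF => 3,
        0x1_0000..=0x1F_FFFF => 4,
        0x20_0000..=0x3FF_FFFF => 5,
        0x400_0000..=0x7FFF_FFFF => 6,
        _ => return None, // does not fit in 31 bits
    };
    let mut bytes = vec![0u8; len];
    let mut rest = value;
    // Fill the continuation bytes (10xxxxxx) from the last one backwards.
    for b in bytes[1..].iter_mut().rev() {
        *b = 0x80 | (rest & 0x3F) as u8;
        rest >>= 6;
    }
    // Lead byte: `len` one-bits, a zero bit, then the remaining payload bits.
    let lead_marker = (0xFF00u16 >> len) as u8;
    bytes[0] = lead_marker | rest as u8;
    Some(bytes)
}
```

Values above U+10FFFF, which plain UTF-8 never produces, still get a well-formed byte sequence in this layout; for example, 0x110000 encodes as F4 90 80 80.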
The RuneString type is a growable rune-based string type.
Rune definition
Our rune definition is based on the extended grapheme cluster defined in UAX-29. On top of this, we convert all CJK Compatibility Ideographs to their equivalent IVS form, and then convert the text to NFC form. We also apply a few specific "lossy conversion" rules when necessary. The rules are defined below, and their goal is to make each rune "standalone": when two runes are put next to each other, they won't automatically merge into one larger rune.
Rules for lossy conversion within a rune
- An orphan abstract character CR (U+000D) is converted into a CR LF sequence.
- If a Hangul syllable doesn't contain CHOSEONG or JUNGSEONG jamos, the corresponding filler (U+115F, U+1160) is automatically added.
- An orphan Regional Indicator (U+1F1E6..U+1F1FF) abstract character automatically gets another copy of itself appended, so that it is no longer orphaned.
- An Extended Pictographic sequence that ends with the abstract character ZWJ (U+200D), optionally preceded by a sequence of continuing characters, gets an extra ZWJ (U+200D) abstract character to prevent it from merging with the next rune into a larger Extended Pictographic sequence.
- If no base character is provided, a space (U+0020) character is inserted.
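Two of the rules above, the orphan CR and the orphan Regional Indicator, can be sketched without any Unicode property tables. The function below is a hypothetical illustration (make_standalone is not part of the crate's API) and assumes its input is a single, already-segmented cluster; the Hangul, ZWJ and missing-base-character rules are left out because they need property data.

```rust
/// Applies the CR and Regional Indicator rules to one cluster.
/// Any other input is returned unchanged in this sketch.
fn make_standalone(cluster: &str) -> String {
    let mut chars = cluster.chars();
    match (chars.next(), chars.next()) {
        // An orphan CR becomes a CR LF sequence.
        (Some('\r'), None) => "\r\n".to_owned(),
        // An orphan Regional Indicator gets a second copy of itself.
        (Some(c @ '\u{1F1E6}'..='\u{1F1FF}'), None) => [c, c].iter().collect(),
        // Everything else, including CR LF and indicator pairs, is kept.
        _ => cluster.to_owned(),
    }
}
```

After the conversion, a following LF or Regional Indicator can no longer attach to the rune, which is the "standalone" property the rules aim for.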
Dependencies
~1.5MB
~40K SLoC