6 releases (1 stable)
| Version | Released |
|---|---|
| 1.0.0 | Oct 8, 2024 |
| 0.5.0 | Sep 19, 2022 |
| 0.4.0 | Jul 26, 2022 |
| 0.3.0 | Jan 25, 2022 |
| 0.1.0 | Dec 18, 2019 |
unicode-case-mapping
Fast mapping of a char to lowercase, uppercase, titlecase, or its simple case folding in Rust using Unicode 16.0 data.
Usage
use std::num::NonZeroU32;

fn main() {
    // Results are fixed-size arrays of code points, padded with 0.
    assert_eq!(unicode_case_mapping::to_lowercase('İ'), ['i' as u32, 0x0307]);
    assert_eq!(unicode_case_mapping::to_lowercase('ß'), ['ß' as u32, 0]);
    assert_eq!(unicode_case_mapping::to_uppercase('ß'), ['S' as u32, 'S' as u32, 0]);
    assert_eq!(unicode_case_mapping::to_titlecase('ß'), ['S' as u32, 's' as u32, 0]);
    assert_eq!(unicode_case_mapping::to_titlecase('-'), [0; 3]);
    // Simple case folding yields at most one code point.
    assert_eq!(unicode_case_mapping::case_folded('I'), NonZeroU32::new('i' as u32));
    assert_eq!(unicode_case_mapping::case_folded('ß'), None);
    assert_eq!(unicode_case_mapping::case_folded('ẞ'), NonZeroU32::new('ß' as u32));
}
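The mapping functions in the usage example return fixed-size arrays of u32 code points that appear to be padded with zeros, and all zeros when there is no mapping (as with '-'). A minimal sketch of turning such an array back into a String, assuming zero entries are padding. mapped_to_string is a hypothetical helper and not part of the crate, and the arrays are copied from the usage example so the sketch runs without the crate:

```rust
/// Hypothetical helper (not part of unicode-case-mapping): collect the
/// code points before the first zero into a String.
/// Assumes 0 marks unused slots, as the usage example suggests.
fn mapped_to_string(mapped: &[u32]) -> String {
    mapped
        .iter()
        .copied()
        .take_while(|&cp| cp != 0)
        .filter_map(char::from_u32)
        .collect()
}

fn main() {
    // Values copied from the usage example rather than computed by the crate.
    let upper_eszett = ['S' as u32, 'S' as u32, 0];
    assert_eq!(mapped_to_string(&upper_eszett), "SS");

    let lower_dotted_i = ['i' as u32, 0x0307];
    assert_eq!(mapped_to_string(&lower_dotted_i), "i\u{307}");

    // All zeros: nothing to collect.
    assert_eq!(mapped_to_string(&[0u32; 3]), "");
}
```

Working with the raw array avoids allocating when only the code points are needed; the String conversion is for display or comparison.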
Motivation / When to Use
The Rust standard library supplies to_uppercase and to_lowercase methods on char, so you might be wondering why this crate was created or when to use it.
You should almost certainly use the standard library, unless:
- You need support for titlecase conversion or case folding according to the Unicode character database (UCD).
- You need lower level access to the mapping table data, compared to the iterator interface supplied by the standard library.
- You need faster performance than the standard library.
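For comparison, here is a short sketch of the standard library's iterator interface mentioned in the list above. It uses only std; std_upper and std_lower are hypothetical wrapper names introduced for illustration:

```rust
/// Hypothetical wrappers around the standard library's case conversion.
/// std returns an iterator because one char may map to several chars.
fn std_upper(c: char) -> String {
    c.to_uppercase().collect()
}

fn std_lower(c: char) -> String {
    c.to_lowercase().collect()
}

fn main() {
    assert_eq!(std_upper('ß'), "SS");
    assert_eq!(std_lower('İ'), "i\u{307}");
    // Characters without a mapping are returned unchanged.
    assert_eq!(std_upper('-'), "-");
}
```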
An additional motivation for creating this crate was to be able to version the UCD data independently of the Rust version. This lets us ensure that all our Unicode-related crates use the same UCD version.
Performance & Implementation Notes
ucd-generate is used to generate tables.rs. A build script (build.rs) compiles this into a three-level lookup table. The lookup time is constant as it is just indexing into the arrays.
The multi-level approach maps a code point to a block, then to a position within a block, which is then the index of a record describing how to map that code point to lower, upper, and title case. This allows the data to be deduplicated, saving space, whilst also providing fast lookup. The code is parameterised over the block size, which must be a power of 2. The value in the build script is optimal for the data set.
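The multi-level lookup described above can be sketched as follows. This is a toy illustration and not the crate's actual code: it builds the tables at runtime instead of in a build script, covers only ASCII letters with one-to-one mappings, and all names (Tables, Record, ascii_record, SHIFT) are hypothetical.

```rust
use std::collections::HashMap;

// Block size must be a power of two so a shift and a mask can split a code point.
const SHIFT: u32 = 4;
const BLOCK_SIZE: usize = 1 << SHIFT;
const MASK: u32 = (BLOCK_SIZE as u32) - 1;

/// Toy record: simple one-to-one mappings only (0 = no mapping).
#[derive(Clone, Copy, PartialEq, Eq, Hash, Default, Debug)]
struct Record {
    lower: u32,
    upper: u32,
}

/// Assumed toy data source: ASCII letters only.
fn ascii_record(cp: u32) -> Record {
    match cp {
        0x41..=0x5A => Record { lower: cp + 32, upper: 0 },
        0x61..=0x7A => Record { lower: 0, upper: cp - 32 },
        _ => Record::default(),
    }
}

struct Tables {
    block_index: Vec<u16>, // level 1: code point >> SHIFT -> block number
    blocks: Vec<u16>,      // level 2: position within a block -> record index
    records: Vec<Record>,  // level 3: deduplicated records
}

impl Tables {
    /// Build tables for code points in 0..limit, deduplicating blocks and records.
    fn build(limit: u32) -> Tables {
        let mut records = vec![Record::default()];
        let mut record_ids: HashMap<Record, u16> = HashMap::new();
        record_ids.insert(Record::default(), 0);
        let mut blocks: Vec<u16> = Vec::new();
        let mut block_ids: HashMap<Vec<u16>, u16> = HashMap::new();
        let mut block_index = Vec::new();

        for start in (0..limit).step_by(BLOCK_SIZE) {
            let block: Vec<u16> = (start..start + BLOCK_SIZE as u32)
                .map(|cp| {
                    let rec = ascii_record(cp);
                    *record_ids.entry(rec).or_insert_with(|| {
                        records.push(rec);
                        (records.len() - 1) as u16
                    })
                })
                .collect();
            // Identical blocks are stored once and shared via the index.
            let id = *block_ids.entry(block.clone()).or_insert_with(|| {
                blocks.extend_from_slice(&block);
                (blocks.len() / BLOCK_SIZE - 1) as u16
            });
            block_index.push(id);
        }
        Tables { block_index, blocks, records }
    }

    /// Constant-time lookup: three array indexing steps, no searching.
    fn lookup(&self, cp: u32) -> Record {
        match self.block_index.get((cp >> SHIFT) as usize) {
            Some(&block) => {
                let pos = block as usize * BLOCK_SIZE + (cp & MASK) as usize;
                self.records[self.blocks[pos] as usize]
            }
            None => Record::default(), // outside the toy range: no mapping
        }
    }
}

fn main() {
    let tables = Tables::build(0x100);
    assert_eq!(tables.lookup('A' as u32).lower, 'a' as u32);
    assert_eq!(tables.lookup('z' as u32).upper, 'Z' as u32);
    assert_eq!(tables.lookup('-' as u32), Record::default());
    println!(
        "{} blocks indexed, {} stored after deduplication",
        tables.block_index.len(),
        tables.blocks.len() / BLOCK_SIZE
    );
}
```

In this toy range most blocks contain no mappings at all, so they collapse into a single shared block. That is the space saving that deduplication provides.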
This approach trades off some space for faster lookups. The tables take up about 101KiB. Benchmarks (run with cargo bench) show this approach to be ~5–10× faster than the binary search approach used in the Rust standard library.
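For contrast, a binary search lookup over a sorted table might look like the sketch below. This is not the standard library's actual code. LOWER_TABLE and lower_by_binary_search are hypothetical names, and the table holds only a handful of entries for illustration. Each lookup costs a number of comparisons that grows with the logarithm of the table size, whereas the multi-level table uses a fixed number of array indexing steps.

```rust
/// Toy table of (uppercase, lowercase) pairs, sorted by the first element.
const LOWER_TABLE: &[(char, char)] = &[
    ('A', 'a'),
    ('B', 'b'),
    ('C', 'c'),
    ('Ä', 'ä'),
    ('Ω', 'ω'),
];

/// Find the lowercase mapping by binary search; None if the char is not in the table.
fn lower_by_binary_search(c: char) -> Option<char> {
    LOWER_TABLE
        .binary_search_by_key(&c, |&(upper, _)| upper)
        .ok()
        .map(|i| LOWER_TABLE[i].1)
}

fn main() {
    assert_eq!(lower_by_binary_search('Ω'), Some('ω'));
    assert_eq!(lower_by_binary_search('x'), None);
}
```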
It's possible there are further optimisations that could be made to eliminate some runs of repeated values in the first level array.
Regenerating tables.rs
- Regenerate with yeslogic-ucd-generate (run make).
- Add/restore #[allow(dead_code)] to each table to prevent warnings.