28 releases

Uses new Rust 2024

0.11.1	Feb 25, 2025
0.10.1	Dec 14, 2024
0.10.0	Feb 7, 2024
0.9.0	Mar 14, 2023
0.4.0	Jul 26, 2022

#226 in Algorithms

160 downloads per month

MIT license

510KB
878 lines

feruca – Unicode collation in Rust

feruca is a simple, from-scratch implementation of the Unicode Collation Algorithm in Rust. It's current with Unicode v16 (and, correspondingly, CLDR v46.1; see below). The name of the library is a portmanteau of Ferris 🦀 and UCA.

No unsafe is used directly in this library: #![forbid(unsafe_code)]. It relies on the well-vetted bstr to accept input (in the form of either &str or &[u8]), to perform UTF-8 validation, and to generate a list of Unicode scalar values, which can then be processed for collation. The idea is to be tolerant of input that may not be entirely kosher UTF-8.

In describing feruca as a "simple implementation," I have a few things in mind. First, the performance of the library could perhaps still be improved—at least, in comparison to the official C implementation, ucol from icu4c, which is incredibly optimized. I no longer run benchmarks against that implementation, but feruca was always slower, and my guess is that it still is (though not severely). What I do currently benchmark against is the newer first-party implementation belonging to the icu4x project, which is also written in Rust. feruca performs on the order of 4–5x faster than the icu4x collator—while having a much smaller feature set. My priority as a solo dev was to produce a relatively bare-bones implementation that passes the official UCA conformance tests, as well as the tests for the "root collation order" of the Common Locale Data Repository (CLDR).

Second, support for tailoring is minimal (so far). You can choose between two tables of character weights: the Default Unicode Collation Element Table (DUCET), or the CLDR variant thereof. The CLDR table then becomes the starting point for actual collation tailoring based on language/locale. I have added only two tailorings, both intended for use with Arabic-script languages. One of them shifts letters in the Arabic script so that, as a block, they sort before the Latin script. The other tailoring attempts to interleave the Latin and Arabic scripts, so that alif sorts after A and before B; bā’ sorts after B and before C; etc. This is enough for my own work with Persian and Arabic texts. The CLDR table in its unmodified form—i.e., the root collation order—works out-of-the-box for several other languages. I do hope to add more tailorings, but it will be a gradual process, and driven by demand. Realistically, feruca will never have the kind of all-encompassing, flexible support for tailoring that is provided by ICU. My feeling is that there is a place for less sophisticated solutions, with simpler APIs, smaller dependency trees, etc. (If you have thoughts on this, I would be interested in hearing them.)

Apart from locale tailoring, you can choose between the "non-ignorable" and "shifted" strategies for handling variable-weight characters—with the latter being the default. There is also an option to use byte-value comparison as a "tiebreaker" in cases where two strings produce identical UCA sort keys.

Third, this library has effectively just one public method, collate, belonging to a struct, Collator, which sets the options. collate accepts two string references or byte slices, and returns an Ordering value. It is designed to be passed as a comparator to the standard library method sort_by (or sort_unstable_by). See "Example usage" below.

For many people and use cases, UCA sorting will not work properly without being able to specify a locale! Again, however, it is worth emphasizing the usefulness of the CLDR root collation order on its own. When defining a Collator, you can set the default options (see below), which indicate the use of the CLDR table with the "shifted" strategy. I think this is a good starting point.

Example usage

use feruca::Collator;

fn main() {
    let mut uca = [
        "چنگیز",
        "Éloi",
        "Ötzi",
        "Melissa",
        "صدام",
        "Mélissa",
        "Overton",
        "Elrond",
    ];

    let mut naive = uca;
    naive.sort_unstable();

    let mut collator = Collator::default();
    uca.sort_unstable_by(|a, b| collator.collate(a, b));

    for item in uca {
        println!("{item}");
    }
    // Éloi
    // Elrond
    // Melissa
    // Mélissa
    // Ötzi
    // Overton
    // چنگیز
    // صدام

    println!(); // Empty line for clarity

    for item in naive {
        println!("{item}");
    }
    // Elrond
    // Melissa
    // Mélissa
    // Overton
    // Éloi
    // Ötzi
    // صدام
    // چنگیز
}

Conformance

The UCA conformance tests can be run with the command cargo test --release. Please note that, as a result of this library's reliance on bstr for UTF-8 validation, any surrogate code points found in input to the collate method will be converted to the standard "replacement character," U+FFFD. Conformant implementations of the UCA are explicitly allowed to follow this approach. It does mean, however, that a handful of lines (out of hundreds of thousands) in the conformance tests need to be skipped. If you look at the conformance function in the tests module, you will see that any line containing a surrogate code point is passed over.

Bincode

The binary files included with feruca represent hash tables of Unicode data. They are generated in a separate repository, feruca-mapper, and serialized using bincode. You can rebuild them yourself, if you prefer.

Licensing

The text files in the test-data directory are covered by the Unicode License Agreement. Everything else is MIT-licensed.

Dependencies

~0.9–1.3MB
~23K SLoC