2 unstable releases
Uses new Rust 2024
| Version | Released |
|---|---|
| 0.2.0 | Sep 6, 2025 |
| 0.1.0 | Sep 6, 2025 |
#683 in Text processing
535KB, 10K SLoC
GraphemeMachine: Grapheme Cluster Segmentation state machine in Rust
This is a Rust implementation of the Grapheme Cluster portion of UAX #29: Unicode Text Segmentation that prioritizes streaming-friendliness and simplicity.
This library implements the segmentation algorithm as of Unicode 16.0.0, using the character database tables from that release.
For more information, refer to the API documentation.
lib.rs:
An implementation of the Grapheme Cluster portion of UAX #29: Unicode Text Segmentation that prioritizes streaming-friendliness and simplicity.
GraphemeMachine is the main type in this library. Construct an object
of that type and then feed it characters from a stream one at a time, and
in return it will tell you for each new character whether it should be
treated as an extension of the current grapheme cluster or the beginning
of a new one. That's all there is to it!
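The calling pattern described above can be sketched as follows. This is a minimal illustration and not this crate's actual API: `ToyMachine`, `Boundary`, `next_char`, and `segment` are hypothetical names, and the toy rules cover only CR LF pairs and the combining diacritical marks block (U+0300 to U+036F), which is far less than UAX #29 requires. Refer to the API documentation for the real GraphemeMachine methods.

```rust
/// Hypothetical result type; NOT this crate's API.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum Boundary {
    /// The character begins a new grapheme cluster.
    Start,
    /// The character extends the current grapheme cluster.
    Extend,
}

/// Hypothetical toy state machine that remembers only the previous character.
#[derive(Default)]
struct ToyMachine {
    prev: Option<char>,
}

impl ToyMachine {
    /// Feed one character and learn whether it starts or extends a cluster.
    fn next_char(&mut self, c: char) -> Boundary {
        let extends = match self.prev {
            // The very first character always starts a cluster.
            None => false,
            // Simplified rule: after CR, only LF stays in the same cluster.
            Some('\r') => c == '\n',
            // Simplified rule: combining marks extend, except right after LF.
            Some(p) => p != '\n' && ('\u{0300}'..='\u{036F}').contains(&c),
        };
        self.prev = Some(c);
        if extends { Boundary::Extend } else { Boundary::Start }
    }
}

/// Split a character stream into clusters, buffering only the current one.
fn segment(input: impl Iterator<Item = char>) -> Vec<String> {
    let mut machine = ToyMachine::default();
    let mut clusters = Vec::new();
    let mut current = String::new();
    for c in input {
        // A new cluster begins, so the buffered one is now complete.
        if machine.next_char(c) == Boundary::Start && !current.is_empty() {
            clusters.push(std::mem::take(&mut current));
        }
        current.push(c);
    }
    // End of stream: flush whatever is still buffered.
    if !current.is_empty() {
        clusters.push(current);
    }
    clusters
}

fn main() {
    // "e" + U+0301 (combining acute accent) stay together, as do CR and LF.
    for cluster in segment("e\u{0301}a\r\nb".chars()) {
        println!("{:?}", cluster);
    }
}
```

The caller owns the only buffer in this sketch; the machine itself keeps a single character of state.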
The canonical Rust library for UAX #29 is
unicode_segmentation,
and so that's actually probably what you should use in most cases. This
library has the following main distinctions (as of
unicode_segmentation v1.12.0):
- The primary entry point for grapheme clusters in unicode_segmentation is Graphemes, which expects the entire text to be in memory as a single buffer.

  The library also offers GraphemeCursor for working with non-contiguous buffers, but it has a rather challenging API and is difficult to use in a completely streaming manner, with the caller required to sometimes provide earlier context to help it make a decision.

  By contrast, GraphemeMachine in this library is a finite state machine that is advanced one character at a time, with no requirement for the caller to do any buffering at all. Of course in practice it's likely that a normal caller will need to at least buffer the current grapheme cluster so it can be used once finally split, but how to manage that is left entirely up to the caller.

  For example, a caller could decide that it only cares about grapheme clusters up to some reasonable maximum length, after which it will just assume malicious or corrupt input and use the Unicode replacement character instead. The GraphemeMachine can still allow that caller to find the end of that overlong grapheme cluster and begin consuming the next one even though the caller is no longer including any new characters into its buffer.

- unicode_segmentation finds the relevant Unicode character properties for incoming characters using binary search over its internal tables, after converting the character into a Rust char value.

  GraphemeMachine instead prefers to work with UTF-8 encoded characters as represented by u8char, which can be more cheaply extracted from and appended to Rust strings. The character property lookup is done using a trie based on the UTF-8 byte sequence, and so is potentially faster when you're chomping UTF-8 sequences from a str buffer one at a time.

  (That's not necessarily true, though. Measure it yourself with the text you want to segment if performance is important to you!)

- Although GraphemeMachine can work with char and u8char values representing specific characters, the segmentation algorithm is actually defined in terms of groups of characters that share similar properties.

  This library exposes those categories as part of its public API using CharProperties, GCBProperty, and InCBProperty, and so it could be useful purely as a character property lookup library even if you don't use GraphemeMachine, or you could even choose to use your own tailored character property tables and pass CharProperties values directly to a GraphemeMachine object.
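The length-capping idea from the first point above can be sketched as follows. Everything here is an assumption made for illustration: `MarksOnlyMachine`, `segment_capped`, `finish`, and the `MAX_CLUSTER_CHARS` value are hypothetical and are not part of this crate, and the toy boundary rule treats only combining diacritical marks (U+0300 to U+036F) as extending characters.

```rust
/// Hypothetical toy machine; NOT this crate's API.
#[derive(Default)]
struct MarksOnlyMachine {
    started: bool,
}

impl MarksOnlyMachine {
    /// Returns true when `c` begins a new cluster.
    fn starts_new_cluster(&mut self, c: char) -> bool {
        let is_mark = ('\u{0300}'..='\u{036F}').contains(&c);
        // The first character of the stream always starts a cluster.
        let starts = !self.started || !is_mark;
        self.started = true;
        starts
    }
}

/// Assumed cap on how many characters the caller is willing to buffer.
const MAX_CLUSTER_CHARS: usize = 4;

/// Emit the buffered cluster, or U+FFFD if the cluster exceeded the cap.
fn finish(current: &mut String, len: usize) -> String {
    let cluster = std::mem::take(current);
    if len > MAX_CLUSTER_CHARS {
        '\u{FFFD}'.to_string()
    } else {
        cluster
    }
}

fn segment_capped(input: impl Iterator<Item = char>) -> Vec<String> {
    let mut machine = MarksOnlyMachine::default();
    let mut out = Vec::new();
    let mut current = String::new();
    // Characters seen in the current cluster, whether buffered or not.
    let mut len = 0usize;
    for c in input {
        if machine.starts_new_cluster(c) {
            if len > 0 {
                out.push(finish(&mut current, len));
            }
            len = 0;
        }
        len += 1;
        // Past the cap we stop buffering, but the machine is still fed every
        // character so the end of the overlong cluster is found correctly.
        if len <= MAX_CLUSTER_CHARS {
            current.push(c);
        }
    }
    if len > 0 {
        out.push(finish(&mut current, len));
    }
    out
}

fn main() {
    // One base letter followed by six combining marks exceeds the cap.
    let input = format!("a{}b", "\u{0301}".repeat(6));
    for cluster in segment_capped(input.chars()) {
        println!("{:?}", cluster);
    }
}
```

The caller's memory use stays bounded by the cap regardless of how long a hostile input makes a single cluster.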
Unless you have a good reason to prefer this library though, it's probably
better to use
unicode_segmentation
because it's widely-used in the Rust community, well-maintained by an
established team (whereas this library has only a single,
easily-distracted author), and probably not subject to the important caveat
described in the following section.
An important caveat
The author originally wrote the code and lookup tables in this library as part of another project, and then copied them into several other projects that needed grapheme cluster segmentation. This library is the result of finally getting around to splitting them out into a standalone unit for release.
Unfortunately the code that generated the trie used for character property
lookup seems to be missing, and so this library will probably be tethered
to Unicode 16.0.0 indefinitely unless the author somehow gets inspired
to recreate that generation program. 😖 If staying up-to-date with new
Unicode versions is important to you then you should probably use
unicode_segmentation
instead.
It would in principle be possible to use a property lookup table maintained
outside of this crate and then produce CharProperties values to pass
into a GraphemeMachine without using this library's lookup tables at
all, though I expect few would be motivated to do that.
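As a rough sketch of that separation, the example below keeps the property table apart from the state machine and feeds the machine property values instead of characters. All names here (`Prop`, `TABLE`, `lookup`, `PropMachine`) are hypothetical and do not reflect how CharProperties values are actually constructed or consumed; the table is a tiny illustrative subset and the machine handles only simplified versions of the CR LF, break-after-newline, and Extend rules.

```rust
use std::cmp::Ordering;

/// Hypothetical, heavily reduced property set; NOT this crate's CharProperties.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum Prop {
    Other,
    Cr,
    Lf,
    Extend,
}

/// Caller-maintained table of sorted, non-overlapping inclusive ranges.
/// Only a small illustrative subset of code points is listed.
const TABLE: &[(u32, u32, Prop)] = &[
    (0x000A, 0x000A, Prop::Lf),
    (0x000D, 0x000D, Prop::Cr),
    (0x0300, 0x036F, Prop::Extend), // combining diacritical marks
    (0xFE00, 0xFE0F, Prop::Extend), // variation selectors
];

/// Binary search over the range table; unlisted code points are `Other`.
fn lookup(c: char) -> Prop {
    let cp = c as u32;
    let found = TABLE.binary_search_by(|&(lo, hi, _)| {
        if hi < cp {
            Ordering::Less
        } else if lo > cp {
            Ordering::Greater
        } else {
            Ordering::Equal
        }
    });
    match found {
        Ok(i) => TABLE[i].2,
        Err(_) => Prop::Other,
    }
}

/// Toy machine driven purely by property values, never by characters.
#[derive(Default)]
struct PropMachine {
    prev: Option<Prop>,
}

impl PropMachine {
    /// Returns true when a character with these properties starts a new cluster.
    fn starts_new_cluster(&mut self, next: Prop) -> bool {
        let joins = match (self.prev, next) {
            (None, _) => false,
            (Some(Prop::Cr), Prop::Lf) => true,      // CR LF stays together
            (Some(Prop::Cr | Prop::Lf), _) => false, // always break after a newline
            (Some(_), Prop::Extend) => true,         // extending characters join
            _ => false,
        };
        self.prev = Some(next);
        !joins
    }
}

fn main() {
    let mut machine = PropMachine::default();
    for c in "e\u{0301}\r\nx".chars() {
        let prop = lookup(c);
        let starts = machine.starts_new_cluster(prop);
        println!("{:?} {:?} starts_new={}", c, prop, starts);
    }
}
```

Because the machine never sees a character, swapping in a table generated from a newer Unicode release would not require touching the segmentation logic in a design like this.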