app yeslogic-ucd-generate

A program for generating packed representations of the Unicode character database that can be efficiently searched with support for additional tables

7 releases (4 breaking)

0.7.0 Oct 10, 2024
0.6.0 Sep 16, 2022
0.5.0 Jun 4, 2021
0.4.2 Nov 17, 2020
0.3.0 Jan 9, 2020

MIT/Apache

yeslogic-ucd-generate

A command line tool to generate Unicode tables in Rust source code. Tables can typically be generated in one of three formats: a sorted sequence of character ranges, a finite state transducer or a compressed trie. Full support for name canonicalization is also provided.


This version of ucd-generate adds the following on top of BurntSushi's version:

  • joining-group sub-command. #32

Upstream README follows:


Installation

Since this is mostly intended as a developer tool for use while writing Rust programs, the principal method of installation is from crates.io:

$ cargo install yeslogic-ucd-generate
$ yeslogic-ucd-generate --help

Example

This somewhat arbitrary example shows the output of generating tables for three properties, and representing them as normal Rust character literal ranges.

To run the example, you need to download the Unicode Character Database (UCD):

$ mkdir /tmp/ucd-15.0.0
$ cd /tmp/ucd-15.0.0
$ curl -LO https://www.unicode.org/Public/zipped/15.0.0/UCD.zip
$ unzip UCD.zip

Note that prior to version 13.0.0, the emoji/emoji-data.txt file was distributed separately from the UCD bundle. For these versions, you may need to download this file from https://unicode.org/Public/emoji in order to generate certain tables.

Now tell ucd-generate what you want and point it to the directory created above:

$ ucd-generate property-bool /tmp/ucd-15.0.0 --include Hyphen,Dash,Quotation_Mark --chars

And the output, which is valid Rust source code:

// DO NOT EDIT THIS FILE. IT WAS AUTOMATICALLY GENERATED BY:
//
//   ucd-generate property-bool /tmp/ucd-15.0.0 --include Hyphen,Dash,Quotation_Mark --chars
//
// Unicode version: 15.0.0.
//
// ucd-generate 0.2.10 is available on crates.io.

pub const BY_NAME: &'static [(&'static str, &'static [(char, char)])] = &[
  ("Dash", DASH), ("Hyphen", HYPHEN), ("Quotation_Mark", QUOTATION_MARK),
];

pub const DASH: &'static [(char, char)] = &[
  ('-', '-'), ('֊', '֊'), ('־', '־'), ('᐀', '᐀'), ('᠆', '᠆'),
  ('‐', '―'), ('⁓', '⁓'), ('⁻', '⁻'), ('₋', '₋'),
  ('−', '−'), ('⸗', '⸗'), ('⸚', '⸚'), ('⸺', '⸻'),
  ('⹀', '⹀'), ('\u{2e5d}', '\u{2e5d}'), ('〜', '〜'), ('〰', '〰'),
  ('゠', '゠'), ('︱', '︲'), ('﹘', '﹘'), ('﹣', '﹣'),
  ('－', '－'), ('𐺭', '𐺭'),
];

pub const HYPHEN: &'static [(char, char)] = &[
  ('-', '-'), ('\u{ad}', '\u{ad}'), ('֊', '֊'), ('᠆', '᠆'),
  ('‐', '‑'), ('⸗', '⸗'), ('・', '・'), ('﹣', '﹣'),
  ('－', '－'), ('･', '･'),
];

pub const QUOTATION_MARK: &'static [(char, char)] = &[
  ('"', '"'), ('\'', '\''), ('«', '«'), ('»', '»'), ('‘', '‟'),
  ('‹', '›'), ('⹂', '⹂'), ('「', '』'), ('〝', '〟'),
  ('﹁', '﹄'), ('＂', '＂'), ('＇', '＇'), ('｢', '｣'),
];
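
As a rough illustration of how tables in this shape might be consumed, the sketch below binary-searches a sorted range table for a character and looks up a table by property name. The tables are a hand-written subset in the same layout as the output above, and the helper names contains and table_for are hypothetical; they are not part of ucd-generate's output.

```rust
use std::cmp::Ordering;

/// Hand-written excerpt in the same shape as the generated tables.
/// Real tables would come from ucd-generate; this subset is illustrative only.
pub const BY_NAME: &[(&str, &[(char, char)])] = &[
    ("Dash", DASH),
    ("Hyphen", HYPHEN),
];

pub const DASH: &[(char, char)] = &[
    ('-', '-'),
    ('\u{58a}', '\u{58a}'),
    ('\u{2010}', '\u{2015}'),
];

pub const HYPHEN: &[(char, char)] = &[
    ('-', '-'),
    ('\u{ad}', '\u{ad}'),
    ('\u{2010}', '\u{2011}'),
];

/// Returns true if `c` lies inside any inclusive range of `table`.
/// Assumes the ranges are sorted and do not overlap.
pub fn contains(table: &[(char, char)], c: char) -> bool {
    table
        .binary_search_by(|&(start, end)| {
            if c < start {
                // This range sits entirely above `c`.
                Ordering::Greater
            } else if c > end {
                // This range sits entirely below `c`.
                Ordering::Less
            } else {
                Ordering::Equal
            }
        })
        .is_ok()
}

/// Finds a table by exact property name.
/// Assumes BY_NAME is sorted by name, as in the output above.
pub fn table_for(name: &str) -> Option<&'static [(char, char)]> {
    BY_NAME
        .binary_search_by_key(&name, |&(n, _)| n)
        .ok()
        .map(|i| BY_NAME[i].1)
}
```

With these helpers, contains(DASH, '\u{2013}') checks whether EN DASH falls in the excerpted Dash table, and table_for("Hyphen") returns the Hyphen table if the name is present. Both lookups are logarithmic in the number of entries, which is the main appeal of the sorted-range format.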

DFA serialization

Prior to ucd-generate 0.3.0, the sub-commands dfa and regex could be used to build fully compiled DFAs, serialize them to disk and generate Rust code for deserializing them. This functionality was removed in 0.3.0 and moved to regex-cli.

Contributing

The ucd-generate tool doesn't have any specific design goals, other than to collect Unicode table generation tasks. If you need ucd-generate to do something and it's reasonably straightforward to add, then submitting a PR would be great. Otherwise, file an issue and we can discuss.

Alternatives

The primary alternative is ICU4X. If you have sophisticated Unicode requirements, it is almost certainly what you should be using.

It's beyond the scope of this README to do a full comparison between ICU4X and ucd-generate, but I think the shortest way to describe it is that ucd-generate is simplistic, with all the positive and negative connotations of that word.

Future work

This tool is by no means exhaustive. In fact, it's not even close to exhaustive, and it may never be. For the most part, the intent of this tool is to collect virtually any kind of Unicode generation task. In theory, this would ideally replace the hodgepodge of Python programs responsible for this task today in various Unicode crates.

Here are some examples of future work that would be welcome:

  • More support for parsing things in the UCD.
  • More generation tasks based on things in the UCD.
  • More output formats, especially for reducing binary size.

Sub-crates

This repository is home to three sub-crates:

  • ucd-parse - A crate for parsing UCD files into structured data.
  • ucd-trie - Auxiliary type for handling the trie set table format emitted by ucd-generate. This crate has a no_std mode.
  • ucd-util - A purposely small crate for Unicode auxiliary functions. This includes things like symbol or character name canonicalization, ideograph name generation and helper functions for searching property name and value tables.
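
To give a feel for what symbol name canonicalization involves, here is a deliberately simplified sketch of loose name matching: it lowercases a name and drops spaces, underscores and hyphens. The function loose_normalize is hypothetical and is not ucd-util's API; it also omits the rule in Unicode's loose matching guidelines about ignoring an initial "is" prefix, along with other edge cases a real implementation would need to handle.

```rust
/// Simplified loose normalization of a property name or value.
/// Hypothetical helper for illustration; not the ucd-util implementation.
/// Drops spaces, underscores and hyphens, then lowercases what remains.
pub fn loose_normalize(name: &str) -> String {
    name.chars()
        .filter(|c| !matches!(c, ' ' | '_' | '-'))
        .flat_map(|c| c.to_lowercase())
        .collect()
}

/// Compares two names under the simplified normalization above.
pub fn loose_eq(a: &str, b: &str) -> bool {
    loose_normalize(a) == loose_normalize(b)
}
```

Under this sketch, spellings such as Quotation_Mark, quotation mark and QUOTATION-MARK all compare equal, which is the kind of tolerance needed when searching property name and value tables.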

License

This project is licensed under either of the MIT or Apache licenses.
