3 releases (breaking)
0.9.0 | Mar 3, 2019 |
---|---|
0.8.0 | Jan 2, 2019 |
0.7.0 | Feb 7, 2018 |
#787 in Text processing
63,994 downloads per month
Used in 69 crates
(16 directly)
79KB
1K
SLoC
UNIC: Unicode and Internationalization Crates for Rust
https://github.com/open-i18n/rust-unic
UNIC is a project to develop components for the Rust programming language to provide high-quality and easy-to-use crates for Unicode and Internationalization data and algorithms. In other words, it's like ICU for Rust, written completely in Rust, mostly in safe mode, but also benefiting from performance gains of unsafe mode when possible.
See UNIC Changelog for latest release details.
Project Goal
The goal for UNIC is to provide access to all levels of Unicode and Internationalization functionalities, starting from Unicode character properties, to Unicode algorithms for processing text, and more advanced (locale-based) processes based on Unicode Common Locale Data Repository (CLDR).
Other standards and best practices, like IETF RFCs, are also implemented, as needed by Unicode/CLDR components, or common demand.
Project Status
At the moment UNIC is under heavy development: the API is updated frequently on
master
branch, and there will be API breakage between each 0.x
release.
Please see open issues for changes
planed.
We expect to have the 1.0
version released in 2018 and maintain a stable API
afterwards, with possibly one or two API updates per year for the first couple
of years.
Design Goals
-
Primary goal of UNIC is to provide reliable functionality by way of easy-to-use API. Therefore, new components are added may not be well-optimized for performance, but will have enough tests to show conformance to the standard, and examples to show users how they can be used to address common needs.
-
Next major goal for UNIC components is performance and low binary and memory footprints. Specially, optimizing runtime for ASCII and other common cases will encourage adaptation without fear of slowing down regular development processes.
-
Components are guaranteed, to the extend possible, to provide consistent data and algorithms. Cross-component tests are used to catch any inconsistency between implementations, without slowing down development processes.
Components and their Organization
UNIC Components have a hierarchical organization, starting from the
unic
root, containing the major components. Each major component, in
turn, may host some minor components.
API of major components are designed for the end-users of the libraries, and are expected to be extensively documented and accompanies with code examples.
In contrast to major components, minor components act as providers of data and algorithms for the higher-level, and their API is expected to be more performing, and possibly providing multiple ways of accessing the data.
The UNIC Super-Crate
The unic
super-crate is a collection of all
UNIC (major) components, providing an easy way of access to all functionalities,
when all or many are needed, instead of importing components one-by-one. This
crate ensures all components imported are compatible in algorithms and
consistent data-wise.
Main code examples and cross-component integration tests are implemented under this crate.
Major Components
-
unic-char
: Unicode Character Tools. -
unic-normal
: Unicode Normalization Forms (UAX#15). -
unic-segment
: Unicode Text Segmentation Algorithms (UAX#29). -
unic-emoji
: Unicode Emoji (UTS#51).
Applications
unic-cli
: UNIC Command-Line Tools
Code Organization: Combined Repository
Some of the reasons to have a combined repository these components are:
-
Faster development. Implementing new Unicode/i18n components very often depends on other (lower level) components, which in turn may need adjustments—expose new API, fix bugs, etc—that can be developed, tested and reviewed in less cycles and shorter times.
-
Implementation Integrity. Multiple dependencies on other components mean that the components need to, to some level, agree with each other. Many Unicode algorithms, composed from smaller ones, assume that all parts of the algorithm is using the same version of Unicode data. Violation of this assumption can cause inconsistencies and hard-to-catch bugs. In a combined repository, it's possible to reach a better integrity during development, as well as with cross-component (integration) tests.
-
Pay for what you need. Small components (basic crates), which cross-depend only on what they need, allow users to only bring in what they consume in their project.
-
Shared bootstrapping. Considerable amount of extending Unicode/i18n functionalities depends on converting source Unicode/locale data into structured formats for the destination programming language. In a combined repository, it's easier to maintain these bootstrapping tools, expand coverage, and use better data structures for more efficiency.
Documentation
- Unicode and Rust
- UNIC Versioning
- UNIC Unicode API
- UNIC API Guideline
- UNIC API Reference (autogenerated on docs.rs)
How to Use UNIC
In Cargo.toml
:
[dependencies]
unic = "0.9.0" # This has Unicode 10.0.0 data and algorithms
And in main.rs
:
extern crate unic;
use unic::ucd::common::is_alphanumeric;
use unic::bidi::BidiInfo;
use unic::normal::StrNormalForm;
use unic::segment::{GraphemeIndices, Graphemes, WordBoundIndices, WordBounds, Words};
use unic::ucd::normal::compose;
use unic::ucd::{is_cased, Age, BidiClass, CharAge, CharBidiClass, StrBidiClass, UnicodeVersion};
fn main() {
// Age
assert_eq!(Age::of('A').unwrap().actual(), UnicodeVersion { major: 1, minor: 1, micro: 0 });
assert_eq!(Age::of('\u{A0000}'), None);
assert_eq!(
Age::of('\u{10FFFF}').unwrap().actual(),
UnicodeVersion { major: 2, minor: 0, micro: 0 }
);
if let Some(age) = '🦊'.age() {
assert_eq!(age.actual().major, 9);
assert_eq!(age.actual().minor, 0);
assert_eq!(age.actual().micro, 0);
}
// Bidi
let text = concat![
"א",
"ב",
"ג",
"a",
"b",
"c",
];
assert!(!text.has_bidi_explicit());
assert!(text.has_rtl());
assert!(text.has_ltr());
assert_eq!(text.chars().nth(0).unwrap().bidi_class(), BidiClass::RightToLeft);
assert!(!text.chars().nth(0).unwrap().is_ltr());
assert!(text.chars().nth(0).unwrap().is_rtl());
assert_eq!(text.chars().nth(3).unwrap().bidi_class(), BidiClass::LeftToRight);
assert!(text.chars().nth(3).unwrap().is_ltr());
assert!(!text.chars().nth(3).unwrap().is_rtl());
let bidi_info = BidiInfo::new(text, None);
assert_eq!(bidi_info.paragraphs.len(), 1);
let para = &bidi_info.paragraphs[0];
assert_eq!(para.level.number(), 1);
assert_eq!(para.level.is_rtl(), true);
let line = para.range.clone();
let display = bidi_info.reorder_line(para, line);
assert_eq!(
display,
concat![
"a",
"b",
"c",
"ג",
"ב",
"א",
]
);
// Case
assert_eq!(is_cased('A'), true);
assert_eq!(is_cased('א'), false);
// Normalization
assert_eq!(compose('A', '\u{030A}'), Some('Å'));
let s = "ÅΩ";
let c = s.nfc().collect::<String>();
assert_eq!(c, "ÅΩ");
// Segmentation
assert_eq!(
Graphemes::new("a\u{310}e\u{301}o\u{308}\u{332}").collect::<Vec<&str>>(),
&["a\u{310}", "e\u{301}", "o\u{308}\u{332}"]
);
assert_eq!(
Graphemes::new("a\r\nb🇺🇳🇮🇨").collect::<Vec<&str>>(),
&["a", "\r\n", "b", "🇺🇳", "🇮🇨"]
);
assert_eq!(
GraphemeIndices::new("a̐éö̲\r\n").collect::<Vec<(usize, &str)>>(),
&[(0, "a̐"), (3, "é"), (6, "ö̲"), (11, "\r\n")]
);
assert_eq!(
Words::new(
"The quick (\"brown\") fox can't jump 32.3 feet, right?",
|s: &&str| s.chars().any(is_alphanumeric),
).collect::<Vec<&str>>(),
&["The", "quick", "brown", "fox", "can't", "jump", "32.3", "feet", "right"]
);
assert_eq!(
WordBounds::new("The quick (\"brown\") fox").collect::<Vec<&str>>(),
&["The", " ", "quick", " ", "(", "\"", "brown", "\"", ")", " ", " ", "fox"]
);
assert_eq!(
WordBoundIndices::new("Brr, it's 29.3°F!").collect::<Vec<(usize, &str)>>(),
&[
(0, "Brr"),
(3, ","),
(4, " "),
(5, "it's"),
(9, " "),
(10, "29.3"),
(14, "°"),
(16, "F"),
(17, "!")
]
);
}
You can find more examples under examples
and tests
directories. (And more to be added as UNIC expands...)
License
Licensed under either of
- Apache License, Version 2.0 (LICENSE-APACHE or http://www.apache.org/licenses/LICENSE-2.0)
- MIT license (LICENSE-MIT or http://opensource.org/licenses/MIT)
at your option.
Contribution
Unless you explicitly state otherwise, any contribution intentionally submitted for inclusion in the work by you, as defined in the Apache-2.0 license, shall be dual licensed as above, without any additional terms or conditions.
Code of Conduct
UNIC project follows The Rust Code of Conduct. You can find a copy of it in CODE_OF_CONDUCT.md or online at https://www.rust-lang.org/conduct.html.
lib.rs
:
UNIC — Unicode Emoji — Emoji Character Properties
A component of unic
: Unicode and Internationalization Crates for Rust.