
hebrew_unicode_script

A lightweight library to check if a hebrew character belongs to certain collections

7 unstable releases (3 breaking)

0.4.2 Sep 8, 2024
0.4.1 Aug 30, 2024
0.3.2 Aug 12, 2024
0.2.0 Jul 20, 2024
0.1.1 Jul 20, 2024

Used in hebrew_unicode_utils

MIT/Apache

59KB
579 lines

Hebrew_Unicode_Script


Description

This crate (hebrew_unicode_script) is a low-level library written in Rust, designed to facilitate the identification and validation of Unicode characters belonging to the Hebrew script and its associated Unicode code blocks.

This library provides two types of interface:

  1. functions
  2. trait (the same functions behind one trait)

These functions (called either directly or via the trait) let developers determine whether a particular character belongs to the Hebrew Unicode script, falls within the Hebrew Unicode code block, or falls within the Alphabetic Presentation Forms Unicode code block.
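The two interface styles can be pictured with a minimal sketch. The names below (`in_hebrew_block`, `in_apf_block`, `BlockCheck`) are hypothetical stand-ins, not the crate's API, and the bodies are not the crate's implementation. The only facts assumed are the Unicode block ranges U+0590 .. U+05FF (Hebrew) and U+FB00 .. U+FB4F (Alphabetic Presentation Forms).

```rust
/// Hypothetical stand-in: true if `c` lies in the Unicode block
/// 'Hebrew' (U+0590..=U+05FF).
pub fn in_hebrew_block(c: char) -> bool {
    matches!(c, '\u{0590}'..='\u{05FF}')
}

/// Hypothetical stand-in: true if `c` lies in the Unicode block
/// 'Alphabetic Presentation Forms' (U+FB00..=U+FB4F).
/// Note that this block also contains Latin and Armenian ligatures.
pub fn in_apf_block(c: char) -> bool {
    matches!(c, '\u{FB00}'..='\u{FB4F}')
}

/// The same checks, exposed as methods on `char` through one trait.
pub trait BlockCheck {
    fn in_hebrew_block(&self) -> bool;
    fn in_apf_block(&self) -> bool;
}

impl BlockCheck for char {
    fn in_hebrew_block(&self) -> bool {
        // Delegates to the free function of the same name.
        in_hebrew_block(*self)
    }

    fn in_apf_block(&self) -> bool {
        in_apf_block(*self)
    }
}
```

With such a trait in scope, `in_hebrew_block('מ')` and `'מ'.in_hebrew_block()` are two spellings of the same check, which is the relationship the list above describes.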

For each of these two Unicode code blocks there are additional functions that allow a finer-grained check of the character type within the block, for example vowels, accents and marks.

Each function in this library returns a boolean value, making it easy to integrate these checks into existing or new applications.
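Because every check has the shape `fn(char) -> bool`, it can be handed directly to iterator adapters. The sketch below uses a hypothetical stand-in predicate, `is_letter_sketch` (covering only U+05D0 .. U+05EA), so that it compiles without the crate. With the crate installed, a function such as `is_hbr_consonant` would presumably slot in the same way. The helpers `keep_matching` and `count_matching` are also illustrative names only.

```rust
/// Hypothetical stand-in predicate: Hebrew letters alef..tav (U+05D0..=U+05EA).
fn is_letter_sketch(c: char) -> bool {
    matches!(c, '\u{05D0}'..='\u{05EA}')
}

/// Keep only the characters accepted by `pred`, dropping everything else
/// (points, accents, whitespace, non-Hebrew text).
fn keep_matching(s: &str, pred: fn(char) -> bool) -> String {
    s.chars().filter(|&c| pred(c)).collect()
}

/// Count the characters accepted by `pred`.
fn count_matching(s: &str, pred: fn(char) -> bool) -> usize {
    s.chars().filter(|&c| pred(c)).count()
}
```

For instance, `keep_matching` applied to a pointed word followed by Latin text returns only the bare letters of that word.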

For an overview of released versions see releases.

Function overview

Unicode script 'Hebrew' (top level):

  - is_script_hbr(c: char) -> bool
  - is_script_hbr_point(c: char) -> bool
  - is_script_hbr_consonant(c: char) -> bool
  - is_script_hbr_ligature_yiddisch(c: char) -> bool

Unicode block: 'Hebrew'

2nd level:

  - is_hbr_block(c: char) -> bool
  - is_hbr_accent(c: char) -> bool
  - is_hbr_mark(c: char) -> bool
  - is_hbr_point(c: char) -> bool
  - is_hbr_punctuation(c: char) -> bool
  - is_hbr_consonant(c: char) -> bool
  - is_hbr_yod_triangle(c: char) -> bool
  - is_hbr_ligature_yiddish(c: char) -> bool

3rd level:

    - is_hbr_point_vowel(c: char) -> bool
    - is_hbr_point_semi_vowel(c: char) -> bool
    - is_hbr_point_reading_sign(c: char) -> bool
    - is_hbr_consonant_normal(c: char) -> bool
    - is_hbr_consonant_final(c: char) -> bool
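The levels can be read as a hierarchy in which a 2nd-level check is, presumably, the union of its 3rd-level checks. Below is a minimal sketch of that idea for the consonants of the 'Hebrew' block. The function names are hypothetical, the code points are those given in the Notes section of this README, and the crate's actual implementation may differ.

```rust
/// 3rd level (hypothetical): the five final forms.
fn consonant_final_sketch(c: char) -> bool {
    matches!(
        c,
        '\u{05DA}' | '\u{05DD}' | '\u{05DF}' | '\u{05E3}' | '\u{05E5}'
    )
}

/// 3rd level (hypothetical): every letter alef..tav that is not a final form.
fn consonant_normal_sketch(c: char) -> bool {
    matches!(c, '\u{05D0}'..='\u{05EA}') && !consonant_final_sketch(c)
}

/// 2nd level (hypothetical): the union of the two 3rd-level checks.
fn consonant_sketch(c: char) -> bool {
    consonant_normal_sketch(c) || consonant_final_sketch(c)
}
```

Under this reading, a character such as 'ץ' (U+05E5) is a consonant and a final consonant, but not a normal consonant.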

Unicode block: 'Alphabetic Presentation Forms'

2nd level:

  - is_apf_block(c: char) -> bool
  - is_apf_point_reading_sign(c: char) -> bool
  - is_apf_consonant(c: char) -> bool
  - is_apf_ligature_yiddisch(c: char) -> bool
  - is_apf_ligature(c: char) -> bool

3rd level:

    - is_apf_consonant_alternative(c: char) -> bool
    - is_apf_consonant_wide(c: char) -> bool
    - is_apf_consonant_with_vowel(c: char) -> bool

Notes

  • Hebrew points can be subdivided in:

    • Vowels (code points: U+05B4 .. U+05BB, U+05C7)
    • Semi-Vowels (code points: U+05B0 .. U+05B3)
    • Reading Signs (code points: U+05BC .. U+05BD, U+05BF, U+05C1 .. U+05C2)
      Note: it is not yet clear to me whether 'HEBREW POINT JUDEO-SPANISH VARIKA' is a reading sign. For the time being this code point is treated as a reading sign.
  • Hebrew letters (consonants) can be subdivided in:

    • Normal consonants (code points: U+05D0 .. U+05D9, U+05DB, U+05DC, U+05DE, U+05E0 .. U+05E2, U+05E4, U+05E6 .. U+05EA)
    • Final consonants (code points: U+05DA, U+05DD, U+05DF, U+05E3 and U+05E5)
    • Wide consonants (code points: U+FB21 .. U+FB28)
    • Consonants with vowel (code points: U+FB2A .. U+FB36, U+FB38 .. U+FB3C, U+FB3E, U+FB40, U+FB41, U+FB43, U+FB44, U+FB46 .. U+FB4E)
    • Alternative consonants (code points: U+FB20, U+FB29)
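The subdivisions listed above can also be written down as a classifier. The sketch below is not part of the crate, which exposes boolean functions. The enums `PointKind` and `ApfConsonantKind` and both functions are hypothetical names, and only the code point ranges are taken from the notes above.

```rust
/// Hypothetical categories for points in the 'Hebrew' block.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum PointKind {
    Vowel,
    SemiVowel,
    ReadingSign,
}

/// Hypothetical categories for consonants in the presentation-forms block.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum ApfConsonantKind {
    Alternative,
    Wide,
    WithVowel,
}

/// Classify a point from the 'Hebrew' block; `None` for any other character.
pub fn point_kind(c: char) -> Option<PointKind> {
    match c {
        '\u{05B0}'..='\u{05B3}' => Some(PointKind::SemiVowel),
        '\u{05B4}'..='\u{05BB}' | '\u{05C7}' => Some(PointKind::Vowel),
        '\u{05BC}'..='\u{05BD}' | '\u{05BF}' | '\u{05C1}'..='\u{05C2}' => {
            Some(PointKind::ReadingSign)
        }
        _ => None,
    }
}

/// Classify a presentation-form consonant; `None` for any other character.
pub fn apf_consonant_kind(c: char) -> Option<ApfConsonantKind> {
    match c {
        '\u{FB20}' | '\u{FB29}' => Some(ApfConsonantKind::Alternative),
        '\u{FB21}'..='\u{FB28}' => Some(ApfConsonantKind::Wide),
        '\u{FB2A}'..='\u{FB36}'
        | '\u{FB38}'..='\u{FB3C}'
        | '\u{FB3E}'
        | '\u{FB40}'..='\u{FB41}'
        | '\u{FB43}'..='\u{FB44}'
        | '\u{FB46}'..='\u{FB4E}' => Some(ApfConsonantKind::WithVowel),
        _ => None,
    }
}
```

Returning an `Option` of an enum makes the categories mutually exclusive by construction, whereas separate boolean functions leave that to the ranges themselves.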


Examples

Using the function API

Basic usage:

use hebrew_unicode_script::is_hbr_block;

if is_hbr_block('מ') {
    println!("The character you entered is part of the 'unicode code block Hebrew'");
}

use hebrew_unicode_script::{is_hbr_consonant, is_hbr_consonant_final};

let test_str = "ךםןףץ";
for c in test_str.chars() {
    assert!(is_hbr_consonant_final(c));
    assert!(is_hbr_consonant(c));
}

A more complex example:

use hebrew_unicode_script::{is_hbr_accent, is_hbr_mark, is_hbr_point, is_hbr_punctuation};
use hebrew_unicode_script::{is_hbr_consonant_final, is_hbr_consonant_normal};
use hebrew_unicode_script::{is_hbr_ligature_yiddish, is_hbr_yod_triangle};

fn main() {
    // Define a string of characters
    let string_of_chars = "יָ֭דַעְתָּ שִׁבְתִּ֣י abcdefg וְקוּמִ֑י";
    // Get a structure that indicates, per character type, whether it is present (bool)
    let chartypes = get_character_types(string_of_chars);
    // Print the results (pretty-printed, one field per line)
    println!("The following character types were found in: {}", string_of_chars);
    println!("{:#?}", chartypes);
}

#[derive(Debug, Default)]
pub struct HebrewCharacterTypes {
    accent: bool,
    mark: bool,
    point: bool,
    punctuation: bool,
    letter: bool,
    letter_normal: bool,
    letter_final: bool,
    yod_triangle: bool,
    ligature_yiddish: bool,
    whitespace: bool,
    non_hebrew: bool,
}

impl HebrewCharacterTypes {
    fn new() -> Self {
        Default::default()
    }
}

pub fn get_character_types(s: &str) -> HebrewCharacterTypes {
    let mut found_character_types = HebrewCharacterTypes::new();
    for c in s.chars() {
        // Set the flag of the first category that matches the character
        match c {
            c if is_hbr_accent(c) => found_character_types.accent = true,
            c if is_hbr_mark(c) => found_character_types.mark = true,
            c if is_hbr_point(c) => found_character_types.point = true,
            c if is_hbr_punctuation(c) => found_character_types.punctuation = true,
            c if is_hbr_consonant_normal(c) => found_character_types.letter_normal = true,
            c if is_hbr_consonant_final(c) => found_character_types.letter_final = true,
            c if is_hbr_yod_triangle(c) => found_character_types.yod_triangle = true,
            c if is_hbr_ligature_yiddish(c) => found_character_types.ligature_yiddish = true,
            c if c.is_whitespace() => found_character_types.whitespace = true,
            _ => found_character_types.non_hebrew = true,
        }
    }
    // A letter is present if a normal or a final consonant was found
    found_character_types.letter =
        found_character_types.letter_normal || found_character_types.letter_final;
    found_character_types
}

Output result:

The following character types were found in: יָ֭דַעְתָּ שִׁבְתִּ֣י abcdefg וְקוּמִ֑י
HebrewCharacterTypes {
    accent: true,
    mark: false,
    point: true,
    punctuation: false,
    letter: true,
    letter_normal: true,
    letter_final: false,
    yod_triangle: false,
    ligature_yiddish: false,
    whitespace: true,
    non_hebrew: true,
}

Using the trait API

use hebrew_unicode_script::HebrewUnicodeScript;

assert!( 'מ'.is_script_hbr() );
assert!( !'מ'.is_script_hbr_point() );
assert!( 'מ'.is_script_hbr_consonant() );
assert!( 'ױ'.is_script_hbr_ligature_yiddisch() );
assert!( 'מ'.is_hbr_block() );
assert!( !'מ'.is_hbr_accent() );
assert!( !'מ'.is_hbr_mark() );
assert!( !'מ'.is_hbr_point() );
assert!( !'מ'.is_hbr_point_vowel() );
assert!( !'מ'.is_hbr_point_semi_vowel() );
assert!( '\u{05BF}'.is_hbr_point_reading_sign() );
assert!( '\u{05BE}'.is_hbr_punctuation() );
assert!( 'ץ'.is_hbr_consonant() );
assert!( !'ץ'.is_hbr_consonant_normal() );
assert!( 'ץ'.is_hbr_consonant_final() );
assert!( '\u{05EF}'.is_hbr_yod_triangle() );
assert!( !'מ'.is_hbr_ligature_yiddish() );
assert!( !'מ'.is_apf_block() );
assert!( !'מ'.is_apf_point_reading_sign() );
assert!( !'מ'.is_apf_consonant() );
assert!( !'מ'.is_apf_consonant_alternative() );
assert!( !'מ'.is_apf_consonant_wide() );
assert!( !'מ'.is_apf_consonant_with_vowel() );
assert!( !'מ'.is_apf_ligature_yiddisch() );
assert!( !'מ'.is_apf_ligature() );

See the crate modules for more examples.


Install

For installation see the hebrew_unicode_script page at crates.io.


References

Unicode Script

Unicode Block Names

  1. Hebrew
  2. Alphabetic Presentation Forms

Learn more about Unicode, Unicode scripts and Unicode code point blocks

Unicode Problems for Hebrew

There are some issues with Unicode and Hebrew. These are described on the following web page: Unicode Problems


Safety

All functions are written in safe Rust.


Panics

Not that I am aware of.


Errors

All functions and trait methods return either true or false; none of them returns an error value.


Code Coverage

Current code coverage is 100%. Note: the code coverage figures shown on crates.io are (very) different!

Coverage was measured locally, as described below.

Actions:

  1. Install the Coverage Gutters extension (for VSCodium)
  2. Execute: cargo clean && mkdir -p target/coverage/html
  3. Execute: CARGO_INCREMENTAL=0 RUSTFLAGS='-Cinstrument-coverage' LLVM_PROFILE_FILE='cargo-test-%p-%m.profraw' cargo test
    • result -> (new file) cargo-test-67187-8558864636421498001_0.profraw (on my system)

Option 1: Using Coverage Gutters

  1. Execute: grcov . --binary-path ./target/debug/deps/ -s . -t lcov --branch --ignore-not-existing --ignore '../*' --ignore "/*" -o target/coverage/tests.lcov

    • result -> (new file) tests.lcov
  2. Click on the Watch button (added to VSCodium by the Ext)

    • result -> red/green indications will appear for each line of code

Option 2: Creating a webpage

  1. Execute: grcov . --binary-path ./target/debug/deps/ -s . -t html --branch --ignore-not-existing --ignore '../*' --ignore "/*" -o target/coverage/html
    • result -> a new directory called html
  2. Open the file index.html in the html folder in your browser to view the full report.


License

Licensed under either of Apache License, Version 2.0 or MIT license at your option.


Contribution

Unless you explicitly state otherwise, any contribution intentionally submitted for inclusion in this crate by you, as defined in the Apache-2.0 license, shall be dual licensed as above, without any additional terms or conditions.

