3 releases (Rust 2015 edition)

  • 0.7.3 (May 20, 2020)
  • 0.7.2 (Aug 9, 2019)
  • 0.7.1 (Aug 9, 2019)


Apache-2.0/MIT

64KB
982 lines

Introduction

kl-hyphenate provides pattern-driven hyphenation of UTF-8 text in a variety of languages. Two strategies are available:

  • standard Knuth–Liang hyphenation, which identifies permissible break points within a word (the `Standard` dictionaries used throughout this README);
  • extended ("non-standard") hyphenation, for languages whose spelling changes at a break point, covered by the bundled extended pattern files (see License).

Usage

Quickstart

The dictionaries can be built with:

cargo build -vv --features build_dictionaries

The resulting dictionaries are saved in the dictionaries directory.

You can then load and use a dictionary with:

use kl_hyphenate::{Standard, Hyphenator, Language, Load};

let path_to_dict = "dictionaries/en-us.standard.bincode";
let en_us = Standard::from_path(Language::EnglishUS, path_to_dict) ?;

// Identify valid breaks in the given word.
let hyphenated = en_us.hyphenate("hyphenation");

// Word breaks are represented as byte indices into the string.
let break_indices = &hyphenated.breaks;
assert_eq!(break_indices, &[2, 6, 7]);

// The segments of a hyphenated word can be iterated over.
let segments = hyphenated.into_iter().segments();
let collected : Vec<_> = segments.collect();
assert_eq!(collected, vec!["hy", "phen", "a", "tion"]);

// `hyphenate()` is case-insensitive.
let uppercase : Vec<_> = en_us.hyphenate("CAPITAL").into_iter().collect();
assert_eq!(uppercase, vec!["CAP-", "I-", "TAL"]);
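Since `from_path` returns a `Result` (hence the `?` above), the snippet is meant to run inside a function that returns one. A minimal standalone sketch might look like the following, assuming the en-US dictionary has already been built as described above and boxing the error type for brevity:

use kl_hyphenate::{Standard, Hyphenator, Language, Load};

fn main() -> Result<(), Box<dyn std::error::Error>> {
    // Load a prebuilt dictionary from the `dictionaries` directory.
    let en_us = Standard::from_path(Language::EnglishUS, "dictionaries/en-us.standard.bincode")?;

    // Collect the segments, which carry hyphen marks at each break opportunity,
    // into a single string: prints "hy-phen-a-tion".
    let marked : String = en_us.hyphenate("hyphenation").into_iter().collect();
    println!("{}", marked);
    Ok(())
}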

Segmentation

Dictionaries can be used in conjunction with text segmentation to hyphenate words within a text run. This short example uses the unicode-segmentation crate for untailored Unicode segmentation.

use unicode_segmentation::UnicodeSegmentation;

let hyphenate_text = |text : &str| -> String {
    // Split the text on word boundaries—
    text.split_word_bounds()
        // —and hyphenate each word individually.
        .flat_map(|word| en_us.hyphenate(word).into_iter())
        .collect()
};

let excerpt = "I know noble accents / And lucid, inescapable rhythms; […]";
assert_eq!("I know no-ble ac-cents / And lu-cid, in-escapable rhythms; […]"
          , hyphenate_text(excerpt));

Normalization

Hyphenation patterns for languages affected by normalization occasionally cover multiple forms, at the discretion of their authors, but most often they don’t. If you require kl-hyphenate to operate strictly on strings in a known normalization form, as described by the Unicode Standard Annex #15 and provided by the unicode-normalization crate, you may specify it in your Cargo manifest, like so:

[dependencies.kl-hyphenate]
version = ""
features = ["nfc"]

The features field may contain exactly one of the following normalization options:

  • "nfc", for canonical composition;
  • "nfd", for canonical decomposition;
  • "nfkc", for compatibility composition;
  • "nfkd", for compatibility decomposition.

It is recommended to build kl-hyphenate in release mode if normalization is enabled, since the bundled hyphenation patterns will need to be reprocessed into dictionaries.
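For example, a release-mode build that regenerates the bundled dictionaries with canonical decomposition enabled could be invoked roughly as follows (feature names as listed above; the exact set you enable depends on your configuration):

cargo build --release -vv --features "build_dictionaries nfd"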

License

Dual-licensed under the terms of either:

  • the Apache License, Version 2.0
  • the MIT license

hyph-utf8 hyphenation patterns © their respective owners; see their master files for licensing information.

patterns/hyph-hu.ext.txt (extended Hungarian hyphenation patterns) is licensed under:

  • MPL 1.1 (refer to patterns/hyph-hu.ext.lic.txt)

patterns/hyph-ca.ext.txt (extended Catalan hyphenation patterns) is licensed under:

  • LGPL v.3.0 or higher (refer to patterns/hyph-ca.ext.lic.txt)

Dependencies

~0.7–1.4MB
~31K SLoC