3 releases (breaking)
0.3.0 | Aug 27, 2023 |
---|---|
0.2.0 | Aug 27, 2023 |
0.1.0 | Aug 19, 2023 |
#905 in Algorithms
177 downloads per month
Used in srgn
27KB
214 lines
Decompound
Split (decompose) a compound word into its constituent parts. Works in any language, as you provide the rules around what constitutes a (single) word. The algorithm is Unicode-aware.
Useful for culling down existing dictionaries at build time.
The docs are best viewed via docs.rs (but note those docs are tied to releases, so might not be on par with the head of trunk).
Usage
Usage is very straightforward. There is only one (free) function of interest,
decompound
. Its party piece is a closure argument, deciding whether a single word
is valid. As this can be highly dynamic and language-specific, this decision is left to
the user.
use decompound::{decompound, DecompositionOptions};
let is_valid_single_word = |w: &str| ["bed", "room"].contains(&w);
assert_eq!(
decompound(
"bedroom",
&is_valid_single_word,
DecompositionOptions::empty(),
).unwrap(),
vec!["bed", "room"]
);
Candidates for validity checks are simple dictionary lookups (for example, using
std::collections::HashSet
, phf
, Finite State
Transducers, binary
search, ...), or any elaborate algorithm of your
choice.
Configuration
Configuration is exposed as a bit
field via
DecompositionOptions
. It affords more complex use cases, freely combinable.
Usefulness largely depends on the natural language at hand. Some, for example German,
might require:
use decompound::{decompound, DecompositionError, DecompositionOptions};
let is_valid_single_word = |w: &str| ["Rüben", "Knollen", "Küche"].contains(&w);
assert_eq!(
decompound(
"Rübenknollen-Küche",
&is_valid_single_word,
// Wouldn't find anything without titlecasing `knollen` to `Knollen`,
// and splitting on hyphens.
DecompositionOptions::SPLIT_HYPHENATED
| DecompositionOptions::TRY_TITLECASE_SUFFIX
).unwrap(),
vec!["Rüben", "Knollen", "Küche"]
);
For more options and examples, refer to the docs of DecompositionOptions
.
Failure modes
If the word cannot be decomposed, a DecompositionError
is returned.
use decompound::{decompound, DecompositionError, DecompositionOptions};
let is_valid_single_word = |w: &str| ["water", "melon"].contains(&w);
assert_eq!(
decompound(
"snowball",
&is_valid_single_word,
DecompositionOptions::empty(),
).unwrap_err(),
DecompositionError::NothingValid
);
Overeager validity checks
Nothing prevents you from providing a closure which itself accepts compound words.
Compound words (like railroad
) being included in a lookup dictionary (instead of
only its root words rail
and road
) is an example "pathological" case.
Accommodating compound words yourself is precisely what this crate is supposed to
alleviate. If you already have and do not want to or cannot drop that
capability, this crate might be obsolete for your case (hence "overeager checks").
Although decompound
prefers splits if possible, such as
use decompound::{decompound, DecompositionError, DecompositionOptions};
// Contains a compound word *and* its root words.
let is_valid_single_word = |w: &str| ["blueberry", "blue", "berry"].contains(&w);
assert_eq!(
decompound(
"blueberry",
&is_valid_single_word,
DecompositionOptions::empty(),
).unwrap(),
vec!["blue", "berry"]
);
if root words are missing but the compound itself is present, decomposition technically
fails. This case is considered an error, and marked as such. That is more
ergonomic than
being returned a [Vec
] of constituents of length 1, requiring more awkward error
handling at the call site.
use decompound::{decompound, DecompositionError, DecompositionOptions};
// *Only* contains a compound word, not its root words.
let is_valid_single_word = |w: &str| ["firefly"].contains(&w);
assert_eq!(
decompound(
"firefly",
&is_valid_single_word,
DecompositionOptions::empty(),
).unwrap_err(),
DecompositionError::SingleWord("firefly".to_string())
);
Match on this variant if this case is not an error in your domain (this crate itself does so internally, too).
Motivation
The crate implementation is simple and nothing you wouldn't be able to write yourself.
There is a catch though. As mentioned, this crate can help you move checks for compound
words from static (a fixed dictionary) to runtime (decompound
). For some languages,
this is strictly required, as the set of compound words might be immense, or
(effectively, not mathematically) unbounded, meaning root words may be combined to
arbitrary lengths.
German is such a case. No dictionary exists to cover all possible German words. However, existing ones are almost guaranteed to themselves contain some compound words (which is generally helpful). When using such dictionaries and this crate to cover all remaining, arbitrary compound words, duplication arises, and the dictionary is no longer minimal. Most, perhaps all, compound words in the dictionary could be detected at runtime instead (providing a single source of truth along the way).
Culling the dictionary might lead to significant, perhaps necessary savings in size (memory and executable). For culling, a build script is needed. However, now both the code being built and the build script depend on that same detection algorithm. Where does it live?
-
The build script cannot depend on what it's building. Therefore, the algorithm cannot live inside the crate under construction.
-
The temptation of copy-pasting it into the build script arises. However, if what the dictionary is being culled with with gets out of sync with what's done at runtime, bugs arise. Maintainability suffers. Therefore, duplicating the algorithm is not an option.
-
Currently (2023-08-19), there is no place for the check to live except another crate, external to both the build script and actual code. That's this crate. It affords:
- a non-cyclic build graph,
- a single source of truth for the algorithm,
- usage of any dictionary, no out-of-band preprocessing necessary (the original dictionary can be kept).
Developing
The library crate contains an accompanying binary. It is accessible
simply via cargo run
. This allows running this crate locally, provided a word list on
stdin. For example:
$ sudo apt update && sudo apt install --yes wngerman
$ cat /usr/share/dict/ngerman | cargo run -- --try-titlecase-suffix --split-hyphenated 'Affengruppen-Überfall' 2>/dev/null
Affen
Gruppen
Überfall
Dependencies
~510KB