#phonetic #algorithm #soundex #apache #port #metaphone #caverphone

rphonetic

Rust port of phonetic Apache commons-codec algorithms

13 stable releases

2.2.0 Apr 5, 2024
2.1.5 Feb 28, 2024
2.1.4 Jan 30, 2024
2.1.2 Nov 26, 2023
1.2.0 Jul 31, 2022

#70 in Text processing


291 downloads per month
Used in tantivy-analysis-contrib

Apache-2.0

450KB
10K SLoC


Rust phonetic

This is a Rust port of the phonetic algorithms from Apache commons-codec v1.15.

Algorithms

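Currently, the following algorithms are available, each with an example in the Examples section: Beider-Morse, Caverphone 1 & 2, Cologne, Daitch-Mokotoff, Match Rating Approach, Metaphone, Metaphone (Double), Phonex, Nysiis, Soundex and Soundex (Refined).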

Please note that most of these algorithms are designed for the Latin alphabet, and each is usually designed for a specific use case (e.g. English names, English dictionary words, etc.).

Examples

Beider-Morse

fn main() -> Result<(), rphonetic::PhoneticError> {
    use std::path::PathBuf;
    use rphonetic::{BeiderMorseBuilder, ConfigFiles, Encoder};

    let config_files = ConfigFiles::new(&PathBuf::from("./test_assets/cc-rules/"))?;
    let builder = BeiderMorseBuilder::new(&config_files);
    let beider_morse = builder.build();

    assert_eq!(beider_morse.encode("Van Helsing"),"(Ylznk|ilzn|ilznk|xilzn|xilznk)-(banilznk|bonilznk|fYnYlznk|fYnilznk|fanYlznk|fanilznk|fonYlznk|fonilznk|vYnYlznk|vYnilznk|vanYlznk|vaniilznk|vanilzn|vanilznk|vonYlznk|voniilznk|vonilzn|vonilznk)");
    Ok(())
}
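The encoded value above groups its alternatives in parentheses, separates alternatives with `|`, and joins the groups with `-`. The following is a minimal sketch of how such a string could be split for further processing. It uses only the standard library; `split_alternatives` is a hypothetical helper, not part of rphonetic, and it assumes that `-` never appears inside an individual code.

```rust
/// Hypothetical helper (not part of rphonetic): splits a result shaped like
/// "(a|b)-(c|d)" into one list of alternatives per group.
/// Assumes '-' only separates groups and '|' only separates alternatives.
fn split_alternatives(encoded: &str) -> Vec<Vec<String>> {
    encoded
        .split('-')
        .map(|group| {
            group
                // Parentheses are optional: a single group may come without them.
                .trim_start_matches('(')
                .trim_end_matches(')')
                .split('|')
                .map(str::to_string)
                .collect()
        })
        .collect()
}

fn main() {
    // Shortened sample in the same shape as the "Van Helsing" result.
    let groups = split_alternatives("(ilzn|xilzn)-(vanilzn|vonilzn)");
    assert_eq!(groups.len(), 2);
    assert_eq!(groups[0], vec!["ilzn", "xilzn"]);
    assert_eq!(groups[1], vec!["vanilzn", "vonilzn"]);
}
```

Keeping the groups separate, rather than flattening them, preserves which alternatives belong to which part of the name.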

Caverphone 1 & 2

fn main() {
    use rphonetic::{Caverphone1, Encoder};

    let caverphone = Caverphone1;
    assert_eq!(caverphone.encode("Thompson"), "TMPSN1");
}

fn main() {
    use rphonetic::{Caverphone2, Encoder};

    let caverphone = Caverphone2;
    assert_eq!(caverphone.encode("Thompson"), "TMPSN11111");
}

Cologne

fn main() {
    use rphonetic::{Cologne, Encoder};

    let cologne = Cologne;
    assert_eq!(cologne.encode("m\u{00FC}ller"), "657");
}

Daitch-Mokotoff

fn main() -> Result<(), rphonetic::PhoneticError> {
    use rphonetic::{DaitchMokotoffSoundex, DaitchMokotoffSoundexBuilder, Encoder};

    const COMMONS_CODEC_RULES: &str = include_str!("./rules/dmrules.txt");

    let encoder = DaitchMokotoffSoundexBuilder::with_rules(COMMONS_CODEC_RULES).build()?;
    assert_eq!(encoder.soundex("Rosochowaciec"), "944744|944745|944754|944755|945744|945745|945754|945755");
    Ok(())
}
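Because a single name can yield several codes separated by `|`, one plausible way to compare two names is to check whether their code lists share at least one entry. The sketch below illustrates that idea with the standard library only; `codes_overlap` is a hypothetical helper, not part of rphonetic, and the second code list is made up for the illustration.

```rust
use std::collections::HashSet;

/// Hypothetical helper (not part of rphonetic): returns true when two
/// '|'-separated code lists have at least one code in common.
fn codes_overlap(left: &str, right: &str) -> bool {
    // Empty entries are ignored so that two empty strings do not match.
    let left_codes: HashSet<&str> = left.split('|').filter(|c| !c.is_empty()).collect();
    right
        .split('|')
        .filter(|c| !c.is_empty())
        .any(|code| left_codes.contains(code))
}

fn main() {
    // Result for "Rosochowaciec", copied from the example above.
    let rosochowaciec = "944744|944745|944754|944755|945744|945745|945754|945755";
    // Made-up code lists, used only to exercise the helper.
    assert!(codes_overlap(rosochowaciec, "123456|945754"));
    assert!(!codes_overlap(rosochowaciec, "123456"));
}
```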

Match Rating Approach

fn main() {
    use rphonetic::{Encoder, MatchRatingApproach};
    
    let match_rating = MatchRatingApproach;
    assert_eq!(match_rating.encode("Smith"), "SMTH");
}

Metaphone

fn main() {
    use rphonetic::{Encoder, Metaphone};
    
    let metaphone = Metaphone::default();
    assert_eq!(metaphone.encode("Joanne"), "JN");
}

Metaphone (Double)

fn main() {
    use rphonetic::{DoubleMetaphone, Encoder};

    let double_metaphone = DoubleMetaphone::default();
    assert_eq!(double_metaphone.encode("jumped"), "JMPT");
    assert_eq!(double_metaphone.encode_alternate("jumped"), "AMPT");
}

Phonex

fn main() {
    use rphonetic::{Phonex, Encoder};

    // Default settings
    let phonex = Phonex::default();
    assert_eq!(phonex.encode("William"),"W450");
}

Nysiis

fn main() {
    use rphonetic::{Nysiis, Encoder};

    // Strict (the default): the code is truncated to 6 characters
    let nysiis = Nysiis::default();
    assert_eq!(nysiis.encode("WESTERLUND"),"WASTAR");

    // Not strict: the full code is kept
    let nysiis = Nysiis::new(false);
    assert_eq!(nysiis.encode("WESTERLUND"),"WASTARLAD");
}

Soundex

fn main() {
    use rphonetic::{Encoder, Soundex};

    let soundex = Soundex::default();
    assert_eq!(soundex.encode("jumped"), "J513");
}
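For readers unfamiliar with how a code such as `J513` comes about, here is a minimal sketch of the textbook American Soundex rules: keep the first letter, map the following consonants to digits, skip adjacent duplicates, and pad to four characters. It is a standalone illustration using only the standard library, not the crate's implementation; `simple_soundex` and `soundex_digit` are hypothetical names, and the sketch handles ASCII letters only.

```rust
/// Maps a letter to its Soundex digit; None for vowels, H, W, Y and non-letters.
fn soundex_digit(c: char) -> Option<char> {
    match c.to_ascii_uppercase() {
        'B' | 'F' | 'P' | 'V' => Some('1'),
        'C' | 'G' | 'J' | 'K' | 'Q' | 'S' | 'X' | 'Z' => Some('2'),
        'D' | 'T' => Some('3'),
        'L' => Some('4'),
        'M' | 'N' => Some('5'),
        'R' => Some('6'),
        _ => None,
    }
}

/// Illustrative Soundex (not the rphonetic implementation).
fn simple_soundex(word: &str) -> String {
    let mut letters = word.chars().filter(|c| c.is_ascii_alphabetic());
    let first = match letters.next() {
        Some(c) => c.to_ascii_uppercase(),
        None => return String::new(),
    };
    let mut code = String::from(first);
    let mut previous = soundex_digit(first);
    for c in letters {
        if code.len() == 4 {
            break;
        }
        let upper = c.to_ascii_uppercase();
        // H and W are skipped without resetting the previous digit.
        if upper == 'H' || upper == 'W' {
            continue;
        }
        let digit = soundex_digit(upper);
        if let Some(d) = digit {
            // Adjacent letters with the same digit are encoded once.
            if digit != previous {
                code.push(d);
            }
        }
        // A vowel resets `previous`, so a repeated digit after it is kept.
        previous = digit;
    }
    // Pad short codes with zeros up to four characters.
    while code.len() < 4 {
        code.push('0');
    }
    code
}

fn main() {
    assert_eq!(simple_soundex("jumped"), "J513");
    assert_eq!(simple_soundex("Robert"), "R163");
}
```

Walking through "jumped": `J` is kept, `u` and `e` are vowels, and `m`, `p`, `d` map to 5, 1 and 3, which gives `J513`.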

Soundex (Refined)

fn main() {
    use rphonetic::{Encoder, RefinedSoundex};
    
    let refined_soundex = RefinedSoundex::default();
    assert_eq!(refined_soundex.encode("jumped"), "J408106");
}

Benchmarking

Benchmarks use Criterion.

They were done on an Intel® Core™ i7-4720HQ with 16GB RAM.

To run the benches against the main baseline:

cargo bench --bench benchmark -- --baseline main

To replace the main baseline:

cargo bench --bench benchmark -- --save-baseline main

Do not run Criterion benches on CI.

Dependencies

~3.5–5.5MB
~100K SLoC