rphonetic

Rust port of phonetic Apache commons-codec algorithms

11 stable releases

2.1.4 Jan 30, 2024
2.1.2 Nov 26, 2023
2.1.1 Sep 25, 2023
2.1.0 May 10, 2023
1.2.0 Jul 31, 2022

#95 in Text processing


297 downloads per month
Used in tantivy-analysis-contrib

Apache-2.0

445KB
10K SLoC


Rust phonetic

This is a Rust port of the phonetic algorithms from Apache commons-codec v1.15.

Algorithms

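Currently, the following algorithms are available: Beider-Morse, Caverphone 1 & 2, Cologne, Daitch-Mokotoff, Match Rating Approach, Metaphone, Double Metaphone, Phonex, Nysiis, Soundex and Refined Soundex.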

Please note that most of these algorithms are designed for the Latin alphabet, and they usually target a specific use case (e.g. English names, English dictionary words, etc.).
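Because of this restriction, it can be useful to screen input before encoding it. The following is a minimal sketch using only the standard library. The helpers `is_latin_letter` and `is_latin_text` are hypothetical and are not part of rphonetic, and the cut-off at U+024F (the end of the Latin Extended-B block) is a rough heuristic, not a full script detection.

```rust
/// Hypothetical helper (not part of rphonetic): treats any alphabetic
/// character up to U+024F (end of Latin Extended-B) as a Latin letter.
/// This is an approximation, not a real Unicode script check.
fn is_latin_letter(c: char) -> bool {
    c.is_alphabetic() && c <= '\u{024F}'
}

/// Hypothetical helper: returns true when the input contains at least one
/// letter and every letter in it passes `is_latin_letter`. Spaces, digits
/// and punctuation are ignored.
fn is_latin_text(input: &str) -> bool {
    let mut saw_letter = false;
    for c in input.chars() {
        if is_latin_letter(c) {
            saw_letter = true;
        } else if c.is_alphabetic() {
            // A letter from another script (Cyrillic, Greek, ...).
            return false;
        }
    }
    saw_letter
}
```

With this sketch, `is_latin_text("Van Helsing")` and `is_latin_text("m\u{00FC}ller")` are accepted, while Cyrillic input or an empty string is rejected. Whether a given encoder produces a meaningful code for accented letters still depends on the algorithm.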

Examples

Beider-Morse

fn main() -> Result<(), rphonetic::PhoneticError> {
    use std::path::PathBuf;
    use rphonetic::{BeiderMorseBuilder, ConfigFiles, Encoder};

    let config_files = ConfigFiles::new(&PathBuf::from("./test_assets/cc-rules/"))?;
    let builder = BeiderMorseBuilder::new(&config_files);
    let beider_morse = builder.build();

    assert_eq!(beider_morse.encode("Van Helsing"), "(Ylznk|ilzn|ilznk|xilzn|xilznk)-(banilznk|bonilznk|fYnYlznk|fYnilznk|fanYlznk|fanilznk|fonYlznk|fonilznk|vYnYlznk|vYnilznk|vanYlznk|vaniilznk|vanilzn|vanilznk|vonYlznk|voniilznk|vonilzn|vonilznk)");
    Ok(())
}

Caverphone 1 & 2

fn main() {
    use rphonetic::{Caverphone1, Encoder};

    let caverphone = Caverphone1;
    assert_eq!(caverphone.encode("Thompson"), "TMPSN1");
}

fn main() {
    use rphonetic::{Caverphone2, Encoder};

    let caverphone = Caverphone2;
    assert_eq!(caverphone.encode("Thompson"), "TMPSN11111");
}

Cologne

fn main() {
    use rphonetic::{Cologne, Encoder};

    let cologne = Cologne;
    assert_eq!(cologne.encode("m\u{00FC}ller"), "657");
}

Daitch-Mokotoff

fn main() -> Result<(), rphonetic::PhoneticError> {
    use rphonetic::{DaitchMokotoffSoundex, DaitchMokotoffSoundexBuilder, Encoder};

    const COMMONS_CODEC_RULES: &str = include_str!("./rules/dmrules.txt");

    let encoder = DaitchMokotoffSoundexBuilder::with_rules(COMMONS_CODEC_RULES).build()?;
    assert_eq!(encoder.soundex("Rosochowaciec"), "944744|944745|944754|944755|945744|945745|945754|945755");
    Ok(())
}

Match Rating Approach

fn main() {
    use rphonetic::{Encoder, MatchRatingApproach};
    
    let match_rating = MatchRatingApproach;
    assert_eq!(match_rating.encode("Smith"), "SMTH");
}

Metaphone

fn main() {
    use rphonetic::{Encoder, Metaphone};
    
    let metaphone = Metaphone::default();
    assert_eq!(metaphone.encode("Joanne"), "JN");
}

Metaphone (Double)

fn main() {
    use rphonetic::{DoubleMetaphone, Encoder};

    let double_metaphone = DoubleMetaphone::default();
    assert_eq!(double_metaphone.encode("jumped"), "JMPT");
    assert_eq!(double_metaphone.encode_alternate("jumped"), "AMPT");
}

Phonex

fn main() {
    use rphonetic::{Phonex, Encoder};

    // Strict
    let phonex = Phonex::default();
    assert_eq!(phonex.encode("William"), "W450");
}

Nysiis

fn main() {
    use rphonetic::{Nysiis, Encoder};

    // Strict
    let nysiis = Nysiis::default();
    assert_eq!(nysiis.encode("WESTERLUND"), "WASTAR");

    // Not strict
    let nysiis = Nysiis::new(false);
    assert_eq!(nysiis.encode("WESTERLUND"), "WASTARLAD");
}

Soundex

fn main() {
    use rphonetic::{Encoder, Soundex};

    let soundex = Soundex::default();
    assert_eq!(soundex.encode("jumped"), "J513");
}
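
To illustrate what a code such as "J513" represents, here is a simplified, standard-library-only sketch of classic American Soundex. It is not the crate's implementation and it ignores the configuration options a full encoder may offer. The names `soundex_digit` and `simple_soundex` are hypothetical.

```rust
/// Map a letter to its Soundex digit. Vowels and H, W, Y have no digit.
fn soundex_digit(c: char) -> Option<char> {
    match c.to_ascii_uppercase() {
        'B' | 'F' | 'P' | 'V' => Some('1'),
        'C' | 'G' | 'J' | 'K' | 'Q' | 'S' | 'X' | 'Z' => Some('2'),
        'D' | 'T' => Some('3'),
        'L' => Some('4'),
        'M' | 'N' => Some('5'),
        'R' => Some('6'),
        _ => None,
    }
}

/// Simplified American Soundex (hypothetical helper, not rphonetic's code):
/// the first letter followed by three digits, padded with zeros.
/// Non-ASCII-alphabetic characters are dropped; empty input yields "".
fn simple_soundex(word: &str) -> String {
    let mut letters = word.chars().filter(|c| c.is_ascii_alphabetic());
    let first = match letters.next() {
        Some(c) => c.to_ascii_uppercase(),
        None => return String::new(),
    };
    let mut code = String::from(first);
    // Digit of the previous letter, used to collapse adjacent duplicates.
    let mut last = soundex_digit(first);
    for c in letters {
        let upper = c.to_ascii_uppercase();
        // H and W are transparent: they do not separate equal digits.
        if upper == 'H' || upper == 'W' {
            continue;
        }
        let digit = soundex_digit(upper);
        if let Some(d) = digit {
            if digit != last {
                code.push(d);
                if code.len() == 4 {
                    break;
                }
            }
        }
        // Vowels reset `last` to None, so a repeated digit after a vowel counts.
        last = digit;
    }
    while code.len() < 4 {
        code.push('0');
    }
    code
}
```

In this sketch, `simple_soundex("jumped")` gives "J513", and names that sound alike, such as "Robert" and "Rupert", share the code "R163". That shared code is what makes such encodings useful for fuzzy name matching.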

Soundex (Refined)

fn main() {
    use rphonetic::{Encoder, RefinedSoundex};
    
    let refined_soundex = RefinedSoundex::default();
    assert_eq!(refined_soundex.encode("jumped"), "J408106");
}

Benchmarking

Benchmarks use Criterion.

They were run on an Intel® Core™ i7-4720HQ with 16 GB of RAM.

To run the benches against the main baseline:

cargo bench --bench benchmark -- --baseline main

To replace the main baseline:

cargo bench --bench benchmark -- --save-baseline main

Do not run Criterion benches on CI: shared CI runners tend to produce noisy, unreliable timings.

Dependencies

~3.5–5.5MB
~100K SLoC