3 releases

Uses new Rust 2024

0.1.2 Feb 5, 2026
0.1.1 Jan 30, 2026
0.1.0 Jan 21, 2026

#72 in Internationalization (i18n)

Apache-2.0

475KB
20K SLoC

budoux-phf-rs

Rust implementation of BudouX, the machine learning-based line break organizer tool.

Features

  • Zero runtime dictionary loading: Uses PHF (Perfect Hash Functions) to embed dictionaries as compile-time lookup tables
  • Fast and efficient: PHF provides O(1) lookup with minimal memory overhead
  • No external dependencies at runtime: All data is baked into the binary
  • Multiple language support: Japanese (ja), Simplified Chinese (zh-hans), Traditional Chinese (zh-hant), Thai (th)

Installation

Add this to your Cargo.toml:

[dependencies]
budoux-phf-rs = "0.1"

Usage

Basic Usage

use budoux_phf_rs::Parser;

fn main() {
    // Create a parser with Japanese model
    let parser = Parser::japanese_parser();

    let text = "今日は天気です。";
    let chunks: Vec<&str> = parser.parse(text);

    println!("{:?}", chunks);
    // => ["今日は", "天気です。"]
}

Other Languages

use budoux_phf_rs::Parser;

fn main() {
    // Simplified Chinese
    let parser_zh_hans = Parser::simplified_chinese_parser();

    // Traditional Chinese
    let parser_zh_hant = Parser::traditional_chinese_parser();

    // Thai
    let parser_th = Parser::thai_parser();
}

Custom Model

use budoux_phf_rs::{Model, Parser, ScoreMap};

// You can use `codegen` to convert from json to a model.
const MY_MODEL: Mode = Model {
    total_score: 2552,
    uw1: &UW1,
    uw2: &UW2,
    uw3: &UW3,
    uw4: &UW4,
    uw5: &UW5,
    uw6: &UW6,
    bw1: &BW1,
    bw2: &BW2,
    bw3: &BW3,
    tw1: &TW1,
    tw2: &TW2,
    tw3: &TW3,
    tw4: &TW4,
};
static UW1: ScoureMap = phf::Map { ...  };
static UW2: ScoureMap = phf::Map { ...  };
...

fn main() {
    let parser = Parser { model: MY_MODEL };
}

Feature Flags

By default, all language models are included. You can select specific languages to reduce binary size:

[dependencies]
# Include only Japanese
budoux_phf_rs = { version = "0.1", default-features = false, features = ["ja"] }

# Include Japanese and Simplified Chinese
budoux_phf_rs = { version = "0.1", default-features = false, features = ["ja", "zh_hans"] }

Available features:

Feature Language Description
ja Japanese Japanese model
zh_hans Simplified Chinese Simplified Chinese model
zh_hant Traditional Chinese Traditional Chinese model
th Thai Thai model

Build model

$ cargo run -p codegen <path/to/budoux/budoux/models> lib/src/

License

Licensed under the Apache License, Version 2.0. See LICENSE for details.

Acknowledgments

Dependencies

~270–670KB
~16K SLoC