#markov-chain #name #syllable #generate #random #place #pronunciation

name-engine

A Rust library for computing Markov chains to generate random names based on pronunciation

1 unstable release

0.1.0 Feb 20, 2024

#504 in Algorithms

MPL-2.0 license

33KB
277 lines

name-engine

Preview: Generating English Place Names examples/england_evaluated.rs

placename-generation

name-engine is a basic library for computing Markov chains to generate names based on their pronunciation.

This can be used for various purposes, but primarily for generating place names.

Algorithm

This library computes Markov chains from a dataset of names. The names must be separated by certain user-defined rules, such as syllables. Each of the separated units is treated as a state of the Markov chain.

The transition is defined as the connection between the pronunciations. For example:

  • ŋ -> w in Ringwood /ˈrɪŋwʊd/ [(Ring /ˈrɪŋ/) (wood /wʊd/)]

  • k -> ə in Beccles /ˈbɛkəlz/ [(Becc /ˈbɛk/) (les /əlz/)]

  • k -> ə and m -> s in Berkhamsted /ˈbɜːrkəmstɛd/ [(Berk /ˈbɜːrk/) (ham /əm/) (sted /stɛd/)]

With the data adove, the model can generate Berkles from (Berk /ˈbɜːrk/) and (les /əlz/) by tracking the transition k -> ə.

The probability of the transition is calculated from the frequency of the connection in the dataset.

Features

This library does:

  • Create name generator from dataset of separated names.
  • Generate names using Markov chains.

This library DOES NOT:

  • Read and parse data from a file.
  • Automatically separate original names according to specific rules, such as syllables. You must prepare the dataset yourself.
  • Evaluate names. If you want to generate better names, you must implement the evaluation function and filtering process by yourself.
  • Combine another parameters. If you want to do, NameGenerator::generate_verbose is useful.

This library only does the minimal processing necessary to generate names. To create a more practical name generator, some additional processing like above will be required.

Documentation

Run cargo doc --open to see the documentation.

If you want to try it out, see the examples in examples/. For the first step, examples/japanese.rs is suitable for reading.

Installation

[dependencies]
name-engine = "0.1.0"

Examples

Generate 100 place names of Hokkaido

$ cargo run --example hokkaido
中富 nakatomi
初威冠 shoikappu
上沢 kamizawa

Generate 100 place names of England

$ cargo run --example england
Stoneon /ˈstəʊnən/
Thatchingworth /ˈθætʃɪŋwɜːθ/
Brentgomley /ˈbrɛntɡʌmli/

Generate 100 place names of England (extracted better ones)

$ cargo run --example england_evaluated
Oltham Abbey /ˈoʊlθəm ˈæbi/
Downbury /ˈdaʊnbəri/
Farhead /ˈfɑːrhɛd/

Generate 100 place names of US (extracted better ones)

$ cargo run --example us_evaluated
Winfield /ˈwɪnfiːld/
Perton /ˈpɛrtən/
Kinbridge Falls /ˈkɪnbrɪdʒ fɔːlz/

About the English and US place name data for the examples

For English and US place name data, some symbols are added for better results.

  • Spaces are replaced by + and treated as independent syllables.
  • For the syllable with capital letter, an asterisk * is added at the beginning of the pronunciation to become the first syllable of the name or the next syllable of +.
  • For the pronunciation of the previous syllable of +, an asterisk * is added at the end of the pronunciation to become the previous syllable of +.

Example

Tunbridge Wells /ˈtʌnbrɪdʒ ˈwɛlz/
(Tun, /*ˈtʌn/) (bridge, /brɪdʒ*/) (+, /+/) (Wells, /*ˈwɛlz/)
  • (Tun /ˈtʌn/) -> (Tun /*ˈtʌn/) [2]
  • (bridge /brɪdʒ/) -> (bridge /brɪdʒ*/) [3]
  • (+ /+/) [1]
  • (Wells /ˈwɛlz/) -> (Wells /*ˈwɛlz/) [2]

Moreover, some suffexes are treated as independent syllables, such as minster and bridge.

Data Source

examples/assets/hokkaido.csv: Hokkaido Government Opendata CC-BY4.0(https://creativecommons.org/licenses/by/4.0/deed.ja) Modified from the original data.

Source: https://www.pref.hokkaido.lg.jp/link/shichoson/aiueo.html

License

This project is licensed under the Mozilla Public License v2.0. See the LICENSE file for details.

Dependencies

~320–780KB
~18K SLoC