RML Core – Rust Language Model

rml-core is a simple N-gram language model implemented in Rust. The name is a play on "LLM" (Large Language Model) and stands for "Rust Language Model".


🧠 Overview

This project implements a character-level N-gram language model with a basic neural architecture (a single hidden layer). It uses a fixed context of 4 characters to predict the next character in a sequence.
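
The README describes the network only at this level of detail. As a purely illustrative sketch (the field names, the one-hot input encoding, and the layout below are assumptions, not rml-core's actual internals), a one-hidden-layer model over a 4-character context can be pictured like this:

// Hypothetical layout of a one-hidden-layer character model with a fixed
// 4-character context. Sizes follow the Technical Details table below and
// assume each context character is one-hot encoded.
struct NGramNet {
    vocab_size: usize,   // number of allowed characters
    w_hidden: Vec<f32>,  // 128 x (4 * vocab_size) input-to-hidden weights
    b_hidden: Vec<f32>,  // 128 hidden biases
    w_output: Vec<f32>,  // vocab_size x 128 hidden-to-output weights
    b_output: Vec<f32>,  // vocab_size output biases
}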


✨ Features

  • Train a language model on any text file
  • Generate text based on a seed string
  • Supports letters, numbers, and basic punctuation
  • Configurable training parameters (e.g. number of epochs)
  • Save and load trained models

🚀 Installation

Add the dependency to your Cargo.toml:

[dependencies]
rml-core = "0.1.0"

🧪 Usage

📌 Train a model

cargo run --bin train path/to/input.txt path/to/output/model [epochs]

Example:

cargo run --bin train data/shakespeare.txt model.bin 10

📌 Generate text

cargo run --bin generate path/to/model "Seed Text" [length]

Example:

cargo run --bin generate model.bin "To be" 200

📚 Use as a library

use rml_core::{NGramModel, prepare_training_data};

// Training: filter the text and build (context, target) pairs.
let text = std::fs::read_to_string("data/input.txt").unwrap();
let training_data = prepare_training_data(&text);
let mut model = NGramModel::new();

// Each full pass over the pairs is one epoch.
for (context, target) in training_data {
    model.train(&context, target);
}

model.save("model.bin").unwrap();

// Generation: load a trained model, then drive it yourself.
let mut model = NGramModel::load("model.bin").unwrap();
// Use model.forward() plus your own sampling loop (see the sketch below).
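
The generation side is left open above ("use model.forward() and sampling logic"). Below is a rough sketch of what such a loop could look like. It is generic over the forward function because this README does not show forward's actual signature; the vocabulary ordering and the greedy argmax choice are likewise assumptions made for illustration only.

// Illustrative only: slide a 4-character window over the generated text and
// repeatedly ask a forward function for the next-character distribution.
// `forward` is assumed to map a context string to one probability per
// vocabulary character; with rml-core you might pass |ctx| model.forward(ctx).
const VOCAB: &str = "abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789 .,!?";

fn generate(forward: impl Fn(&str) -> Vec<f32>, seed: &str, length: usize) -> String {
    let mut out: Vec<char> = seed.chars().collect();
    for _ in 0..length {
        // Context is the last 4 characters generated so far.
        let start = out.len().saturating_sub(4);
        let context: String = out[start..].iter().collect();
        let probs = forward(&context);

        // Greedy argmax keeps the sketch dependency-free; the crate itself
        // samples with a temperature of 0.3 (see Technical Details below).
        let next = probs
            .iter()
            .enumerate()
            .max_by(|a, b| a.1.partial_cmp(b.1).unwrap())
            .and_then(|(i, _)| VOCAB.chars().nth(i))
            .unwrap_or(' ');
        out.push(next);
    }
    out.into_iter().collect()
}

With a trained model this might be invoked as generate(|ctx| model.forward(ctx), "To be", 200), assuming forward borrows the context string and returns a Vec<f32> of probabilities.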

⚙️ How It Works

  1. Preprocessing: The input text is filtered to include only allowed characters (ASCII a–z, A–Z, 0–9, punctuation).
  2. Training Data: Generates (context, target) pairs where the context is 4 characters long (see the sketch after this list).
  3. Training: The model learns to predict the next character using backpropagation.
  4. Generation: Given a seed, the model predicts the next character and slides the context window forward.
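
As a rough illustration of steps 1 and 2 (not the crate's actual prepare_training_data, whose exact filtering rules are not shown in this README), a sliding-window pair builder could look like this:

// Illustrative only: keep a restricted character set and pair every
// 4-character window with the character that follows it.
fn build_pairs(text: &str) -> Vec<(String, char)> {
    // Whether whitespace is kept is not specified in this README; this
    // sketch keeps plain spaces so words stay separated.
    let filtered: Vec<char> = text
        .chars()
        .filter(|c| c.is_ascii_alphanumeric() || c.is_ascii_punctuation() || *c == ' ')
        .collect();

    filtered
        .windows(5)
        .map(|w| {
            // First 4 characters are the context, the 5th is the target.
            let context: String = w[..4].iter().collect();
            (context, w[4])
        })
        .collect()
}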

🧬 Technical Details

Component             Value
--------------------  --------------------------------
Context Size          4 characters
Hidden Layer          128 neurons
Learning Rate         0.005
Sampling Temperature  0.3 (conservative)
Vocabulary            a–z, A–Z, 0–9, punctuation
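
A sampling temperature of 0.3 sharpens the distribution before a character is drawn, so generation stays close to the most likely continuations. The snippet below only illustrates what temperature scaling means; it is not the crate's own sampling code:

// Raise each probability to the power 1/T and renormalize. T < 1 sharpens
// the distribution (conservative output), T > 1 flattens it (more varied).
fn apply_temperature(probs: &[f32], temperature: f32) -> Vec<f32> {
    let scaled: Vec<f32> = probs.iter().map(|p| p.powf(1.0 / temperature)).collect();
    let sum: f32 = scaled.iter().sum();
    scaled.into_iter().map(|p| p / sum).collect()
}

With T = 0.3 the exponent 1/T is roughly 3.3, so the highest-probability characters dominate, which matches the "conservative" note in the table above.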

🔍 Example

# Train on Shakespeare for 10 epochs
cargo run --bin train data/shakespeare.txt shakespeare_model 10

# Generate 200 characters using "To be" as the seed
cargo run --bin generate shakespeare_model "To be" 200

🤝 Contributing

Contributions are welcome! Please feel free to open an issue or submit a pull request.


📄 License

This project is licensed under the MIT License.

