hermes-llm

Train Large Language Models from scratch in Rust using Candle.

Features

  • Model Architecture Language (MAL): Define any transformer architecture using a composable DSL
  • Transformer Architecture: Configurable attention (GQA, sliding window), normalization, and FFN
  • Well-Known Models: Bundled architectures (nano, tiny, GPT-2, LLaMA, Mistral)
  • BPE Tokenizer Training: Uses the HuggingFace tokenizers library
  • Training Infrastructure: AdamW optimizer, gradient clipping, checkpointing, interruptible training
  • Text Generation: Temperature sampling, top-k sampling
  • Distributed Training: Multi-GPU support with NCCL
  • Backend Support: CPU, CUDA, Metal (Apple Silicon), Accelerate

Installation

# CPU only (default)
cargo build --release -p hermes-llm

# With CUDA support
cargo build --release -p hermes-llm --features cuda

# With Metal support (macOS)
cargo build --release -p hermes-llm --features metal

# With Accelerate (macOS)
cargo build --release -p hermes-llm --features accelerate

Usage

Train a tokenizer

hermes-llm train-tokenizer \
  --input data/corpus.txt \
  --output tokenizer.json \
  --vocab-size 32000

Train a model

# Using a well-known model
hermes-llm train \
  --data data/corpus.txt \
  --tokenizer tokenizer.json \
  --model tiny \
  --output checkpoints

# Or use full well-known path
hermes-llm train \
  --model well-known/mistral-7b.mal \
  ...

# Or use a custom .mal file
hermes-llm train \
  --model my_custom_model.mal \
  ...

Well-known models: nano, tiny, gpt2-small, gpt2-medium, gpt2-large, llama-small, llama-7b, mistral-7b

Generate text

hermes-llm generate \
  --checkpoint checkpoints/checkpoint_epoch_10.safetensors \
  --config checkpoints/config.json \
  --tokenizer tokenizer.json \
  --prompt "Once upon a time" \
  --max-tokens 100 \
  --temperature 0.8
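
Roughly, --temperature rescales the next-token logits before sampling, and top-k sampling (listed under Features) restricts the choice to the k most likely tokens. Below is a minimal standalone sketch of that logic using the rand crate; it is an illustration of the technique, not hermes-llm's actual sampler.

use rand::Rng;

// Pick the next token id from raw logits with temperature + top-k sampling.
// Illustrative sketch; the real sampler in hermes-llm may differ in details.
fn sample_next_token(logits: &[f32], temperature: f32, top_k: usize) -> usize {
    // Temperature < 1.0 sharpens the distribution, > 1.0 flattens it.
    let scaled: Vec<f32> = logits.iter().map(|&l| l / temperature).collect();

    // Keep only the k highest-scoring token ids.
    let mut ids: Vec<usize> = (0..scaled.len()).collect();
    ids.sort_by(|&a, &b| scaled[b].partial_cmp(&scaled[a]).unwrap());
    ids.truncate(top_k);

    // Softmax over the surviving logits (shifted by the max for stability).
    let max = scaled[ids[0]];
    let weights: Vec<f32> = ids.iter().map(|&i| (scaled[i] - max).exp()).collect();
    let total: f32 = weights.iter().sum();

    // Sample a token id proportionally to its softmax weight.
    let mut r = rand::thread_rng().gen::<f32>() * total;
    for (&id, &w) in ids.iter().zip(&weights) {
        if r <= w {
            return id;
        }
        r -= w;
    }
    ids[0]
}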

Show model info

hermes-llm info --model gpt2-small

Multi-GPU Training (NCCL)

For distributed training, just add --num-gpus:

# Build with NCCL support
cargo build --release -p hermes-llm --features cuda --features nccl

# Single GPU
hermes-llm train --data corpus.jsonl --tokenizer tok.json --model gpt2-small

# 4 GPUs (automatically uses NCCL)
hermes-llm train --data corpus.jsonl --tokenizer tok.json --model gpt2-small --num-gpus 4
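
With more than one GPU, training is data-parallel: each GPU runs the same model on its own slice of the batch, and gradients are synchronized over NCCL (typically an all-reduce average) before each optimizer step:

g = (1 / num_gpus) × Σ g_i    (g_i = gradient computed on GPU i)

This is also why the effective batch size described further below scales with --num-gpus.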

Consumer GPUs (RTX 3090, 4090, etc.)

Consumer GPUs without NVLink don't support GPU peer-to-peer access. If you see errors like "peer access is not supported between these two devices", disable P2P and SHM:

NCCL_P2P_DISABLE=1 NCCL_SHM_DISABLE=1 hermes-llm train \
  --data corpus.jsonl --tokenizer tok.json --model gpt2-medium --num-gpus 3

Variable Description
NCCL_P2P_DISABLE=1 Disable direct GPU-to-GPU communication
NCCL_SHM_DISABLE=1 Disable shared memory (uses CUDA IPC which needs peer access)
NCCL_DEBUG=INFO Enable debug logging (troubleshooting)

Note: With both disabled, NCCL falls back to socket-based communication, which is slower but works on any multi-GPU setup.

Training Options

Option Default Description
--data (stdin) Training data file
--tokenizer required Tokenizer file path
--model tiny Model preset
--num-gpus 1 Number of GPUs (>1 enables NCCL)
--batch-size 32 Batch size per GPU
--grad-accum 1 Gradient accumulation steps
--epochs 1 Training epochs
--lr 3e-4 Learning rate
--output checkpoints Output directory

Effective Batch Size

effective_batch = batch_size × grad_accum × num_gpus

Example: --batch-size 32 --grad-accum 4 --num-gpus 4 = 512 effective batch

Fine-tuning

Continue training from a pre-trained checkpoint:

hermes-llm train \
  --checkpoint pretrained.safetensors \
  --data finetune-data.jsonl \
  --tokenizer tok.json \
  --model gpt2-small \
  --lr 1e-5 \
  --epochs 3

Fine-tuning Options

Option Description
--checkpoint Path to pre-trained weights (.safetensors)
--freeze-layers Number of layers to freeze from bottom (default: 0)
--lr Use lower LR for fine-tuning (e.g., 1e-5)

Freezing Layers

Freeze early layers to preserve general knowledge while adapting top layers:

hermes-llm train \
  --checkpoint pretrained.safetensors \
  --data domain-data.jsonl \
  --tokenizer tok.json \
  --freeze-layers 8 \
  --lr 5e-5
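
Conceptually, a frozen layer is simply excluded from the optimizer, so its weights never change during fine-tuning. Here is a small sketch of that selection logic, assuming parameter names of the form blocks.<i>.…; the naming scheme and helper below are illustrative, not hermes-llm internals.

// Decide whether a parameter should be trained when the first `freeze_layers`
// transformer blocks are frozen. Assumes names like "blocks.3.attn.q_proj.weight".
fn is_trainable(param_name: &str, freeze_layers: usize) -> bool {
    match param_name
        .strip_prefix("blocks.")
        .and_then(|rest| rest.split('.').next())
        .and_then(|idx| idx.parse::<usize>().ok())
    {
        Some(layer) => layer >= freeze_layers,
        None => true, // parameters outside the blocks (embeddings, norms, head)
    }
}

fn main() {
    assert!(!is_trainable("blocks.2.ffn.w1.weight", 8));       // frozen
    assert!(is_trainable("blocks.10.attn.q_proj.weight", 8));  // trained
    assert!(is_trainable("embed_tokens.weight", 8));           // trained
}

Only the parameters that pass such a filter would be handed to the AdamW optimizer; the rest keep their pre-trained values.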

Direct Preference Optimization (DPO)

Align your model to human preferences without a separate reward model:

hermes-llm dpo \
  --checkpoint sft-model.safetensors \
  --config checkpoints/config.json \
  --data preferences.jsonl \
  --tokenizer tok.json \
  --beta 0.1 \
  --lr 5e-7 \
  --epochs 1
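
For reference, DPO optimizes the standard preference objective (Rafailov et al., 2023); the exact implementation details in hermes-llm are not shown here, but --beta corresponds to β below:

loss = −log σ( β · log[π_θ(chosen|prompt) / π_ref(chosen|prompt)] − β · log[π_θ(rejected|prompt) / π_ref(rejected|prompt)] )

Here π_θ is the model being trained, π_ref is the frozen SFT checkpoint, and σ is the sigmoid. A higher β keeps the policy closer to the reference model (a stronger implicit KL penalty); a lower β lets it move further toward the preference data.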

Preference Data Format

JSONL file with prompt, chosen, and rejected fields:

{"prompt": "What is 2+2?", "chosen": "4", "rejected": "5"}
{"prompt": "Explain gravity:", "chosen": "Gravity is...", "rejected": "Idk lol"}

DPO Options

Option Default Description
--checkpoint required SFT model to start from
--config required Model config JSON
--data required Preference pairs (JSONL)
--beta 0.1 KL divergence penalty
--lr 5e-7 Learning rate (very low for DPO)
--max-len 512 Max sequence length
--output checkpoints-dpo Output directory

Model Configurations

Config Layers Hidden Heads Params (32K vocab)
nano 2 64 2 ~4M
tiny 4 128 4 ~9M
gpt2-small 12 768 12 ~124M
gpt2-medium 24 1024 16 ~355M
gpt2-large 36 1280 20 ~774M
llama-small 16 1024 16 ~268M
llama-7b 32 4096 32 ~7B

Note: Parameter count depends heavily on vocab size. Run hermes-llm info --model <name> for exact counts.

Model Architecture Language (MAL)

MAL is a composable DSL for defining LLM architectures. Models are built from reusable components: attention, ffn, and block.

Example

# my_model.mal

# Define attention mechanism
attention my_attn {
    num_heads: 16
    num_kv_heads: 4      # Grouped Query Attention
    bias: false
}

# Define FFN
ffn my_ffn {
    hidden_dim: 4096
    activation: swiglu
    bias: false
}

# Define transformer block
block my_block {
    attention: my_attn
    ffn: my_ffn
    norm: rmsnorm { eps: 1e-5 }
    norm_position: pre
    residual: true
}

# Define complete model
model my_model {
    description: "Custom model"
    vocab_size: 32000
    max_seq_len: 4096
    hidden_size: 1024
    num_layers: 16
    block: my_block
}

Use it with:

hermes-llm train --model my_model.mal --data corpus.jsonl --tokenizer tok.json

MAL Components

Component Properties
attention num_heads, num_kv_heads, head_dim, bias, dropout, causal, window_size
ffn hidden_dim, activation (swiglu/gelu/silu/relu), bias, dropout, gate
block attention, ffn, norm (rmsnorm/layernorm), norm_position (pre/post), residual
model vocab_size, hidden_size, max_seq_len, num_layers, block, description

Architecture

The model implements a modern transformer architecture:

  • Embeddings: Token embeddings (no position embeddings - uses RoPE)
  • Attention: Multi-head self-attention with RoPE (Rotary Position Embedding)
  • Normalization: RMSNorm (pre-normalization); see the sketch after this list
  • FFN: SwiGLU activation for LLaMA-style, GELU for GPT-style
  • Output: Tied embeddings with language modeling head
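
RMSNorm normalizes each hidden vector by its root-mean-square rather than by mean and variance. A scalar reference sketch of the formula (illustrative; the actual implementation operates on Candle tensors):

// RMSNorm: y_i = x_i / sqrt(mean(x_j^2) + eps) * weight_i
// Scalar reference version of the formula, applied per hidden vector.
fn rms_norm(x: &[f32], weight: &[f32], eps: f32) -> Vec<f32> {
    let mean_sq = x.iter().map(|v| v * v).sum::<f32>() / x.len() as f32;
    let scale = 1.0 / (mean_sq + eps).sqrt();
    x.iter()
        .zip(weight)
        .map(|(xi, wi)| xi * scale * wi)
        .collect()
}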

Library Usage

use hermes_llm::{Config, GPT, Trainer};
use hermes_llm::config::TrainingConfig;
use hermes_llm::data::{Dataset, DataLoader};
use hermes_llm::tokenizer::Tokenizer;
use candle_core::Device;

fn main() -> Result<(), Box<dyn std::error::Error>> {
    // Load or train tokenizer
    let tokenizer = Tokenizer::from_file("tokenizer.json")?;

    // Create model config
    let mut config = Config::tiny();
    config.vocab_size = tokenizer.vocab_size();

    // Load dataset
    let dataset = Dataset::from_file("data.txt", &tokenizer, 256)?;
    let mut loader = DataLoader::new(dataset, 32, true);

    // Create trainer
    let device = Device::Cpu;
    let training_config = TrainingConfig::default();
    let mut trainer = Trainer::new(config, training_config, device)?;

    // Train
    trainer.train(&mut loader, None, Some("checkpoints"))?;
    Ok(())
}

License

MIT
