Kentro - High-Performance K-Means Clustering Library


A high-performance Rust implementation of K-Means clustering algorithms. Kentro provides both standard and advanced K-Means variants with parallel processing support.

🚀 Features

  • Standard K-Means: Classic Lloyd's algorithm implementation
  • Spherical K-Means: Uses cosine similarity instead of Euclidean distance
  • Balanced K-Means: Ensures clusters have similar sizes using efficient balancing algorithms
  • K-Medoids: PAM (Partition Around Medoids) algorithm for robust clustering with actual data points as centers
  • Parallel Processing: Multi-threaded execution using Rayon
  • Flexible API: Builder pattern for easy configuration
  • Memory Efficient: Optimized for large datasets
  • Comprehensive Error Handling: Detailed error types and messages

📦 Installation

Add this to your Cargo.toml:

[dependencies]
kentro = "0.2"
ndarray = "0.15"

🔧 Quick Start

use kentro::KMeans;
use ndarray::Array2;

// Create sample data (100 points, 2 dimensions)
let data = Array2::from_shape_vec((100, 2), 
    (0..200).map(|x| x as f32).collect()).unwrap();

// Create and configure K-Means
let mut kmeans = KMeans::new(3)
    .with_iterations(50)
    .with_verbose(true);

// Train the model
let clusters = kmeans.train(data.view(), None).unwrap();

println!("Found {} clusters", clusters.len());
for (i, cluster) in clusters.iter().enumerate() {
    println!("Cluster {}: {} points", i, cluster.len());
}

📚 API Reference

Creating a K-Means Instance

use kentro::KMeans;

let kmeans = KMeans::new(3)                    // 3 clusters
    .with_iterations(50)                       // 50 iterations (default: 25)
    .with_euclidean(true)                      // Use Euclidean distance (default: false)
    .with_balanced(true)                       // Enable balanced clustering (default: false)
    .with_use_medoids(true)                    // Enable K-medoids clustering (default: false)
    .with_max_balance_diff(10)                 // Max cluster size difference (default: 16)
    .with_verbose(true);                       // Enable verbose output (default: false)

Training

// Train on your data
let clusters = kmeans.train(data.view(), Some(4))?; // Use 4 threads

// Returns Vec<Vec<usize>> where each inner vector contains 
// the indices of points assigned to that cluster

Assignment

// Assign new points to existing clusters
let assignments = kmeans.assign(new_data.view(), 1)?; // k=1 (nearest cluster)

// Multi-assignment (assign to k nearest clusters)
let multi_assignments = kmeans.assign(new_data.view(), 2)?; // k=2

Accessing Results

// Get centroids
if let Some(centroids) = kmeans.centroids() {
    println!("Centroids shape: {:?}", centroids.dim());
}

// Get medoid indices (when using K-medoids)
if let Some(medoids) = kmeans.medoid_indices() {
    println!("Medoid indices: {:?}", medoids);
}

// Check model state
println!("Trained: {}", kmeans.is_trained());
println!("Clusters: {}", kmeans.n_clusters());
println!("Euclidean: {}", kmeans.is_euclidean());
println!("Balanced: {}", kmeans.is_balanced());
println!("Using medoids: {}", kmeans.is_use_medoids());

🎯 Algorithm Variants

Standard (Spherical) K-Means

By default, kentro uses cosine similarity (the inner product) rather than Euclidean distance. This is well suited to high-dimensional data and text clustering.

let mut kmeans = KMeans::new(5);

Euclidean K-Means

Uses Euclidean distance instead of cosine similarity. Better suited to geometric data.

let mut kmeans = KMeans::new(5).with_euclidean(true);

Balanced K-Means

Ensures clusters have similar sizes using the algorithm from:

Rieke de Maeyer, Sami Sieranoja, and Pasi Fränti. "Balanced k-means revisited." Applied Computing and Intelligence, 3(2):145–179, 2023.

let mut kmeans = KMeans::new(5)
    .with_balanced(true)
    .with_max_balance_diff(10);

K-Medoids

Uses actual data points as cluster centers instead of computed centroids. More robust to outliers and provides interpretable cluster representatives.

let mut kmeans = KMeans::new(5)
    .with_use_medoids(true)
    .with_euclidean(true);

// After training, get the medoid indices
let clusters = kmeans.train(data.view(), None)?;
if let Some(medoids) = kmeans.medoid_indices() {
    println!("Medoid points: {:?}", medoids);
}

⚡ Performance Features

Parallel Processing

Kentro automatically uses all available CPU cores for cluster assignment. You can control the number of threads:

// Use 4 threads
let clusters = kmeans.train(data.view(), Some(4))?;

// Use all available cores (default)
let clusters = kmeans.train(data.view(), None)?;

Memory Efficiency

  • Uses ndarray for efficient matrix operations
  • Minimal memory allocations during iterations
  • Supports large datasets through optimized algorithms
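Both train and assign take ndarray views (ArrayView2) rather than owned arrays, so a single owned allocation can back every call without intermediate copies. A minimal sketch of this pattern (the shapes and values here are arbitrary):

use kentro::KMeans;
use ndarray::Array2;

fn views_example() -> Result<(), Box<dyn std::error::Error>> {
    // One owned allocation for the whole dataset
    let data = Array2::from_shape_vec((10_000, 16),
        (0..10_000 * 16).map(|i| (i % 97) as f32).collect())?;

    // `train` borrows a view of `data`; the points themselves are never copied
    let mut kmeans = KMeans::new(8);
    let clusters = kmeans.train(data.view(), None)?;
    println!("Trained {} clusters on a borrowed view", clusters.len());

    Ok(())
}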

πŸ” Examples

Basic Clustering

use kentro::KMeans;
use ndarray::Array2;

fn basic_example() -> Result<(), Box<dyn std::error::Error>> {
    // Generate sample data
    let data = Array2::from_shape_vec((300, 2), 
        generate_sample_data(300, 2))?;
    
    // Cluster into 3 groups
    let mut kmeans = KMeans::new(3).with_verbose(true);
    let clusters = kmeans.train(data.view(), None)?;
    
    // Print results
    for (i, cluster) in clusters.iter().enumerate() {
        println!("Cluster {}: {} points", i, cluster.len());
    }
    
    Ok(())
}

Balanced Clustering

fn balanced_example() -> Result<(), Box<dyn std::error::Error>> {
    let data = Array2::from_shape_vec((1000, 10), 
        generate_high_dim_data(1000, 10))?;
    
    let mut kmeans = KMeans::new(5)
        .with_balanced(true)
        .with_max_balance_diff(20)
        .with_iterations(100);
    
    let clusters = kmeans.train(data.view(), None)?;
    
    // Verify balance
    let sizes: Vec<usize> = clusters.iter().map(|c| c.len()).collect();
    let max_size = *sizes.iter().max().unwrap();
    let min_size = *sizes.iter().min().unwrap();
    
    println!("Cluster size range: {} - {} (diff: {})", 
             min_size, max_size, max_size - min_size);
    
    Ok(())
}
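The two examples above call generate_sample_data and generate_high_dim_data, which are not part of kentro. A minimal stand-in (deterministic values in place of real data, purely illustrative) could be:

// Hypothetical helpers for the examples above; replace with your own data source.
fn generate_sample_data(n_points: usize, n_dims: usize) -> Vec<f32> {
    // Deterministic pseudo-pattern standing in for real measurements
    (0..n_points * n_dims)
        .map(|i| ((i * 37) % 101) as f32 / 10.0)
        .collect()
}

fn generate_high_dim_data(n_points: usize, n_dims: usize) -> Vec<f32> {
    // Same idea with a different stride so the clusters are not degenerate
    (0..n_points * n_dims)
        .map(|i| ((i * 53) % 211) as f32 / 10.0)
        .collect()
}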

K-Medoids Clustering

fn medoids_example() -> Result<(), Box<dyn std::error::Error>> {
    // Generate sample data with some outliers
    let mut data_vec = vec![];
    
    // Cluster 1: around (1, 1)
    data_vec.extend_from_slice(&[1.0, 1.0, 1.1, 1.1, 1.2, 1.0, 0.9, 1.1]);
    // Cluster 2: around (5, 5) 
    data_vec.extend_from_slice(&[5.0, 5.0, 5.1, 5.1, 4.9, 5.0, 5.0, 4.9]);
    // Outlier
    data_vec.extend_from_slice(&[10.0, 1.0]);
    
    let data = Array2::from_shape_vec((9, 2), data_vec)?;
    
    // Use K-Medoids for robustness to outliers
    let mut kmeans = KMeans::new(3)
        .with_use_medoids(true)
        .with_euclidean(true)
        .with_verbose(true);
    
    let clusters = kmeans.train(data.view(), None)?;
    
    // Get the actual data points used as cluster centers
    if let Some(medoids) = kmeans.medoid_indices() {
        println!("Medoid points:");
        for (i, &medoid_idx) in medoids.iter().enumerate() {
            let medoid_point = data.row(medoid_idx);
            println!("  Cluster {}: Point {} [{:.1}, {:.1}]", 
                     i, medoid_idx, medoid_point[0], medoid_point[1]);
        }
    }
    
    Ok(())
}

Text Clustering (Spherical K-Means)

fn text_clustering_example() -> Result<(), Box<dyn std::error::Error>> {
    // Assume we have TF-IDF vectors
    let tfidf_vectors = load_tfidf_data()?; // Your TF-IDF data
    
    // Use spherical K-Means (cosine similarity)
    let mut kmeans = KMeans::new(10)
        .with_euclidean(false)  // Use cosine similarity
        .with_iterations(50);
    
    let clusters = kmeans.train(tfidf_vectors.view(), None)?;
    
    println!("Clustered {} documents into {} topics", 
             tfidf_vectors.nrows(), clusters.len());
    
    Ok(())
}

πŸ› οΈ Error Handling

Kentro provides comprehensive error handling:

use kentro::{KMeans, KMeansError};

match kmeans.train(data.view(), None) {
    Ok(clusters) => println!("Success: {} clusters", clusters.len()),
    Err(KMeansError::InsufficientPoints(n, k)) => {
        println!("Error: {} points < {} clusters", n, k);
    },
    Err(KMeansError::AlreadyTrained) => {
        println!("Error: Model already trained");
    },
    Err(e) => println!("Error: {}", e),
}

🧪 Testing

Run the test suite:

cargo test

Run with verbose output:

cargo test -- --nocapture

🚀 Running Examples

# Run the main example
cargo run --example simple

# Run the K-Medoids demo
cargo run --example medoids_demo

# Run with release optimizations
cargo run --example simple --release

📊 Benchmarks

Kentro is designed for high performance:

  • Parallel Processing: Scales with CPU cores
  • Memory Efficient: Minimal allocations during clustering
  • Optimized Algorithms: Based on proven efficient implementations
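No benchmark suite is shown here, but a rough timing comparison is easy to sketch with std::time::Instant (a micro-benchmark sketch, not a rigorous harness; absolute numbers depend on your hardware):

use kentro::KMeans;
use ndarray::Array2;
use std::time::Instant;

fn timing_sketch() -> Result<(), Box<dyn std::error::Error>> {
    let data = Array2::from_shape_vec((50_000, 32),
        (0..50_000 * 32).map(|i| ((i * 31) % 997) as f32).collect())?;

    // Compare single-threaded execution against all available cores
    for threads in [Some(1), None] {
        let mut kmeans = KMeans::new(16).with_iterations(25);
        let start = Instant::now();
        kmeans.train(data.view(), threads)?;
        println!("threads = {:?}: {:?}", threads, start.elapsed());
    }
    Ok(())
}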

📄 License

This project is licensed under the Apache 2.0 License - see the LICENSE file for details.
