# Kentro - High-Performance K-Means Clustering Library
A high-performance Rust implementation of K-Means clustering algorithms. Kentro provides both standard and advanced K-Means variants with parallel processing support.
## 🚀 Features
- Standard K-Means: Classic Lloyd's algorithm implementation
- Spherical K-Means: Uses cosine similarity instead of Euclidean distance
- Balanced K-Means: Ensures clusters have similar sizes using efficient balancing algorithms
- K-Medoids: PAM (Partition Around Medoids) algorithm for robust clustering with actual data points as centers
- Parallel Processing: Multi-threaded execution using Rayon
- Flexible API: Builder pattern for easy configuration
- Memory Efficient: Optimized for large datasets
- Comprehensive Error Handling: Detailed error types and messages
## 📦 Installation

Add this to your `Cargo.toml`:

```toml
[dependencies]
kentro = "0.1.0"
ndarray = "0.15"
```
## 🔧 Quick Start

```rust
use kentro::KMeans;
use ndarray::Array2;

fn main() {
    // Create sample data (100 points, 2 dimensions)
    let data = Array2::from_shape_vec(
        (100, 2),
        (0..200).map(|x| x as f32).collect(),
    ).unwrap();

    // Create and configure K-Means
    let mut kmeans = KMeans::new(3)
        .with_iterations(50)
        .with_verbose(true);

    // Train the model
    let clusters = kmeans.train(data.view(), None).unwrap();

    println!("Found {} clusters", clusters.len());
    for (i, cluster) in clusters.iter().enumerate() {
        println!("Cluster {}: {} points", i, cluster.len());
    }
}
```
## 📖 API Reference

### Creating a K-Means Instance

```rust
use kentro::KMeans;

let kmeans = KMeans::new(3)            // 3 clusters
    .with_iterations(50)               // 50 iterations (default: 25)
    .with_euclidean(true)              // use Euclidean distance (default: false)
    .with_balanced(true)               // enable balanced clustering (default: false)
    .with_use_medoids(true)            // enable K-Medoids clustering (default: false)
    .with_max_balance_diff(10)         // max cluster size difference (default: 16)
    .with_verbose(true);               // enable verbose output (default: false)
```
### Training

```rust
// Train on your data, using 4 threads
let clusters = kmeans.train(data.view(), Some(4))?;

// `train` returns Vec<Vec<usize>>, where each inner vector contains
// the indices of the points assigned to that cluster.
```
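Because `train` returns index lists rather than copies of the data, you can gather each cluster's points back out of the original matrix. A minimal sketch; `select` here is plain `ndarray` API, not part of Kentro:

```rust
use ndarray::{Array2, Axis};

// Collect the rows belonging to cluster 0 into a new owned matrix.
let cluster0: Array2<f32> = data.select(Axis(0), &clusters[0]);
println!("Cluster 0: {} points of dimension {}", cluster0.nrows(), cluster0.ncols());
```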
### Assignment

```rust
// Assign new points to existing clusters
let assignments = kmeans.assign(new_data.view(), 1)?; // k = 1 (nearest cluster)

// Multi-assignment (assign each point to its k nearest clusters)
let multi_assignments = kmeans.assign(new_data.view(), 2)?; // k = 2
```
### Accessing Results

```rust
// Get centroids
if let Some(centroids) = kmeans.centroids() {
    println!("Centroids shape: {:?}", centroids.dim());
}

// Get medoid indices (when using K-Medoids)
if let Some(medoids) = kmeans.medoid_indices() {
    println!("Medoid indices: {:?}", medoids);
}

// Check model state
println!("Trained: {}", kmeans.is_trained());
println!("Clusters: {}", kmeans.n_clusters());
println!("Euclidean: {}", kmeans.is_euclidean());
println!("Balanced: {}", kmeans.is_balanced());
println!("Using medoids: {}", kmeans.is_use_medoids());
```
## 🎯 Algorithm Variants

### Standard K-Means

Uses cosine similarity (the inner product) as the similarity measure. Suitable for high-dimensional data and text clustering.

```rust
let mut kmeans = KMeans::new(5);
```
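Since cosine similarity compares direction rather than magnitude, it is common to L2-normalize each row before training. A small sketch using plain `ndarray` operations (not a Kentro API); it assumes `f32` data and leaves zero rows untouched:

```rust
use ndarray::Array2;

// L2-normalize every row in place, so that the inner product of two rows
// equals their cosine similarity.
fn normalize_rows(data: &mut Array2<f32>) {
    for mut row in data.rows_mut() {
        let norm = row.dot(&row).sqrt();
        if norm > 0.0 {
            row /= norm;
        }
    }
}
```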
### Euclidean K-Means

Uses Euclidean distance as the distance metric. Better for geometric data.

```rust
let mut kmeans = KMeans::new(5).with_euclidean(true);
```
### Balanced K-Means

Ensures clusters have similar sizes, using the algorithm from:

> Rieke de Maeyer, Sami Sieranoja, and Pasi Fränti. "Balanced k-means revisited." Applied Computing and Intelligence, 3(2):145–179, 2023.

```rust
let mut kmeans = KMeans::new(5)
    .with_balanced(true)
    .with_max_balance_diff(10);
```
### K-Medoids

Uses actual data points as cluster centers instead of computed centroids. More robust to outliers, and provides interpretable cluster representatives.

```rust
let mut kmeans = KMeans::new(5)
    .with_use_medoids(true)
    .with_euclidean(true);

// After training, get the medoid indices
let clusters = kmeans.train(data.view(), None)?;
if let Some(medoids) = kmeans.medoid_indices() {
    println!("Medoid points: {:?}", medoids);
}
```
## ⚡ Performance Features

### Parallel Processing

Kentro automatically uses all available CPU cores for cluster assignment. You can also control the number of threads explicitly:

```rust
// Use 4 threads
let clusters = kmeans.train(data.view(), Some(4))?;

// Use all available cores (the default)
let clusters = kmeans.train(data.view(), None)?;
```
Memory Efficiency
- Uses
ndarray
for efficient matrix operations - Minimal memory allocations during iterations
- Supports large datasets through optimized algorithms
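One practical consequence of building on `ndarray`: staging data for Kentro does not require copying it. This is standard `ndarray` behavior rather than anything Kentro-specific, but a quick sketch:

```rust
use ndarray::Array2;

// from_shape_vec takes ownership of the buffer without copying elements,
// so wrapping an existing Vec as a matrix is essentially free.
let buf: Vec<f32> = vec![0.0; 100_000 * 8];
let data = Array2::from_shape_vec((100_000, 8), buf).unwrap();
assert_eq!(data.dim(), (100_000, 8));
```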
## 📊 Examples

### Basic Clustering

```rust
use kentro::KMeans;
use ndarray::Array2;

fn basic_example() -> Result<(), Box<dyn std::error::Error>> {
    // Generate sample data (generate_sample_data is a helper, not shown here,
    // that returns 300 * 2 = 600 f32 values)
    let data = Array2::from_shape_vec((300, 2), generate_sample_data(300, 2))?;

    // Cluster into 3 groups
    let mut kmeans = KMeans::new(3).with_verbose(true);
    let clusters = kmeans.train(data.view(), None)?;

    // Print results
    for (i, cluster) in clusters.iter().enumerate() {
        println!("Cluster {}: {} points", i, cluster.len());
    }
    Ok(())
}
```
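The helper above is not part of Kentro. If you want the example to compile as-is, here is a hypothetical `generate_sample_data` that produces two deterministic blobs without pulling in a random-number crate:

```rust
// Hypothetical helper: returns n * dims f32 values forming two offset blobs,
// alternating points between the blobs row by row.
fn generate_sample_data(n: usize, dims: usize) -> Vec<f32> {
    (0..n * dims)
        .map(|i| {
            let offset = if (i / dims) % 2 == 0 { 0.0 } else { 5.0 };
            offset + (i % 7) as f32 * 0.1
        })
        .collect()
}
```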
### Balanced Clustering

```rust
fn balanced_example() -> Result<(), Box<dyn std::error::Error>> {
    // generate_high_dim_data is a helper (not shown) returning 1000 * 10 f32 values
    let data = Array2::from_shape_vec((1000, 10), generate_high_dim_data(1000, 10))?;

    let mut kmeans = KMeans::new(5)
        .with_balanced(true)
        .with_max_balance_diff(20)
        .with_iterations(100);

    let clusters = kmeans.train(data.view(), None)?;

    // Verify balance
    let sizes: Vec<usize> = clusters.iter().map(|c| c.len()).collect();
    let max_size = *sizes.iter().max().unwrap();
    let min_size = *sizes.iter().min().unwrap();
    println!("Cluster size range: {} - {} (diff: {})",
        min_size, max_size, max_size - min_size);
    Ok(())
}
```
### K-Medoids Clustering

```rust
fn medoids_example() -> Result<(), Box<dyn std::error::Error>> {
    // Generate sample data with an outlier
    let mut data_vec = vec![];
    // Cluster 1: around (1, 1)
    data_vec.extend_from_slice(&[1.0, 1.0, 1.1, 1.1, 1.2, 1.0, 0.9, 1.1]);
    // Cluster 2: around (5, 5)
    data_vec.extend_from_slice(&[5.0, 5.0, 5.1, 5.1, 4.9, 5.0, 5.0, 4.9]);
    // Outlier
    data_vec.extend_from_slice(&[10.0, 1.0]);
    let data = Array2::from_shape_vec((9, 2), data_vec)?;

    // Use K-Medoids for robustness to outliers
    let mut kmeans = KMeans::new(3)
        .with_use_medoids(true)
        .with_euclidean(true)
        .with_verbose(true);
    let clusters = kmeans.train(data.view(), None)?;

    // Get the actual data points used as cluster centers
    if let Some(medoids) = kmeans.medoid_indices() {
        println!("Medoid points:");
        for (i, &medoid_idx) in medoids.iter().enumerate() {
            let medoid_point = data.row(medoid_idx);
            println!("  Cluster {}: Point {} [{:.1}, {:.1}]",
                i, medoid_idx, medoid_point[0], medoid_point[1]);
        }
    }
    Ok(())
}
```
### Text Clustering (Spherical K-Means)
```rust
fn text_clustering_example() -> Result<(), Box<dyn std::error::Error>> {
    // Assume we have TF-IDF vectors; load_tfidf_data stands in for your own loader
    let tfidf_vectors = load_tfidf_data()?;

    // Use spherical K-Means (cosine similarity)
    let mut kmeans = KMeans::new(10)
        .with_euclidean(false) // use cosine similarity
        .with_iterations(50);

    let clusters = kmeans.train(tfidf_vectors.view(), None)?;
    println!("Clustered {} documents into {} topics",
        tfidf_vectors.nrows(), clusters.len());
    Ok(())
}
```
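As with the earlier examples, `load_tfidf_data` is a placeholder. A hypothetical stand-in that returns a small dense matrix with one (already weighted) term vector per row, just to make the example self-contained:

```rust
use ndarray::Array2;

// Hypothetical loader: a 20-document, 50-term dense matrix with a
// deterministic 0/1 pattern standing in for TF-IDF weights.
fn load_tfidf_data() -> Result<Array2<f32>, Box<dyn std::error::Error>> {
    let (docs, vocab) = (20, 50);
    let values: Vec<f32> = (0..docs * vocab)
        .map(|i| if (i * 7) % 13 < 3 { 1.0 } else { 0.0 })
        .collect();
    Ok(Array2::from_shape_vec((docs, vocab), values)?)
}
```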
## 🛠️ Error Handling

Kentro provides comprehensive error handling:

```rust
use kentro::{KMeans, KMeansError};

match kmeans.train(data.view(), None) {
    Ok(clusters) => println!("Success: {} clusters", clusters.len()),
    Err(KMeansError::InsufficientPoints(n, k)) => {
        println!("Error: {} points < {} clusters", n, k);
    },
    Err(KMeansError::AlreadyTrained) => {
        println!("Error: model already trained");
    },
    Err(e) => println!("Error: {}", e),
}
```
## 🧪 Testing

Run the test suite:

```bash
cargo test
```

Run with verbose output:

```bash
cargo test -- --nocapture
```
## 🏃 Running Examples

```bash
# Run the main example
cargo run --example simple

# Run the K-Medoids demo
cargo run --example medoids_demo

# Run with release optimizations
cargo run --example simple --release
```
## 📈 Benchmarks

Kentro is designed for high performance:

- Parallel Processing: Scales with CPU cores
- Memory Efficient: Minimal allocations during clustering
- Optimized Algorithms: Based on proven, efficient implementations
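No numbers are published here, so if performance matters for your workload, measure on your own data. A minimal sketch of a Criterion benchmark (assumes `criterion` as a dev-dependency; the file name `benches/kmeans.rs`, data sizes, and parameters are illustrative):

```rust
// benches/kmeans.rs
use criterion::{criterion_group, criterion_main, Criterion};
use kentro::KMeans;
use ndarray::Array2;

fn bench_train(c: &mut Criterion) {
    // 10,000 points in 16 dimensions, filled with a deterministic pattern
    let data = Array2::from_shape_vec(
        (10_000, 16),
        (0..10_000 * 16).map(|x| (x % 97) as f32).collect(),
    )
    .unwrap();

    c.bench_function("train k=8", |b| {
        b.iter(|| {
            // A fresh model per iteration, since a model can only be trained once
            let mut kmeans = KMeans::new(8).with_iterations(10);
            kmeans.train(data.view(), None).unwrap()
        })
    });
}

criterion_group!(benches, bench_train);
criterion_main!(benches);
```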
## 📄 License
This project is licensed under the Apache 2.0 License - see the LICENSE file for details.