SciRS2 Clustering Module
A comprehensive clustering module for the SciRS2 scientific computing library in Rust. This crate provides production-ready implementations of various clustering algorithms with a focus on performance, SciPy compatibility, and idiomatic Rust code.
Production Readiness - Final Alpha Release
🎯 Version 0.1.0-alpha.5 is the final alpha release, ready for production use with:
- 189+ comprehensive tests covering all algorithms and edge cases
- Zero warnings policy enforced across all code and examples
- Full SciPy API compatibility maintained for seamless migration
- Extensive documentation with working examples for all features
- Performance optimizations including SIMD and parallel processing
Stability & Performance
Algorithm Maturity
- Core algorithms (K-means, Hierarchical, DBSCAN) are thoroughly tested and production-ready
- Advanced algorithms (Spectral, BIRCH, GMM, HDBSCAN) are fully implemented with comprehensive test coverage
- All APIs are stable and maintain backward compatibility with SciPy interfaces
Performance Characteristics
- Optimized Ward's method: O(n² log n) complexity vs standard O(n³)
- SIMD acceleration: Up to 4x faster distance computations on supported hardware
- Parallel processing: Multi-core implementations for K-means and hierarchical clustering
- Memory efficiency: Streaming and chunked processing for large datasets (>10M points)
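The streaming idea behind the memory-efficient mode can be sketched in plain Rust. This is an illustrative, std-only sketch of an incremental (chunked) centroid update, not the crate's actual implementation: each chunk is folded into a running mean, so only one chunk must be resident in memory at a time.

```rust
// Running (streaming) centroid update: process points chunk by chunk so the
// full dataset never has to fit in memory. Illustrative sketch only.
fn streaming_centroid(chunks: &[Vec<[f64; 2]>]) -> [f64; 2] {
    let mut centroid = [0.0f64; 2];
    let mut count = 0usize;
    for chunk in chunks {
        for point in chunk {
            count += 1;
            let n = count as f64;
            // Incremental mean: c += (x - c) / n (numerically stable form)
            centroid[0] += (point[0] - centroid[0]) / n;
            centroid[1] += (point[1] - centroid[1]) / n;
        }
    }
    centroid
}

fn main() {
    let chunks = vec![
        vec![[0.0, 0.0], [2.0, 2.0]], // first chunk
        vec![[4.0, 4.0]],             // second chunk
    ];
    let c = streaming_centroid(&chunks);
    println!("{:?}", c); // mean of (0,0), (2,2), (4,4) = [2.0, 2.0]
}
```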
Features
Vector Quantization
- K-means clustering with multiple initialization methods
- K-means++ smart initialization
- kmeans2 with SciPy-compatible interface
- Mini-batch K-means for large datasets
- Parallel K-means for multi-core systems
- Data whitening/normalization utilities
Hierarchical Clustering
- Agglomerative clustering with multiple linkage methods:
- Single linkage (minimum distance)
- Complete linkage (maximum distance)
- Average linkage
- Ward's method (minimizes variance)
- Centroid method (distance between centroids)
- Median method
- Weighted average
- Dendrogram utilities and flat cluster extraction
- Cluster distance metrics (Euclidean, Manhattan, Chebyshev, Correlation)
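The difference between the linkage methods comes down to how the distance between two clusters is defined. A small std-only sketch contrasting single linkage (minimum pairwise distance) with complete linkage (maximum pairwise distance); this is illustrative, and the crate's `linkage` function builds the full dendrogram from these definitions:

```rust
// Single vs. complete linkage between two clusters: single linkage is the
// minimum pairwise distance, complete linkage the maximum. Sketch only.
fn euclidean(a: &[f64; 2], b: &[f64; 2]) -> f64 {
    ((a[0] - b[0]).powi(2) + (a[1] - b[1]).powi(2)).sqrt()
}

fn linkage_distances(c1: &[[f64; 2]], c2: &[[f64; 2]]) -> (f64, f64) {
    let mut min = f64::INFINITY;
    let mut max = 0.0f64;
    for a in c1 {
        for b in c2 {
            let d = euclidean(a, b);
            min = min.min(d);
            max = max.max(d);
        }
    }
    (min, max) // (single linkage, complete linkage)
}

fn main() {
    let c1 = [[0.0, 0.0], [1.0, 0.0]];
    let c2 = [[4.0, 0.0], [6.0, 0.0]];
    let (single, complete) = linkage_distances(&c1, &c2);
    println!("single = {}, complete = {}", single, complete); // 3 and 6
}
```

Single linkage tends to chain elongated clusters together, while complete linkage favors compact, roughly spherical clusters; Ward's method instead merges the pair of clusters that minimizes the increase in total within-cluster variance.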
Density-Based Clustering
- DBSCAN (Density-Based Spatial Clustering of Applications with Noise)
- OPTICS (Ordering Points To Identify the Clustering Structure)
- HDBSCAN (Hierarchical DBSCAN)
- Support for custom distance metrics
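All three density-based algorithms build on the same primitive: a point is a "core point" if at least `min_samples` points (including itself) lie within radius `eps` of it. A std-only sketch of that check (illustrative; the crate's `dbscan` handles the full labeling, cluster expansion, and noise detection):

```rust
// DBSCAN core-point test: count neighbors within radius `eps` (including the
// point itself) and compare against `min_samples`. Sketch only.
fn is_core_point(points: &[[f64; 2]], idx: usize, eps: f64, min_samples: usize) -> bool {
    let p = points[idx];
    let neighbors = points
        .iter()
        .filter(|q| {
            let d = ((p[0] - q[0]).powi(2) + (p[1] - q[1]).powi(2)).sqrt();
            d <= eps
        })
        .count();
    neighbors >= min_samples
}

fn main() {
    let points = [[1.0, 1.0], [1.2, 1.1], [1.1, 0.9], [9.0, 9.0]];
    println!("{}", is_core_point(&points, 0, 0.5, 3)); // dense neighborhood: true
    println!("{}", is_core_point(&points, 3, 0.5, 3)); // isolated point: false
}
```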
Other Algorithms
- Mean-shift clustering
- Spectral clustering
- Affinity propagation
- BIRCH (Balanced Iterative Reducing and Clustering using Hierarchies)
- Gaussian Mixture Models (GMM)
Evaluation Metrics
- Silhouette coefficient
- Davies-Bouldin index
- Calinski-Harabasz index
- Adjusted Rand Index
- Normalized Mutual Information
- Homogeneity, Completeness, and V-measure
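The silhouette coefficient for a single point compares its mean distance `a` to the other members of its own cluster against its mean distance `b` to the nearest other cluster: `s = (b - a) / max(a, b)`, ranging from -1 (misassigned) to +1 (well separated). A std-only 1-D sketch of the per-point computation (illustrative; the crate's `silhouette_score` averages this over all points and all clusters):

```rust
// Silhouette coefficient for one point: a = mean distance to the other
// members of its own cluster, b = mean distance to the nearest other cluster,
// s = (b - a) / max(a, b). Sketch only, in one dimension.
fn silhouette_point(own: &[f64], other: &[f64], x: f64) -> f64 {
    let a: f64 = own.iter().map(|p| (x - p).abs()).sum::<f64>() / own.len() as f64;
    let b: f64 = other.iter().map(|p| (x - p).abs()).sum::<f64>() / other.len() as f64;
    (b - a) / a.max(b)
}

fn main() {
    // Point x = 0.0; own cluster members at {1.0, -1.0}; other cluster at {9.0, 11.0}.
    let s = silhouette_point(&[1.0, -1.0], &[9.0, 11.0], 0.0);
    println!("{}", s); // a = 1, b = 10 -> (10 - 1) / 10 = 0.9
}
```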
Installation
Add this to your Cargo.toml:
[dependencies]
scirs2-cluster = "0.1.0-alpha.5"
ndarray = "0.15"
To enable SIMD and parallel optimizations provided through the core module, add the corresponding feature flags:
[dependencies]
scirs2-cluster = { version = "0.1.0-alpha.5", features = ["parallel", "simd"] }
Usage
K-means Example
use ndarray::Array2;
use scirs2_cluster::vq::{kmeans, KMeansOptions, KMeansInit};
// Create a dataset
let data = Array2::from_shape_vec((6, 2), vec![
1.0, 2.0,
1.2, 1.8,
0.8, 1.9,
3.7, 4.2,
3.9, 3.9,
4.2, 4.1,
]).unwrap();
// Configure K-means
let options = KMeansOptions {
init_method: KMeansInit::KMeansPlusPlus,
max_iter: 300,
..Default::default()
};
// Run k-means with k=2
let (centroids, labels) = kmeans(data.view(), 2, Some(options)).unwrap();
println!("Centroids: {:?}", centroids);
println!("Cluster assignments: {:?}", labels);
kmeans2 (SciPy-compatible)
use scirs2_cluster::vq::{kmeans2, MinitMethod, MissingMethod, whiten};
// Whiten the data for better clustering (`data` as in the K-means example above)
let whitened_data = whiten(&data).unwrap();
// Run kmeans2 with different initialization methods
let (centroids, labels) = kmeans2(
whitened_data.view(),
3, // k clusters
Some(10), // iterations
Some(1e-4), // threshold
Some(MinitMethod::PlusPlus), // K-means++ initialization
Some(MissingMethod::Warn), // warn on empty clusters
Some(true), // check finite values
Some(42), // random seed
).unwrap();
Mini-batch K-means
use scirs2_cluster::vq::{minibatch_kmeans, MiniBatchKMeansOptions};
// Configure mini-batch K-means
let options = MiniBatchKMeansOptions {
batch_size: 1024,
max_iter: 100,
..Default::default()
};
// Run clustering on a large dataset (`large_data` is any large Array2<f64>)
let (centroids, labels) = minibatch_kmeans(large_data.view(), 5, Some(options)).unwrap();
Hierarchical Clustering Example
use ndarray::Array2;
use scirs2_cluster::hierarchy::{linkage, fcluster, LinkageMethod};
// Create a dataset
let data = Array2::from_shape_vec((6, 2), vec![
1.0, 2.0,
1.2, 1.8,
0.8, 1.9,
3.7, 4.2,
3.9, 3.9,
4.2, 4.1,
]).unwrap();
// Calculate linkage matrix using Ward's method
let linkage_matrix = linkage(data.view(), LinkageMethod::Ward, None).unwrap();
// Form flat clusters by cutting the dendrogram
let num_clusters = 2;
let labels = fcluster(&linkage_matrix, num_clusters, None).unwrap();
println!("Cluster assignments: {:?}", labels);
Evaluation Metrics
use scirs2_cluster::metrics::{silhouette_score, davies_bouldin_score, calinski_harabasz_score};
// Evaluate clustering quality
let silhouette = silhouette_score(data.view(), labels.view()).unwrap();
let db_score = davies_bouldin_score(data.view(), labels.view()).unwrap();
let ch_score = calinski_harabasz_score(data.view(), labels.view()).unwrap();
println!("Silhouette score: {}", silhouette);
println!("Davies-Bouldin score: {}", db_score);
println!("Calinski-Harabasz score: {}", ch_score);
DBSCAN Example
use ndarray::Array2;
use scirs2_cluster::density::{dbscan, labels};
// Create a dataset with clusters and noise
let data = Array2::from_shape_vec((8, 2), vec![
1.0, 2.0, // Cluster 1
1.5, 1.8, // Cluster 1
1.3, 1.9, // Cluster 1
5.0, 7.0, // Cluster 2
5.1, 6.8, // Cluster 2
5.2, 7.1, // Cluster 2
0.0, 10.0, // Noise
10.0, 0.0, // Noise
]).unwrap();
// Run DBSCAN with eps=0.8 and min_samples=2
let cluster_labels = dbscan(data.view(), 0.8, 2, None).unwrap();
// Count noise points
let noise_count = cluster_labels.iter().filter(|&&label| label == labels::NOISE).count();
println!("Cluster assignments: {:?}", cluster_labels);
println!("Number of noise points: {}", noise_count);
Documentation
- Algorithm Comparison Guide - Comprehensive guide to choosing the right clustering algorithm for your use case
Key Enhancements
Production-Ready SciPy Compatibility
- Complete API compatibility with SciPy's cluster module
- Drop-in replacement for most SciPy clustering functions
- Identical parameter names and behavior for seamless migration
- Compatible return value formats with proper error handling
High-Performance Computing
- SIMD acceleration with automatic fallback for unsupported hardware
- Multi-core parallelism via Rayon for CPU-intensive operations
- Memory-efficient streaming for datasets larger than available RAM
- Optimized algorithms (such as the O(n² log n) Ward linkage noted above) that can outperform reference implementations
Rust Ecosystem Advantages
- Memory safety without runtime overhead
- Zero-copy operations where possible for maximum efficiency
- Compile-time correctness with comprehensive type checking
- Predictable performance with no garbage collection pauses
License
This project is dual-licensed; you can choose to use either license. See the LICENSE file for details.
Contributing
Contributions are welcome! Please see the project's CONTRIBUTING.md file for guidelines.