#clustering #cluster #means #simd-vector

nightly kmeans

Small and fast library for k-means clustering calculations

3 unstable releases

0.10.0 Jun 21, 2024
0.2.1 Jan 3, 2024
0.2.0 Oct 8, 2020
0.1.0 Jul 27, 2019

#266 in Algorithms


228 downloads per month
Used in 2 crates

Apache-2.0

105KB
1.5K SLoC

kmeans


kmeans is a small and fast library for k-means clustering calculations. It requires a nightly compiler with the portable_simd feature to work.

Here is a small example that uses k-means++ for initialization and Lloyd's algorithm as the k-means variant:

use kmeans::*;

fn main() {
    let (sample_cnt, sample_dims, k, max_iter) = (20000, 200, 4, 100);

    // Generate some random data
    let mut samples = vec![0.0f64; sample_cnt * sample_dims];
    samples.iter_mut().for_each(|v| *v = rand::random());

    // Run k-means, using k-means++ as the initialization method
    // KMeans<_, 8> selects f64 SIMD vectors with 8 lanes (e.g. AVX-512)
    let kmean: KMeans<_, 8> = KMeans::new(samples, sample_cnt, sample_dims);
    let result = kmean.kmeans_lloyd(k, max_iter, KMeans::init_kmeanplusplus, &KMeansConfig::default());

    println!("Centroids: {:?}", result.centroids);
    println!("Cluster-Assignments: {:?}", result.assignments);
    println!("Error: {}", result.distsum);
}

Data structures

For performance reasons, all calculations are done on bare vectors, using hand-written SIMD code built on the nightly portable_simd feature (std::simd). Samples are stored row-major in a single flat buffer, so each sample occupies a consecutive block of memory.
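
The row-major layout can be illustrated with plain slices. The following is a minimal sketch in stable Rust, independent of the crate; the helper name sample_row is hypothetical and not part of the kmeans API.

```rust
/// Returns the slice holding sample `i` in a flat, row-major buffer.
/// `sample_row` is a hypothetical helper for illustration, not a kmeans crate function.
fn sample_row(samples: &[f64], sample_dims: usize, i: usize) -> &[f64] {
    &samples[i * sample_dims..(i + 1) * sample_dims]
}

fn main() {
    // Three samples with two dimensions each, stored back to back.
    let samples = vec![1.0, 2.0, 3.0, 4.0, 5.0, 6.0];
    let sample_dims = 2;

    // Sample 1 occupies indices 2..4 of the buffer.
    println!("{:?}", sample_row(&samples, sample_dims, 1)); // prints [3.0, 4.0]

    // chunks_exact walks the buffer one sample at a time.
    for (i, row) in samples.chunks_exact(sample_dims).enumerate() {
        println!("sample {i}: {row:?}");
    }
}
```

Because every sample is contiguous, a distance computation can stream through one sample's values without any striding.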

Supported variants / algorithms

  • Lloyd (standard k-means)
  • minibatch
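
For readers unfamiliar with the Lloyd variant listed above, one iteration can be sketched as follows. This is a scalar, textbook-style illustration on a row-major buffer, not the crate's SIMD implementation; the function names sq_dist and lloyd_step are assumptions made for this example.

```rust
/// Squared Euclidean distance between two equally long slices.
fn sq_dist(a: &[f64], b: &[f64]) -> f64 {
    a.iter().zip(b).map(|(x, y)| (x - y) * (x - y)).sum()
}

/// One Lloyd iteration (hypothetical helper, not the crate's API):
/// assign every sample to its nearest centroid, then move each centroid
/// to the mean of its assigned samples. Returns the assignments.
fn lloyd_step(samples: &[f64], dims: usize, centroids: &mut [f64]) -> Vec<usize> {
    let k = centroids.len() / dims;

    // Assignment step: nearest centroid by squared distance.
    let assignments: Vec<usize> = samples
        .chunks_exact(dims)
        .map(|s| {
            let mut best = 0;
            let mut best_dist = f64::INFINITY;
            for (c, centroid) in centroids.chunks_exact(dims).enumerate() {
                let d = sq_dist(s, centroid);
                if d < best_dist {
                    best = c;
                    best_dist = d;
                }
            }
            best
        })
        .collect();

    // Update step: accumulate per-cluster sums and counts.
    let mut sums = vec![0.0; k * dims];
    let mut counts = vec![0usize; k];
    for (s, &c) in samples.chunks_exact(dims).zip(&assignments) {
        counts[c] += 1;
        for (acc, v) in sums[c * dims..(c + 1) * dims].iter_mut().zip(s) {
            *acc += v;
        }
    }

    // Empty clusters keep their previous centroid.
    for c in 0..k {
        if counts[c] > 0 {
            for d in 0..dims {
                centroids[c * dims + d] = sums[c * dims + d] / counts[c] as f64;
            }
        }
    }
    assignments
}

fn main() {
    // Four one-dimensional samples forming two obvious groups.
    let samples = [0.0, 1.0, 10.0, 11.0];
    let mut centroids = [0.0, 10.0];
    let assignments = lloyd_step(&samples, 1, &mut centroids);
    println!("{:?} {:?}", assignments, centroids); // prints [0, 0, 1, 1] [0.5, 10.5]
}
```

The full algorithm repeats this step until the assignments stop changing or an iteration limit (max_iter in the example further up) is reached.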

Supported centroid initialization methods

  • k-means++
  • random partition
  • random sample
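
As an illustration of one of the methods above, random partition is commonly described as giving every sample a random cluster label and using the mean of each label's samples as the starting centroid. The sketch below follows that common description and may differ from what the crate does internally; random_partition_init and its pick parameter are hypothetical names, and pick stands in for a random number generator such as the rand crate.

```rust
/// Random-partition initialization (hypothetical helper, not the crate's API).
/// `pick` must return a cluster label in 0..k for each call; in practice it
/// would draw from a random number generator.
fn random_partition_init(
    samples: &[f64],
    dims: usize,
    k: usize,
    mut pick: impl FnMut() -> usize,
) -> Vec<f64> {
    let mut centroids = vec![0.0; k * dims];
    let mut counts = vec![0usize; k];

    // Sum the samples of each randomly chosen cluster.
    for s in samples.chunks_exact(dims) {
        let c = pick();
        counts[c] += 1;
        for (acc, v) in centroids[c * dims..(c + 1) * dims].iter_mut().zip(s) {
            *acc += v;
        }
    }

    // Turn sums into means; clusters that received no sample stay at zero.
    for (c, centroid) in centroids.chunks_exact_mut(dims).enumerate() {
        if counts[c] > 0 {
            for v in centroid.iter_mut() {
                *v /= counts[c] as f64;
            }
        }
    }
    centroids
}

fn main() {
    // Four two-dimensional samples in row-major order.
    let samples = [0.0, 0.0, 2.0, 2.0, 4.0, 4.0, 6.0, 6.0];

    // Deterministic stand-in for a random source: labels 0, 1, 0, 1.
    let mut n = 0usize;
    let centroids = random_partition_init(&samples, 2, 2, || {
        n += 1;
        (n - 1) % 2
    });
    println!("{:?}", centroids); // prints [2.0, 2.0, 4.0, 4.0]
}
```

Random sample, by contrast, would pick k existing samples directly as the starting centroids.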

Dependencies

~2MB
~40K SLoC