## catclustering

Agglomerative Clustering For Categorical Data

### 3 unstable releases

 0.2.1 Jun 21, 2024 Jun 20, 2024 Jun 15, 2024

#777 in Algorithms

30KB
480 lines

# catclustering

This crate implements hierarchical agglomerative clustering optimized for categorical features and (approximated) complete linkage.

## Initialization Step

To avoid the computational inefficiency of calculating all pairwise distances for the adjacency matrix, the algorithm identifies neighbors by examining adjacent rows in the sorted dataset.

Since the order of the columns affects the outcome, the dataset is sorted multiple times, each time with a different random column order. This approach is roughly equivalent to identifying neighbors through random projections onto a one-dimensional line. The number of iterations can be controlled with the 'init_iterations' parameter.

This initial step makes this crate particularly well-suited for categorical data.

## Distance Function

To ensure the algorithm works correctly, the distance function must satisfy the following property:

``````d(x1 U x2, y1 U y2) >= d(x1, y1)
``````

for any cluster, x1, x2, y1, y2, including empty clusters.

This property is satisfied by the original complete-linkage distance.

However, because complete linkage is computationally expensive, we recommend the following approximation, which also exhibits the desired property:

``````|clusterset1| + |clusterset2| - |intersection(clusterset1, clusterset2)|
``````

`|clusterset1|` is what remains when you calculate the distance between a cluster and itself (`d(X, X) = |X| + |X| - |intersection(X, X)| = |X|`).

This is analogous to the original complete linkage, where `d(X, X)` equals X's intra-cluster distance when the cluster contains at least two distinct rows. From this perspective, `|clusterset1| - 1` can be interpreted as an approximation of the intra-cluster distance.

Note that functions like `|union(clusterset1, clusterset2)| - |intersection(clusterset1, clusterset2)|` do not satisfy `d(x1 U x2, y1 U y2) >= d(x1, y1) ` and would not work correctly in this context.

### Distance Implementation

The fastest way to implement set operations is by using bit masks, provided the number of unique categories fits into the masks. You can combine bit masks and hashsets or roaring bitmaps to fit your needs.

operation \ data structure bitmasks hashsets roaring bitmaps
cardinality of intersection (x&y).count_ones() x.intersection(&y).count() x.intersection_len(&y)
cardinality of union (x|y).count_ones() x.union(&y).count() x.union_len(&y)
cardinality of symmetric difference (x^y).count_ones() x.symmetric_difference(&y).count() x.symmetric_difference_len(&y)
merge x |= y x.extend(&y) x.extend(&y)

## Example

Here is an example for a dataset that comprises 4 low-cardinality columns and 1 high-cardinality column.

``````use std::vec::Vec;
use std::collections::HashSet;
use std::any::Any;
use rand::Rng;

struct SimpleMatrix {
col1to4: u32,
col5: HashSet<u16>
}

struct MyData {
vecs: Vec<Vec<i32>>
}

impl catclustering::ClusterSummary for SimpleMatrix {
fn summary_size(&self) -> usize {
self.col1to4.count_ones() as usize + self.col5.len()
}
fn distance(&self, other: &dyn catclustering::ClusterSummary) -> f32 {
let o = other.as_any().downcast_ref::<SimpleMatrix>().unwrap();
let intersection = (self.col1to4 & o.col1to4).count_ones() as usize + self.col5.intersection(&o.col5).count();

(self.summary_size() + other.summary_size()) as f32 - intersection as f32
}

fn extend(&mut self, other: &dyn catclustering::ClusterSummary) {
let o = other.as_any().downcast_ref::<SimpleMatrix>().unwrap();

self.col1to4 |= o.col1to4;
self.col5.extend(&o.col5);
}
fn clear(&mut self) {
self.col5.clear();
self.col5.shrink_to_fit();
}
fn as_any(&self) -> &dyn Any {
self
}
}

impl catclustering::IndexableData for MyData {
fn get_value(&self, row_index: usize, column_index: usize) -> f32 {
self.vecs[row_index][column_index] as f32
}

fn get_num_columns(&self) -> usize {
self.vecs[0].len()
}

fn get_num_rows(&self) -> usize {
self.vecs.len()
}

fn create_cluster_summary(&self, row_index: usize) -> Box<dyn catclustering::ClusterSummary> {
let row = &self.vecs[row_index];
Box::new(SimpleMatrix {
col1to4: (1 << row[0])
| (1 << (row[1] + 8))
| (1 << (row[2] + 16))
| (1 << (row[3] + 24)),
col5: HashSet::from_iter(vec![row[4] as u16]),
})
}
}

/// Just a utility function to create a random data set
fn create_random_matrix(rows: usize, cardinality: [i32; 5]) -> Vec<Vec<i32>> {
let mut matrix = Vec::with_capacity(rows);

for _ in 0..rows {
let row: Vec<i32> = (0..cardinality.len())
.map(|k| rng.gen_range(0..1000) % cardinality[k])
.collect();
matrix.push(row);
}

matrix
}

fn main() {
let matrix = MyData{vecs: create_random_matrix(10_000, [8, 8, 8, 8, 2000])};

// main algorithm
let dendro = catclustering::create_dendrogram(&matrix, None, &mut rng);

// more interpretable results
let clusters = catclustering::find_clusters(&dendro, 100);

// or, if you just want the assignments:
let mut assignments = Vec::new(); // feel free to reuse this vector
let num_clusters = catclustering::assign_rows_to_clusters(&dendro, &mut assignments, 100);
}
``````

## Benchmarks

The benchmarks are run from the example above.

rows time
10,000 2.5 s
100,000 29 s
1,000,000 4 min
10,000,000 46 min

Specs:

• i7-6700 @ 3.40GHz
• Linux pop-os 6.5.4
• rustc 1.76.0

~315KB