### 1 unstable release

0.1.0 | Nov 25, 2019

**#1115** in Algorithms

**MIT** license

**1.5MB** · 474 lines

# clusterphobia - A Rust Crate for People Who Fear Clustering

This crate is based on ideas and algorithms originally developed in C# in this repository:

https://github.com/paulchernoch/HilbertTransformation

For an understanding of the theory, see **SLASH.pdf** in the `clusterphobia` github repo:

https://github.com/paulchernoch/clusterphobia

## Project Goals

**Clusterphobia** is intended to provide clustering tools for *scalable*, *high-dimensional*, *unassisted*, *flat*, *single classification* of data.

- **Scalable**: A bedrock feature of the library is the use of the Hilbert Curve (from the **hilbert** crate) to speed up **k-nearest neighbor searches**. This algorithm is *linear* in the number of dimensions, whereas many others grow *exponentially* with the number of dimensions. Parts of the algorithm have running times proportional to N log N, where N is the number of points being clustered.
- **High-dimensional**: The algorithm will work with up to ten thousand dimensions.
- **Unassisted**: Clustering is performed on unlabeled data.
- **Flat**: The algorithms group items into flat, mutually exclusive sets of categories, not hierarchies.
- **Single classification**: Each item is assigned a single category. Items cannot be part of multiple categories.

Two main approaches will be used to cluster data:

- bottom-up, single-link agglomerative clustering
- bottom-up, density-based clustering

## Functional Areas

Once complete, the crate will have these main areas of functionality:

- **Data preparation** - Adjust raw data until the coordinate values are commensurate, non-negative integers. The adjustments include scaling, shifting and weighting each dimension. For documents modeled as bags-of-words, a novel, randomized transformation will be used. (Some of the transformations are supplied by the **hilbert** crate.)
- **Point representation** - Convert the prepared data into `Points` suitable for the clustering algorithms. (The `Point` structure is defined in the **hilbert** crate and employs an **optimized Euclidean distance** formula, critical for nearest neighbor searches.)
- **Linkage analysis** - The `Point` data is analyzed (using the **Hilbert curve**) and the linkage distance and density threshold are derived. These values are crucial for the clustering algorithms. (Two points separated by less than the **linkage distance** are assumed to belong to the same cluster. If a set number of points fall inside a region whose radius does not exceed the **density threshold**, the density is considered high enough to force a clustering.)
- **Cluster representation** - A `Clustering` holds one `Cluster` for each category in the partitioning scheme. It supports operations like adding points to clusters, recategorizing them, and merging clusters. The `Clustering` is the principal output of the clustering process.
- **Cluster Similarity** - To test and iteratively refine choices of dimensions, scaling factors, thresholds and tunable parameters, you need to be able to compare a perfect, assisted solution to the unassisted clustering. This crate provides an optimized implementation of the **B-Cubed** measure with a linear (not quadratic) runtime.
- **Clustering algorithms** - These algorithms will arrange a collection of `Points` into a `Clustering`.
- **Problem data** - The same test data sets show up in numerous papers, because they contain features that defeat many clustering algorithms, such as spirals, clusters with noise, or partial overlaps. The algorithms in this crate will be tuned to address as many of these problems as possible.
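To make the linkage idea above concrete, here is a hedged sketch (not the **hilbert** crate's actual API, and the function names are hypothetical): because the prepared coordinates are non-negative integers, nearest-neighbor comparisons can use the *squared* Euclidean distance, so no square root is ever taken.

```rust
// Squared Euclidean distance over non-negative integer coordinates.
// Using u64 subtraction ordered by magnitude avoids underflow.
fn distance_squared(a: &[u64], b: &[u64]) -> u64 {
    a.iter()
        .zip(b.iter())
        .map(|(&x, &y)| {
            let d = if x > y { x - y } else { y - x };
            d * d
        })
        .sum()
}

// The linkage rule described above: two points whose separation does not
// exceed the linkage distance are assumed to belong to the same cluster.
// Both sides of the comparison are squared, so the test stays in integers.
fn linked(a: &[u64], b: &[u64], linkage_distance_squared: u64) -> bool {
    distance_squared(a, b) <= linkage_distance_squared
}
```

The comparison against a pre-squared threshold is the whole trick: it keeps the inner loop of a nearest-neighbor search in integer arithmetic.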

## Currently Working

The following features are ready for use:

- Some data preparation transforms (from the **hilbert** crate), including `IntegerDataRange` and `FloatDataRange`.
- Hilbert Curve Transformation, permutations and sorting (from the **hilbert** crate).
- `Point` struct with optimized distance formula (from the **hilbert** crate).
- `Clustering` struct, which can be used to build and modify classification schemes.
- `BCubed` struct, which can represent a *similarity* measure and compute the similarity between two clusterings (essential for unit tests and tuning).

## Cluster Similarity

The literature is teeming with algorithms for cluster similarity. None handle all the edge cases perfectly. A measure that handles many common use cases well and is not too expensive to run is **B-Cubed**, named after initials taken from the names of its creators. Since it was first proposed, others have published modifications to it.

The B-Cubed measure was proposed in this 1998 paper:

[1] **A. Bagga and B. Baldwin**. *Entity-based cross-document coreferencing using the vector space model*. In Proceedings of the 36th Annual Meeting of the Association for Computational Linguistics - Volume 1, ACL ’98, pages 79–85, 1998.

The following paper compared many extrinsic cluster evaluation metrics and found **B-Cubed** the best according to four formal constraints:

- Cluster Homogeneity
- Cluster Completeness
- Rag Bag
- Cluster Size vs. Quantity

[2] **Enrique Amigó, Julio Gonzalo, Javier Artiles, Felisa Verdejo**. *A comparison of Extrinsic Clustering Evaluation Metrics based on Formal Constraints*. Departamento de Lenguajes y Sistemas Informáticos, UNED, Madrid, Spain, May 11, 2009.

The above is available here: http://nlp.uned.es/docs/amigo2007a.pdf

A subsequent paper identified a use case where **B-Cubed** fared poorly: unbalanced datasets where one cluster dominates:

[3] *Adapted B-CUBED Metrics to Unbalanced Datasets* by Jose G. Moreno and Gaël Dias,
both of Normandie University in France.

This third paper proposed a refined version of **B-Cubed**, but the refinement adds significantly to processing time, so it is not employed here. The definition of the algorithm used here is taken from section 2.1 of this last paper, which combines the Precision and Recall values into a single number using the **F-measure** formula (a harmonic average). (The refined version is in section 2.2.)

Symbols:

- 𝔽 = F-measure (final similarity measure)
- ℙ = Precision (a measure of homogeneity)
- ℝ = Recall (a measure of completeness)
- α = weighting factor (defaults to 0.5)
- ℕ = number of points
- k = number of categories (varies between the π and π* clusterings)
- i = category index
- πᵢ = cluster solution for the ith category
- π\*ᵢ = gold standard for the ith category
- g₀ = tests whether two items share the same category in the clustering
- g\*₀ = tests whether two items share the same category in the gold standard

Formulas:

$$\frac{1}{F} = \frac{\alpha}{P_{b^3}} + \frac{1 - \alpha}{R_{b^3}}$$

$$P_{b^3} = \frac{1}{N} \sum_{i=1}^{k} \frac{1}{|\pi_i|} \sum_{x_j \in \pi_i} \sum_{x_l \in \pi_i} g^*_0(x_j, x_l)$$

$$R_{b^3} = \frac{1}{N} \sum_{i=1}^{k} \frac{1}{|\pi^*_i|} \sum_{x_j \in \pi^*_i} \sum_{x_l \in \pi^*_i} g_0(x_j, x_l)$$

$$g_0(x_i, x_j) = \begin{cases} 1 & \text{if } \exists\, l : x_i \in \pi_l \wedge x_j \in \pi_l \\ 0 & \text{otherwise} \end{cases}$$

$$g^*_0(x_i, x_j) = \begin{cases} 1 & \text{if } \exists\, l : x_i \in \pi^*_l \wedge x_j \in \pi^*_l \\ 0 & \text{otherwise} \end{cases}$$
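A hedged, naive sketch of these formulas may help before the optimization is discussed. This is *not* the crate's optimized `BCubed` implementation; the function name and the label-slice representation of a clustering are assumptions for illustration only. It follows the definitions directly, which makes it quadratic in the number of points.

```rust
// Naive B-Cubed F-measure. Each clustering is given as one category label
// per item; `alpha` weights precision vs. recall (0.5 by default in the text).
fn bcubed_f(solution: &[usize], gold: &[usize], alpha: f64) -> f64 {
    let n = solution.len() as f64;
    let (mut precision, mut recall) = (0.0, 0.0);
    for i in 0..solution.len() {
        // Items sharing item i's cluster in the solution...
        let same_solution: Vec<usize> =
            (0..solution.len()).filter(|&j| solution[j] == solution[i]).collect();
        // ...of which this many also share its gold-standard category (g*).
        let hits = same_solution.iter().filter(|&&j| gold[j] == gold[i]).count();
        precision += hits as f64 / same_solution.len() as f64;

        // Symmetrically for recall, using the gold-standard clusters and g.
        let same_gold: Vec<usize> =
            (0..gold.len()).filter(|&j| gold[j] == gold[i]).collect();
        let hits = same_gold.iter().filter(|&&j| solution[j] == solution[i]).count();
        recall += hits as f64 / same_gold.len() as f64;
    }
    precision /= n;
    recall /= n;
    // Harmonic combination: 1/F = alpha/P + (1 - alpha)/R
    1.0 / (alpha / precision + (1.0 - alpha) / recall)
}
```

For the four-item example discussed below (everything in one cluster versus clusters A and B), precision works out to 0.5, recall to 1.0, and the F-measure with α = 0.5 to 2/3.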

If seeing the triply-nested summation symbols makes you balk and say, "that will take forever to run", that was my reaction as well. The inner two loops perform a sort of second moment computation, trying to see what proportion of items that share the same cluster according to one scheme share the matching clusters in the other scheme, and vice versa. Here is an example.

```
      1    2    3    4
      A    A    B    B
    ┍━━━━━━━━━━━━━━━━━━━━┓
 1 A│ ✓    ✓             │
 2 A│ ✓    ✓             │
 3 B│           ✓    ✓   │
 4 B│           ✓    ✓   │
    ┗━━━━━━━━━━━━━━━━━━━━┛
```

In this example:

- The first clustering has items 1,2,3,4 all in the same cluster.
- The second clustering puts 1&2 in cluster A and 3&4 in cluster B.
- The grid shows to which cluster in the second clustering each point goes; pairs of points that continue to be clustered together are marked with checks (✓).

Measuring the effect of the second clustering on the first clustering's items consists of counting the checkmarks in the grid. The full B-Cubed measure normalizes the count for each cluster by its cluster size.

The optimization that I found was this: if two items end up in the same cluster, there will be four check marks; if three match, there will be nine checks; and so on. Thus if you have a list of all the categories in the second clustering that points map into, and a count of how many items end up in each category, you merely need to sum the squares of those counts.

I figured out another trick - you can do this in a single pass! Keep a running sum of the squares of the counts in each category. If you see a count of one and increase it to two, add the difference between one and four, which is three. Going from two to three, the increase in the sum of squares is 9 - 4 = 5. The difference is always `2 * previous_count + 1`.

Anyway, that is how I reduced a quadratic algorithm to a linear one.
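The single-pass trick can be sketched in a few lines (an illustrative sketch, not the crate's internal code; the function name is an assumption):

```rust
use std::collections::HashMap;

// Maintain the sum of squares of per-category counts incrementally.
// When a category's count grows from c to c + 1, the sum of squares
// grows by (c + 1)^2 - c^2 = 2c + 1, so one pass over the data suffices.
fn sum_of_squared_counts(categories: &[u32]) -> u64 {
    let mut counts: HashMap<u32, u64> = HashMap::new();
    let mut sum_of_squares: u64 = 0;
    for &category in categories {
        let count = counts.entry(category).or_insert(0);
        sum_of_squares += 2 * *count + 1; // the 2 * previous_count + 1 increment
        *count += 1;
    }
    sum_of_squares
}
```

For the grid example, the four items map into categories A, A, B, B, giving counts of 2 and 2; the sum of squares is 2² + 2² = 8, matching the eight checkmarks in the grid.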

## Coming Soon...

The next functionality that will be implemented is **Linkage analysis**.

#### Dependencies

~9–20MB

~267K SLoC