similarity

A comprehensive Rust library for calculating similarity metrics between vectors, collections, and spectral data with both functional and trait-based APIs

4 releases

new 0.2.0 Jul 2, 2025
0.1.2 May 11, 2024
0.1.1 Feb 11, 2024
0.1.0 Jan 22, 2024

#2523 in Algorithms

182 downloads per month

MIT license

71KB
1K SLoC

Similarity

A comprehensive Rust library for calculating similarity metrics between vectors, collections, and spectral data. Features both functional and trait-based APIs with optional parallel processing and FFT optimizations.

Features

  • Semantically Correct Trait System: Separate traits for comparing entities (Similarity), analyzing a single entity (EntropyMeasure), and preprocessing data (DataTransform)
  • Zero-Cost Abstractions: Trait calls compile to direct function calls
  • Extensive Metric Coverage: Distance, similarity, correlation, and entropy measures
  • Spectral Data Support: Specialized functions for mass spectrometry and signal processing
  • Performance Optimizations: Parallel processing and FFT-based algorithms
  • Feature Gates: Optional dependencies for parallel and FFT features

Architecture

The library is organized into three main trait categories:

1. Similarity<InputType, OutputType>

For comparing multiple entities and computing similarity or distance metrics:

  • Cosine similarity/distance
  • Euclidean distance
  • Pearson correlation distance
  • Jaccard index for sets
  • Cross-correlation and time shift detection
  • Hit rate and overshoot rate for predictions
  • Entropy similarity between spectra
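
To make three of the metrics above concrete, here is a minimal standard-library sketch of cosine similarity, Euclidean distance, and the Jaccard index. The function names, the `Option` return type for mismatched lengths, and the convention for two empty sets are assumptions made for illustration; they are not the crate's own implementation or signatures.

```rust
use std::collections::HashSet;

/// Cosine similarity: dot(a, b) / (|a| * |b|).
/// Returns None for mismatched lengths or a zero-norm vector.
fn cosine_similarity_sketch(a: &[f64], b: &[f64]) -> Option<f64> {
    if a.len() != b.len() {
        return None;
    }
    let dot: f64 = a.iter().zip(b).map(|(x, y)| x * y).sum();
    let norm_a = a.iter().map(|x| x * x).sum::<f64>().sqrt();
    let norm_b = b.iter().map(|x| x * x).sum::<f64>().sqrt();
    if norm_a == 0.0 || norm_b == 0.0 {
        return None;
    }
    Some(dot / (norm_a * norm_b))
}

/// Euclidean distance: square root of the summed squared differences.
fn euclidean_distance_sketch(a: &[f64], b: &[f64]) -> Option<f64> {
    if a.len() != b.len() {
        return None;
    }
    let sum_sq: f64 = a.iter().zip(b).map(|(x, y)| (x - y).powi(2)).sum();
    Some(sum_sq.sqrt())
}

/// Jaccard index: |A ∩ B| / |A ∪ B|.
/// Two empty sets are treated as identical (1.0), a convention chosen here.
fn jaccard_index_sketch(a: &HashSet<i32>, b: &HashSet<i32>) -> f64 {
    let union = a.union(b).count();
    if union == 0 {
        return 1.0;
    }
    a.intersection(b).count() as f64 / union as f64
}
```

For parallel vectors such as [1.0, 2.0, 3.0] and [2.0, 4.0, 6.0] the cosine similarity is 1 up to rounding, and the sets {1, 2, 3} and {2, 3, 4} share two of four distinct elements, giving a Jaccard index of 0.5.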

2. EntropyMeasure<InputType, OutputType>

For analyzing single entities with information-theoretic measures:

  • Shannon entropy
  • Tsallis entropy with parameter q
  • Both standard and optimized implementations
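
As a reference for what these measures compute, the sketch below implements the textbook definitions over a list of peak intensities: Shannon entropy H = -Σ p·ln(p) and Tsallis entropy S_q = (1 - Σ p^q) / (q - 1), where p are the intensities normalized to sum to 1. The function names, the use of the natural logarithm, and the handling of empty input are assumptions for illustration; the crate's own types (for example Spectrum) are deliberately not used here.

```rust
/// Normalizes intensities so they sum to 1; empty if the total is not positive.
fn normalize_sketch(intensities: &[f64]) -> Vec<f64> {
    let total: f64 = intensities.iter().sum();
    if total <= 0.0 {
        return Vec::new();
    }
    intensities.iter().map(|i| i / total).collect()
}

/// Shannon entropy H = -sum(p * ln p), skipping zero-probability terms.
fn shannon_entropy_sketch(intensities: &[f64]) -> f64 {
    normalize_sketch(intensities)
        .iter()
        .filter(|&&p| p > 0.0)
        .map(|&p| -p * p.ln())
        .sum()
}

/// Tsallis entropy S_q = (1 - sum(p^q)) / (q - 1).
/// As q approaches 1 this converges to Shannon entropy, so that case is delegated.
fn tsallis_entropy_sketch(intensities: &[f64], q: f64) -> f64 {
    if (q - 1.0).abs() < 1e-12 {
        return shannon_entropy_sketch(intensities);
    }
    let p = normalize_sketch(intensities);
    if p.is_empty() {
        return 0.0;
    }
    let sum_pq: f64 = p.iter().filter(|&&p| p > 0.0).map(|&p| p.powf(q)).sum();
    (1.0 - sum_pq) / (q - 1.0)
}
```

For intensities 0.6 and 0.4 with q = 2, the Tsallis value is 1 - (0.36 + 0.16) = 0.48.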

3. DataTransform<InputType, OutputType>

For preprocessing and transforming data:

  • Weight factor transformation for spectral data
  • Optimized implementations for large datasets
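
The README does not state the formula behind the weight factor transformation. One weighting commonly used for mass spectra scales each peak as mz^a · intensity^b, and the sketch below assumes that form purely for illustration. The function name, the parameter names `mz_power` and `intensity_power`, and their order are hypothetical and may not match what WeightFactorTransformation actually does.

```rust
/// Assumed weighting: w_i = mz_i^mz_power * intensity_i^intensity_power.
/// Returns None when the two slices differ in length.
fn weight_factor_sketch(
    mzs: &[f64],
    intensities: &[f64],
    mz_power: f64,
    intensity_power: f64,
) -> Option<Vec<f64>> {
    if mzs.len() != intensities.len() {
        return None;
    }
    Some(
        mzs.iter()
            .zip(intensities)
            .map(|(mz, intensity)| mz.powf(mz_power) * intensity.powf(intensity_power))
            .collect(),
    )
}
```

Under this assumed formula, a peak at m/z 100 with intensity 0.5, an m/z power of 0.5, and an intensity power of 2.0 gets a weight of 10 · 0.25 = 2.5.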

Quick Start

Add to your Cargo.toml:

[dependencies]
similarity = "0.2.0"

Trait-Based API Examples

use similarity::*;
use similarity::similarity_traits::*;
use similarity::entropy_traits::*;
use similarity::transform_traits::*;
use std::collections::HashSet;

// Similarity between vectors
let a = [1.0, 2.0, 3.0];
let b = [2.0, 4.0, 6.0];
let cosine_sim = CosineSimilarity::similarity((&a, &b));
let euclidean_dist = EuclideanDistance::similarity((&a, &b));

// Set similarity
let set1: HashSet<i32> = HashSet::from([1, 2, 3]);
let set2: HashSet<i32> = HashSet::from([2, 3, 4]);
let jaccard = JaccardIndex::similarity((&set1, &set2));

// Entropy of spectral data
let spectrum = Spectrum::from_peaks(vec![
    Peak { mz: 100.0, intensity: 0.6 },
    Peak { mz: 200.0, intensity: 0.4 },
]);
let shannon_entropy = ShannonEntropy::entropy(&spectrum);
let tsallis_entropy = TsallisEntropy::entropy((&spectrum, 2.0));

// Data transformation
let mzs = [100.0, 200.0, 300.0];
let intensities = [0.5, 0.3, 0.2];
let transformed = WeightFactorTransformation::transform((&mzs, &intensities, 0.5, 2.0));

Functional API (Still Available)

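use similarity::*;

// Same inputs as in the trait-based example
let a = [1.0, 2.0, 3.0];
let b = [2.0, 4.0, 6.0];
let spectrum = Spectrum::from_peaks(vec![Peak { mz: 100.0, intensity: 0.6 }, Peak { mz: 200.0, intensity: 0.4 }]);

// All original functions remain available
let cosine_sim = cosine_similarity(&a, &b);
let euclidean_dist = euclidean_distance(&a, &b);
let entropy = calculate_entropy(&spectrum);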

Performance Features

Optional Dependencies

[dependencies]
similarity = { version = "0.2.0", features = ["parallel", "fft"] }

  • parallel: Enables Rayon-based parallel processing for large datasets
  • fft: Enables FFT-based optimizations for cross-correlation and convolution
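
Feature gates like these are typically wired up with `#[cfg(feature = "...")]` attributes. The sketch below shows the general pattern with a hypothetical `sum_of_squares` helper that uses Rayon only when a `parallel` feature is enabled and otherwise falls back to a serial loop. How the crate itself arranges its gated code is not shown in this README, so treat this as an illustration of the mechanism only.

```rust
/// Parallel variant, compiled only when the `parallel` feature is enabled
/// (and `rayon` is declared as an optional dependency).
#[cfg(feature = "parallel")]
fn sum_of_squares(values: &[f64]) -> f64 {
    use rayon::prelude::*;
    values.par_iter().map(|v| v * v).sum()
}

/// Serial fallback, compiled when the `parallel` feature is disabled.
#[cfg(not(feature = "parallel"))]
fn sum_of_squares(values: &[f64]) -> f64 {
    values.iter().map(|v| v * v).sum()
}
```

Callers use `sum_of_squares` the same way in both configurations; only the implementation that gets compiled differs.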

Performance Comparison

The trait-based API is designed to add no runtime overhead: trait calls compile to direct function calls.

// These are equivalent in performance:
let result1 = cosine_similarity(&a, &b);
let result2 = CosineSimilarity::similarity((&a, &b));

For large datasets (10K+ elements), use the optimized variants:

let large_a: Vec<f64> = (0..100_000).map(|i| i as f64).collect();
let large_b: Vec<f64> = (0..100_000).map(|i| i as f64 * 1.1).collect();

// 2-3x faster for large vectors
let result_optimized = CosineSimilarityOptimized::similarity((&large_a, &large_b));

// Even faster with parallel processing (Rayon-based; see the "parallel" feature above)
let result_parallel = CosineSimilarityParallel::similarity((&large_a, &large_b));

Examples

Run the comprehensive demo:

cargo run --example trait_demo

This demonstrates all available trait implementations with:

  • Similarity and distance metrics
  • Entropy measures for spectral data
  • Data transformations
  • Performance comparisons
  • FFT-optimized operations

API Documentation

Trait Definitions

// For comparing two entities
pub trait Similarity<InputType, OutputType> {
    fn similarity(input: InputType) -> OutputType;
}

// For analyzing single entities  
pub trait EntropyMeasure<InputType, OutputType> {
    fn entropy(input: InputType) -> OutputType;
}

// For transforming data
pub trait DataTransform<InputType, OutputType> {
    fn transform(input: InputType) -> OutputType;
}
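
To show how a trait of this shape can be implemented, the sketch below repeats the Similarity definition locally and implements it for a hypothetical `ManhattanDistance` marker type. That type is not part of the crate, and the tuple-of-slices input and `Option<f64>` output are assumptions chosen for the example.

```rust
// Local copy of the trait shape shown above, so the sketch is self-contained.
pub trait Similarity<InputType, OutputType> {
    fn similarity(input: InputType) -> OutputType;
}

/// Hypothetical metric for illustration: Manhattan (L1) distance.
pub struct ManhattanDistance;

impl<'a> Similarity<(&'a [f64], &'a [f64]), Option<f64>> for ManhattanDistance {
    /// Sums absolute element-wise differences; None on length mismatch.
    fn similarity((a, b): (&'a [f64], &'a [f64])) -> Option<f64> {
        if a.len() != b.len() {
            return None;
        }
        Some(a.iter().zip(b).map(|(x, y)| (x - y).abs()).sum::<f64>())
    }
}
```

Because `similarity` is an associated function on a unit struct, a call such as `ManhattanDistance::similarity((&a[..], &b[..]))` is resolved statically at compile time, with no trait object or dynamic dispatch involved.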

Available Implementations

Similarity Traits:

  • CosineSimilarity, CosineSimilarityOptimized, CosineSimilarityParallel
  • CosineDistance, CosineDistanceOptimized, CosineDistanceParallel
  • EuclideanDistance, SquaredEuclideanDistance
  • PearsonCorrelationDistance, PearsonCorrelationDistanceOptimized, PearsonCorrelationDistanceParallel
  • JaccardIndex
  • HitRate, OvershootRate
  • CrossCorrelationOptimized, CrossCorrelationParallel, CrossCorrelationFFTOptimized
  • TimeShiftFinder, TimeShiftFinderFFT
  • EntropySimilarity, EntropySimilarityOptimized
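
For orientation on the cross-correlation and time shift entries, the sketch below finds the lag that maximizes the direct (non-FFT) cross-correlation of two signals. The function name, the `max_lag` parameter, and the sign convention (a positive lag means the second signal is delayed) are assumptions for illustration; the behavior of TimeShiftFinder itself is not documented in this README. An FFT-based approach computes the same correlation values in O(n log n) rather than O(n · max_lag).

```rust
/// Returns the lag in -max_lag..=max_lag that maximizes sum_i a[i] * b[i + lag].
/// A positive result means `b` looks like `a` delayed by that many samples.
fn best_lag_sketch(a: &[f64], b: &[f64], max_lag: usize) -> isize {
    let max_lag = max_lag as isize;
    let mut best = (0isize, f64::NEG_INFINITY);
    for lag in -max_lag..=max_lag {
        let mut score = 0.0;
        for (i, &x) in a.iter().enumerate() {
            let j = i as isize + lag;
            // Only accumulate where the shifted index falls inside `b`.
            if j >= 0 && (j as usize) < b.len() {
                score += x * b[j as usize];
            }
        }
        // Strict comparison keeps the earliest lag on ties.
        if score > best.1 {
            best = (lag, score);
        }
    }
    best.0
}
```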

Entropy Traits:

  • ShannonEntropy, ShannonEntropyOptimized
  • TsallisEntropy, TsallisEntropyOptimized

Transform Traits:

  • WeightFactorTransformation, WeightFactorTransformationOptimized

License

This project is licensed under the MIT License - see the LICENSE file for details.

Contributing

Contributions are welcome. Please feel free to submit a Pull Request. For major changes, please open an issue first to discuss what you would like to change.

Dependencies

~0.5–1.5MB
~26K SLoC