4 releases

| Version | Released |
|---|---|
| 0.2.0 (new) | Jul 2, 2025 |
| 0.1.2 | May 11, 2024 |
| 0.1.1 | Feb 11, 2024 |
| 0.1.0 | Jan 22, 2024 |
Similarity
A comprehensive Rust library for calculating similarity metrics between vectors, collections, and spectral data. Features both functional and trait-based APIs with optional parallel processing and FFT optimizations.
Features
- Semantically Correct Trait System: Separate traits for different types of calculations
- Zero-Cost Abstractions: Trait calls compile to direct function calls
- Extensive Metric Coverage: Distance, similarity, correlation, and entropy measures
- Spectral Data Support: Specialized functions for mass spectrometry and signal processing
- Performance Optimizations: Parallel processing and FFT-based algorithms
- Feature Gates: Optional dependencies for parallel and FFT features
Architecture
The library is organized into three main trait categories:
1. Similarity<InputType, OutputType>
For comparing multiple entities and computing similarity or distance metrics:
- Cosine similarity/distance
- Euclidean distance
- Pearson correlation distance
- Jaccard index for sets
- Cross-correlation and time shift detection
- Hit rate and overshoot rate for predictions
- Entropy similarity between spectra
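The vector metrics in this list follow standard textbook formulas. The sketch below is a dependency-free illustration of three of them, not the crate's implementation: the function names are hypothetical, returning `Option` for invalid input is a choice made here, and defining Pearson correlation distance as `1 - r` is an assumption about the convention used.

```rust
/// Dot product of two slices (the caller checks that lengths match).
fn dot(a: &[f64], b: &[f64]) -> f64 {
    a.iter().zip(b).map(|(x, y)| x * y).sum()
}

/// Arithmetic mean of a non-empty slice.
fn mean(v: &[f64]) -> f64 {
    v.iter().sum::<f64>() / v.len() as f64
}

/// Cosine similarity: dot(a, b) / (|a| * |b|).
/// Returns None for mismatched lengths or a zero-norm input.
fn cosine_similarity_sketch(a: &[f64], b: &[f64]) -> Option<f64> {
    if a.len() != b.len() {
        return None;
    }
    let norm = (dot(a, a) * dot(b, b)).sqrt();
    if norm == 0.0 {
        None
    } else {
        Some(dot(a, b) / norm)
    }
}

/// Euclidean distance: square root of the summed squared differences.
fn euclidean_distance_sketch(a: &[f64], b: &[f64]) -> Option<f64> {
    if a.len() != b.len() {
        return None;
    }
    let sum_sq: f64 = a.iter().zip(b).map(|(x, y)| (x - y).powi(2)).sum();
    Some(sum_sq.sqrt())
}

/// Pearson correlation distance, assumed here to be 1 - r, where r is the
/// cosine similarity of the mean-centered inputs.
fn pearson_distance_sketch(a: &[f64], b: &[f64]) -> Option<f64> {
    if a.len() != b.len() || a.is_empty() {
        return None;
    }
    let (ma, mb) = (mean(a), mean(b));
    let ca: Vec<f64> = a.iter().map(|x| x - ma).collect();
    let cb: Vec<f64> = b.iter().map(|y| y - mb).collect();
    cosine_similarity_sketch(&ca, &cb).map(|r| 1.0 - r)
}
```

For `a = [1.0, 2.0, 3.0]` and `b = [2.0, 4.0, 6.0]`, the two vectors point in the same direction, so the cosine similarity is 1 and the Pearson distance (under the `1 - r` assumption) is 0.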
2. EntropyMeasure<InputType, OutputType>
For analyzing single entities with information-theoretic measures:
- Shannon entropy
- Tsallis entropy with parameter q
- Both standard and optimized implementations
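As a reference point, Shannon entropy is H = -Σ pᵢ ln pᵢ and Tsallis entropy is S_q = (1 - Σ pᵢ^q) / (q - 1), which approaches Shannon entropy as q approaches 1. The sketch below works on a plain slice of intensities rather than the crate's `Spectrum` type. The function names are hypothetical, and the normalization step and the use of the natural logarithm are assumptions; the crate may choose differently.

```rust
/// Scale intensities so they sum to 1. Returns an empty vector when the
/// total is not positive.
fn normalize(intensities: &[f64]) -> Vec<f64> {
    let total: f64 = intensities.iter().sum();
    if total <= 0.0 {
        return Vec::new();
    }
    intensities.iter().map(|i| i / total).collect()
}

/// Shannon entropy in nats: H = -sum(p * ln p), skipping zero probabilities.
fn shannon_entropy_sketch(intensities: &[f64]) -> f64 {
    let probs = normalize(intensities);
    let sum: f64 = probs
        .iter()
        .filter(|p| **p > 0.0)
        .map(|p| p * p.ln())
        .sum();
    -sum
}

/// Tsallis entropy: S_q = (1 - sum(p^q)) / (q - 1).
/// Falls back to Shannon entropy when q is numerically 1 (the limiting case).
fn tsallis_entropy_sketch(intensities: &[f64], q: f64) -> f64 {
    if (q - 1.0).abs() < 1e-12 {
        return shannon_entropy_sketch(intensities);
    }
    let probs = normalize(intensities);
    let sum_pq: f64 = probs.iter().map(|p| p.powf(q)).sum();
    (1.0 - sum_pq) / (q - 1.0)
}
```

With intensities `[0.6, 0.4]` and `q = 2.0`, the Tsallis value is `1 - (0.36 + 0.16) = 0.48`.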
3. DataTransform<InputType, OutputType>
For preprocessing and transforming data:
- Weight factor transformation for spectral data
- Optimized implementations for large datasets
Quick Start
Add to your Cargo.toml:
[dependencies]
similarity = "0.2.0"
Trait-Based API Examples
use similarity::*;
use similarity::similarity_traits::*;
use similarity::entropy_traits::*;
use similarity::transform_traits::*;
use std::collections::HashSet;
// Similarity between vectors
let a = [1.0, 2.0, 3.0];
let b = [2.0, 4.0, 6.0];
let cosine_sim = CosineSimilarity::similarity((&a, &b));
let euclidean_dist = EuclideanDistance::similarity((&a, &b));
// Set similarity
let mut set1 = HashSet::new();
set1.extend([1, 2, 3]);
let mut set2 = HashSet::new();
set2.extend([2, 3, 4]);
let jaccard = JaccardIndex::similarity((&set1, &set2));
// Entropy of spectral data
let spectrum = Spectrum::from_peaks(vec![
    Peak { mz: 100.0, intensity: 0.6 },
    Peak { mz: 200.0, intensity: 0.4 },
]);
let shannon_entropy = ShannonEntropy::entropy(&spectrum);
let tsallis_entropy = TsallisEntropy::entropy((&spectrum, 2.0));
// Data transformation
let mzs = [100.0, 200.0, 300.0];
let intensities = [0.5, 0.3, 0.2];
let transformed = WeightFactorTransformation::transform((&mzs, &intensities, 0.5, 2.0));
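To make the set example above concrete, the Jaccard index is the size of the intersection divided by the size of the union. The following is a minimal standalone sketch using only the standard library; `jaccard_index_sketch` is a hypothetical name, and returning 1.0 for two empty sets is an assumed convention that the crate may not share.

```rust
use std::collections::HashSet;
use std::hash::Hash;

/// Jaccard index: |A ∩ B| / |A ∪ B|.
/// Two empty sets are treated as identical (1.0); this is an assumption.
fn jaccard_index_sketch<T: Eq + Hash>(a: &HashSet<T>, b: &HashSet<T>) -> f64 {
    let union = a.union(b).count();
    if union == 0 {
        return 1.0;
    }
    a.intersection(b).count() as f64 / union as f64
}
```

For the sets `{1, 2, 3}` and `{2, 3, 4}` the intersection has 2 elements and the union has 4, giving 0.5.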
Functional API (Still Available)
use similarity::*;
// All original functions remain available.
// `a`, `b`, and `spectrum` are the values defined in the trait-based example above.
let cosine_sim = cosine_similarity(&a, &b);
let euclidean_dist = euclidean_distance(&a, &b);
let entropy = calculate_entropy(&spectrum);
Performance Features
Optional Dependencies
[dependencies]
similarity = { version = "0.2.0", features = ["parallel", "fft"] }
- parallel: Enables Rayon-based parallel processing for large datasets
- fft: Enables FFT-based optimizations for cross-correlation and convolution
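To show what the FFT feature accelerates, here is a direct (non-FFT) cross-correlation used to find the lag at which two signals align best. This is a sketch under stated assumptions, not the crate's algorithm: the function names, the sign convention for the lag, and the `max_lag` parameter are all hypothetical. The direct approach costs on the order of `n * max_lag` multiplications, which is the work an FFT-based variant would reduce.

```rust
/// Cross-correlation at one integer lag: the sum of a[i] * b[i + lag]
/// over the indices where both signals are defined.
fn cross_correlation_at(a: &[f64], b: &[f64], lag: isize) -> f64 {
    a.iter()
        .enumerate()
        .filter_map(|(i, x)| {
            let j = i as isize + lag;
            if j >= 0 && (j as usize) < b.len() {
                Some(x * b[j as usize])
            } else {
                None
            }
        })
        .sum()
}

/// Returns the lag in [-max_lag, max_lag] with the highest correlation.
/// A positive result means `b` is a delayed copy of `a` (assumed convention).
/// Ties resolve to the smallest lag.
fn find_time_shift_sketch(a: &[f64], b: &[f64], max_lag: usize) -> isize {
    let max = max_lag as isize;
    let mut best = (0isize, f64::NEG_INFINITY);
    for lag in -max..=max {
        let score = cross_correlation_at(a, b, lag);
        if score > best.1 {
            best = (lag, score);
        }
    }
    best.0
}
```

For example, a pulse at index 1 in `a` and the same pulse at index 3 in `b` yields a shift of 2.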
Performance Comparison
The trait-based API has zero performance overhead compared to direct function calls:
// These are equivalent in performance:
let result1 = cosine_similarity(&a, &b);
let result2 = CosineSimilarity::similarity((&a, &b));
For large datasets (10K+ elements), use the optimized variants:
let large_a: Vec<f64> = (0..100_000).map(|i| i as f64).collect();
let large_b: Vec<f64> = (0..100_000).map(|i| i as f64 * 1.1).collect();
// 2-3x faster for large vectors
let result = CosineSimilarityOptimized::similarity((&large_a, &large_b));
// Even faster with parallel processing
let result = CosineSimilarityParallel::similarity((&large_a, &large_b));
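The idea behind a parallel variant can be illustrated without any dependencies: split both vectors into chunks, compute the three partial sums (a·b, a·a, b·b) per chunk on separate threads, then combine them. The crate's `parallel` feature is described as Rayon-based; the sketch below deliberately swaps in `std::thread::scope` so it compiles on its own. The function name and the `n_threads` parameter are hypothetical, and no speed-up figure is implied.

```rust
use std::thread;

/// Cosine similarity with per-chunk partial sums computed on scoped threads.
/// Returns None for mismatched lengths, empty input, or a zero-norm vector.
fn parallel_cosine_sketch(a: &[f64], b: &[f64], n_threads: usize) -> Option<f64> {
    if a.len() != b.len() || a.is_empty() {
        return None;
    }
    let n = n_threads.max(1);
    // Ceiling division so every element falls into some chunk.
    let chunk = (a.len() + n - 1) / n;

    let (ab, aa, bb) = thread::scope(|s| {
        // Spawn one worker per chunk pair; each returns (a·b, a·a, b·b).
        let handles: Vec<_> = a
            .chunks(chunk)
            .zip(b.chunks(chunk))
            .map(|(ca, cb)| {
                s.spawn(move || {
                    ca.iter().zip(cb).fold(
                        (0.0_f64, 0.0_f64, 0.0_f64),
                        |(ab, aa, bb), (x, y)| (ab + x * y, aa + x * x, bb + y * y),
                    )
                })
            })
            .collect();

        // Join all workers and add up their partial sums.
        handles
            .into_iter()
            .map(|h| h.join().expect("worker thread panicked"))
            .fold((0.0_f64, 0.0_f64, 0.0_f64), |acc, p| {
                (acc.0 + p.0, acc.1 + p.1, acc.2 + p.2)
            })
    });

    let norm = (aa * bb).sqrt();
    if norm == 0.0 {
        None
    } else {
        Some(ab / norm)
    }
}
```

Because floating-point addition is not associative, a chunked sum can differ from a sequential one in the last few bits, so results should be compared with a tolerance rather than for exact equality.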
Examples
Run the comprehensive demo:
cargo run --example trait_demo
This demonstrates all available trait implementations with:
- Similarity and distance metrics
- Entropy measures for spectral data
- Data transformations
- Performance comparisons
- FFT-optimized operations
API Documentation
Trait Definitions
// For comparing two entities
pub trait Similarity<InputType, OutputType> {
    fn similarity(input: InputType) -> OutputType;
}
// For analyzing single entities
pub trait EntropyMeasure<InputType, OutputType> {
    fn entropy(input: InputType) -> OutputType;
}
// For transforming data
pub trait DataTransform<InputType, OutputType> {
    fn transform(input: InputType) -> OutputType;
}
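Because these traits use associated functions with no `self`, a metric is typically a zero-sized marker type and the call resolves statically. The sketch below re-declares the `Similarity` trait locally so that it compiles without the crate, then implements it for a hypothetical `ManhattanDistance` type, which is not among the crate's listed implementations. The `score` helper is likewise an illustration of generic static dispatch, not part of the documented API.

```rust
// Local copy of the trait shape shown above, so the sketch is standalone.
pub trait Similarity<InputType, OutputType> {
    fn similarity(input: InputType) -> OutputType;
}

/// Hypothetical metric used only for illustration.
pub struct ManhattanDistance;

impl<'a> Similarity<(&'a [f64], &'a [f64]), f64> for ManhattanDistance {
    /// Sum of absolute element-wise differences.
    fn similarity((a, b): (&'a [f64], &'a [f64])) -> f64 {
        a.iter().zip(b).map(|(x, y)| (x - y).abs()).sum()
    }
}

/// Generic helper: the metric is chosen at compile time through `M`,
/// so no trait object or dynamic dispatch is involved.
pub fn score<M>(a: &[f64], b: &[f64]) -> f64
where
    M: for<'a> Similarity<(&'a [f64], &'a [f64]), f64>,
{
    M::similarity((a, b))
}
```

A call such as `score::<ManhattanDistance>(&[1.0, 2.0, 3.0], &[2.0, 4.0, 6.0])` evaluates to `1 + 2 + 3 = 6`.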
Available Implementations
Similarity Traits:
- CosineSimilarity, CosineSimilarityOptimized, CosineSimilarityParallel
- CosineDistance, CosineDistanceOptimized, CosineDistanceParallel
- EuclideanDistance, SquaredEuclideanDistance
- PearsonCorrelationDistance, PearsonCorrelationDistanceOptimized, PearsonCorrelationDistanceParallel
- JaccardIndex
- HitRate, OvershootRate
- CrossCorrelationOptimized, CrossCorrelationParallel, CrossCorrelationFFTOptimized
- TimeShiftFinder, TimeShiftFinderFFT
- EntropySimilarity, EntropySimilarityOptimized
Entropy Traits:
- ShannonEntropy, ShannonEntropyOptimized
- TsallisEntropy, TsallisEntropyOptimized
Transform Traits:
- WeightFactorTransformation, WeightFactorTransformationOptimized
License
This project is licensed under the MIT License - see the LICENSE file for details.
Contributing
Contributions are welcome. Please feel free to submit a Pull Request. For major changes, please open an issue first to discuss what you would like to change.