1 unstable release

0.1.0 Dec 28, 2020

MIT license

25KB
529 lines

MASS: Mueen's Algorithm for Similarity Search in Rust!

Similarity search for time series subsequences is THE most important subroutine for time series pattern mining. Subsequence similarity search has been scaled to trillions of observations under both DTW (Dynamic Time Warping) and Euclidean distances [a]. The algorithms are ultra fast and efficient. The key technique that makes the algorithms useful is the Early Abandoning technique [b,e], known since 1994. However, the algorithms lack a few properties that are useful for many time series data mining algorithms.

  1. Early abandoning depends on the dataset. The worst case complexity is still O(nm) where n is the length of the larger time series and m is the length of the short query.
  2. The algorithm can produce only the most similar subsequence to the query; it cannot produce the Distance Profile to all the subsequences given the query.

MASS is an algorithm to create the Distance Profile of a query to a long time series. In this page we share code for The Fastest Similarity Search Algorithm for Time Series Subsequences under Euclidean Distance. Early abandoning can occasionally beat this algorithm on some datasets for some queries, but this algorithm is independent of data and query. The underlying concept of the algorithm has long been known to the signal processing community. We have used it for the first time on time series subsequence search under z-normalization. The algorithm was used as a subroutine in our papers [c,d] and the code is given below.

  1. The algorithm has an overall time complexity of O(n log n) which does not depend on datasets and is the lower bound of similarity search over time series subsequences.
  2. The algorithm produces all of the distances from the query to the subsequences of a long time series. In our recent paper, we generalize the usage of the distance profiles calculated using MASS in finding motifs, shapelets and discords.
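
To make the term "distance profile" concrete, here is a minimal sketch in plain Rust. It is not this crate's implementation, and the function names are hypothetical. The first function computes the z-normalized Euclidean distance from the query to every window in O(nm). The second derives the same values from sliding dot products, which is the quantity MASS obtains for all windows at once with an FFT to reach O(n log n); here the dot products are still computed naively to keep the example dependency-free.

```rust
/// Mean and population standard deviation of a slice.
fn mean_std(x: &[f64]) -> (f64, f64) {
    let m = x.len() as f64;
    let mean = x.iter().sum::<f64>() / m;
    let var = x.iter().map(|v| (v - mean).powi(2)).sum::<f64>() / m;
    (mean, var.sqrt())
}

/// Naive O(n*m) distance profile: z-normalize the query and every window,
/// then take the Euclidean distance. Constant windows (zero standard
/// deviation) are not handled and yield NaN.
fn distance_profile_naive(ts: &[f64], query: &[f64]) -> Vec<f64> {
    let (mu_q, sd_q) = mean_std(query);
    ts.windows(query.len())
        .map(|w| {
            let (mu_w, sd_w) = mean_std(w);
            w.iter()
                .zip(query)
                .map(|(a, b)| ((a - mu_w) / sd_w - (b - mu_q) / sd_q).powi(2))
                .sum::<f64>()
                .sqrt()
        })
        .collect()
}

/// Same profile from sliding dot products, using
/// d_i = sqrt(2m * (1 - (QT_i - m*mu_q*mu_i) / (m*sd_q*sd_i))).
fn distance_profile_dot(ts: &[f64], query: &[f64]) -> Vec<f64> {
    let m = query.len() as f64;
    let (mu_q, sd_q) = mean_std(query);
    ts.windows(query.len())
        .map(|w| {
            let (mu_w, sd_w) = mean_std(w);
            // QT_i: the dot product an FFT-based method computes for all windows at once.
            let qt: f64 = w.iter().zip(query).map(|(a, b)| a * b).sum();
            let corr = (qt - m * mu_q * mu_w) / (m * sd_q * sd_w);
            // Clamp at zero to absorb rounding error before the square root.
            (2.0 * m * (1.0 - corr)).max(0.0).sqrt()
        })
        .collect()
}

fn main() {
    let ts = [1.0, 3.0, 2.0, 5.0, 4.0, 6.0, 9.0, 7.0];
    let query = [2.0, 6.0, 4.0];
    let a = distance_profile_naive(&ts, &query);
    let b = distance_profile_dot(&ts, &query);
    for (i, (x, y)) in a.iter().zip(&b).enumerate() {
        println!("window {i}: naive = {x:.6}, dot-product = {y:.6}");
    }
}
```

The first window `[1, 3, 2]` is a scaled copy of the query, so its z-normalized distance is zero up to rounding; the two functions should agree on every window.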

Excerpt taken from:

@misc{
FastestSimilaritySearch,
title={The Fastest Similarity Search Algorithm for Time Series Subsequences under Euclidean Distance},
author={ Mueen, Abdullah and Zhu, Yan and Yeh, Michael and Kamgar, Kaveh and Viswanathan, Krishnamurthy and Gupta, Chetan and Keogh, Eamonn},
year={2017},
month={August},
note = {\url{http://www.cs.unm.edu/~mueen/FastestSimilaritySearch.html}}
}

Features

"jemalloc" enables jemallocator as the memory allocator.

"pseudo_distance" simplifies the distance with the same optimization goal for increased performance. The distance output is no longer the MASS distance but a score with the same optimum.
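
As a generic illustration of how a cheaper score can share the optimum of a distance: any strictly increasing transform preserves the position of the minimum. The sketch below uses the squared distance as a stand-in score. Which score this crate actually computes under "pseudo_distance" is not stated here, so the choice of squaring, the `argmin` helper, and the sample values are all assumptions made for the example.

```rust
/// Index of the smallest value in a slice (None if the slice is empty).
fn argmin(xs: &[f64]) -> Option<usize> {
    xs.iter()
        .enumerate()
        .min_by(|a, b| a.1.total_cmp(b.1))
        .map(|(i, _)| i)
}

fn main() {
    // Hypothetical distances for five windows (illustrative values only).
    let distances = [3.2, 0.7, 1.9, 2.4, 0.9];
    // Stand-in score: the squared distance, which avoids the square root.
    let scores: Vec<f64> = distances.iter().map(|d| d * d).collect();
    // The values differ, but the best window is the same.
    assert_eq!(argmin(&distances), argmin(&scores));
    println!("best window: {:?}", argmin(&scores));
}
```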

"auto" uses all logical cores to parallelize batch functions. Enabled by default. Disabling this feature exposes [`init_pool()`] to initialize the global thread pool.
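
The features above would typically be selected in `Cargo.toml`. The fragment below is a sketch that assumes the crate is published as `super_mass` (the name used in the examples) at version 0.1.0; whether the listed features combine cleanly has not been checked here.

```toml
[dependencies]
# Default build: "auto" is enabled, so batch functions use all logical cores.
super_mass = "0.1.0"

# Alternative (assumed combination): disable the default "auto" feature and
# opt in to the others explicitly.
# super_mass = { version = "0.1.0", default-features = false, features = ["jemalloc", "pseudo_distance"] }
```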

Panics

TODO

Examples

use rand::{thread_rng, Rng};

let mut rng = thread_rng();
let ts = (0..10_000).map(|_| rng.gen()).collect::<Vec<f64>>();
let query = (0..500).map(|_| rng.gen()).collect::<Vec<f64>>();
let res = super_mass::mass_batch(&ts[..], &query[..], 501, 3);
// Top matches (only the best per batch is considered): tuples of (index, distance score).
dbg!(res);

use rand::{thread_rng, Rng};

let mut rng = thread_rng();
let ts = (0..10_000).map(|_| rng.gen()).collect::<Vec<f64>>();
let query = (0..500).map(|_| rng.gen()).collect::<Vec<f64>>();
let res = super_mass::mass(&ts[..], &query[..]);
// Complete distance profile.
dbg!(res);
