9 unstable releases (3 breaking)
Uses old Rust 2015
0.4.7  Dec 21, 2017 

0.4.6 

0.4.3  Nov 16, 2017 
0.3.0  Oct 31, 2017 
0.1.0  Oct 29, 2017 
#716 in Data structures
83 downloads per month
Used in 3 crates
(via kmerutils)
69KB
1K
SLoC
Wavelet Matrix for Rust language
It provides the various analytics on very large sequence of unsigned integers in constant time.
Usage
After adding to Cargo.toml, add this line to lib.rs or main.rs.
extern crate wavelet_matrix;
See crate document top for further examples.
Benchmarks

Overall Performance
 https://github.com/sekineh/waveletmatrixrs/blob/master/BENCH.md
 Bench is performed on Intel Xeon 2800 MHz

Error evaluation of O(1) SUM
Features
Given an unsigned integer sequence T, it provides the following queries.
Basic operations
.len()
: Returns the length of
T
.  It's almost no overhead, just returning the stored value.
 Returns the length of
.lookup(pos)
: Returns the value at the position of T,
T[pos]
.  It's slower than array lookup but still is O(1). You might want to save the original Vector for faster lookup.
 Returns the value at the position of T,
Counting
Counting is performed in O(1).
.count(start..end, value)
: Returns the number of the element
e
which satisfiese == value
included inT[start..end]
 Returns the number of the element
.count_prefix(start..end, value, ignore_bit)
: Returns the number of the element
e
which satisfiese >> ignore_bit == value >> ignore_bit
included inT[start..end]
 This will be useful for counting the number of IPv4 address that satisfies IPv4 prefix such as
192.168.10.0/24
. In this case, the ignore_bit will be 8.
 Returns the number of the element
.count_lt(start..end, value)
: Returns the number of the element
e
which satisfiese < value
included inT[start..end]
 Returns the number of the element
.count_gt(start..end, value)
: Returns the number of the element
e
which satisfiese > value
included inT[start..end]
 Returns the number of the element
.count_range(start..end, val_start..val_end)
: Returns the number of the element
e
which satisfiesval_start <= e < val_end
included inT[start..end]
 Returns the number of the element
Searching
Searching is performed in O(1) per a next index.
.search(start..end, value)
: Returns the iterator that find indexes of the element
e
which satisfiese == value
included inT[start..end]
 Returns the iterator that find indexes of the element
.search_prefix(start..end, value, ignore_bit)
: Returns the iterator that find indexes of the element
e
which satisfiese >> ignore_bit == value >> ignore_bit
included inT[start..end]
 Returns the iterator that find indexes of the element
 [TODO] implement various conditions other than equal.
Ranking
Ranking is performed in roughly O(1) with regard to the number of elements n
.
.max_k(start..end, val_start..val_end, k)
: list the
(value, count)
pairs in descending order.  values are constrained to the range
val_start..val_end
.
 list the
.min_k(start..end, val_start..val_end, k)
: list the
(value, count)
pairs in ascending order.  values are constrained to the range
val_start..val_end
.
 list the
.top_k()
is also performed in O(1) in best case, but may take O(n) in the worst case where every value occurs only once!
.top_k(start..end, val_start..val_end, k)
: list the
(value, count)
pairs in mostfrequentonefirst order.  values are constrained to the range
val_start..val_end
.  [TODO] clarify the order of same count.
 list the
To achieve O(1) performance regardless of the number of unique values, use .top_k_ranges()
instead:
 [EXPERIMENTAL]
.top_k_ranges(start..end, val_start..val_end, k)
: list the
(v_start..v_end, count)
pairs in mostfrequentonefirst order.  unlike
.top_k()
,.top_k_ranges()
returns the exhaustive range set that covers all of the values.  values are constrained to the range
val_start..val_end
.  [TODO] clarify the order of same count.
 list the
Statistics
O(1) Median / O(1) Quantile
.median(start..end)
: Returns the median value from
T[start..end]
.  same as
.quantile(start..end, start + (end  start) / 2)
.
 Returns the median value from
.quantile(start..end, k)
: Returns the (k+1)th smallest value from
T[start..end]
.
 Returns the (k+1)th smallest value from
O(1) Sum / O(1) Average / O(1) Variance
Experiment 1
These methods use .top_k_ranges()
to enumerate the most relevant value ranges.
They are not as accurate as the method used in Experiment 2.
 [EXPERIMENTAL]
.sum_experiment1(start..end, val_start..val_end, k)
: Approximately calculate the sum of
T[start..end]
using up tok
wavelet tree nodes.  Only values included in the range
val_start..val_end
are processed.  To get the exact result, specify
k = m + 1
wherem
is the number of values which are unique.
 Approximately calculate the sum of
 [EXPERIMENTAL]
.mean_experiment1(start..end, val_start..val_end, k)
: Approximately calculate the average of
T[start..end]
using up tok
wavelet tree nodes.  Only values included in the range
val_start..val_end
are processed.  To get the exact result, specify
k = m + 1
wherem
is the number of values which are unique.
 Approximately calculate the average of
 [EXPERIMENTAL]
.variance_experiment1(start..end, val_start..val_end, k)
: Approximately calculate the variance of
T[start..end]
using up tok
wavelet tree nodes.  Only values included in the range
val_start..val_end
are processed.  To get the exact result, specify
k = m + 1
wherem
is the number of values which are unique.
 Approximately calculate the variance of
Experiment 2
Improvement over Experiment 1. They use custom node enumerator to minimize the error.
 [EXPERIMENTAL]
.sum_experiment2(start..end, val_start..val_end, k)
:
Experiment 3
Improvement over Experiment 2. They use Range<u64>
to tell how accurate the computed value is.
 [EXPERIMENTAL]
.sum_experiment3(start..end, val_start..val_end, k)
:
Classical WaveletMatrix operations
.rank(pos, value)
: counts value included inT[0..pos]
. Note: pos is exclusive. When pos is 0,
.rank()
always returns 0.
 Note: pos is exclusive. When pos is 0,
.select(rank, value)
: return the position of the (rank+1)th value Note: When found nothing, it returns
.len()
instead of None.
 Note: When found nothing, it returns
Releases
v0.4.7
 Fix test bug #4
 Fix test bug #3
v0.4.6
 Add test for
.median()
and.quantile()
.  Add index range check.
v0.4.5
 Fix dependent crate.
v0.4.4
 Add
.median()
and.quantile()
. They are quite fast, only take 35 us on 16bit values.  Add
.top_k_ranges()
which is faster than.top_k()
in worst case.  Add
.sum_experiment1()
,.mean_experiment1()
and.variance_experiment1()
.  Add
.sum_experiment2()
.  Add
.sum_experiment3()
.  Add
BENCH.md
bench report.  Add
BENCH_SUM.md
bench report.
v0.4.3
 Add
.bit_len()
.  Add
.dim()
.  Move examples to crate's document top.
 Add test for
.top_k()
,.max_k()
andmin_k()
.  Suppress warnings.
v0.4.2
 Add
.top_k()
,.max_k()
andmin_k()
.
v0.4.1
 Add
.search()
and.search_prefix()
.
v0.4.0
 Add
.count()
,.count_prefix()
,.count_lt()
,.count_gt()
and.count_range()
.  [INCOMPATIBLE] WaveletMatrix::new() takes
&Vec<u64>
, instead ofVec<u64>
v0.3.0
 [INCOMPATIBLE] .select() now returns .len() instead of None.
TODO

Add Benchmark.
 Implement same queries using trivial algorithm
 Compare wm's queries against trivial one.
 Make a nice plot.

Profiling

Optimize underlying rsdic structure.

Add travis CI.

Add u128 support or arbitrarylength integer support

The fastest implementation on the planet
Credits

A Go package for myriad array operations using wavelet trees
 https://github.com/hillbig/waveletTree
 Basically, the algorithm is deeply derived from the above Go implementation.

excellent slides in Japanese

The original inventor's pdf:
Dependencies
~440KB