1 unstable release

0.1.0 Oct 10, 2024

#628 in Algorithms

Download history 167/week @ 2024-10-07 16/week @ 2024-10-14

183 downloads per month

MIT/Apache

99KB
183 lines

Remedian

Remedian is a Rust implementation of The Remedian, a robust method to approximate the median of a large dataset, without needing to load the entire thing in memory. This is desirable in cases where the dataset is so large that loading the whole thing simultaneously is intractable.

Basic usage:

// The default block is configured with a reasonable b and k for most applications
let mut remedian = RemedianBlock::default();

// Read data points from our data source, and fold them into the remedian
for data_point in some_data_stream {
		remedian.add_sample_point(data_point);
}

// Get our (approximate) answer
let median = remedian.median();

For more details, check out examples/minimal.rs, examples/full.rs, or examples/custom_data.rs

The Remedian

The Remedian algorithm is quite simple in concept. It stores k arrays of size b, (where k and b are hyperparameters). Here, we see an example with 4 arrays of size 11:

Image: figure 1 from the original paper, showing the four arrays with arrows pointing from one array to the next

Figure 1 from the original paper

As points are read in from the stream, each array is filled in turn:

  1. The first 11 sample points fill the first array
  2. The median is calculated for array 1, and this intermediate median is stored in the first cell of array 2
  3. The next 11 sample points refill the first array (retaining the allocation)
  4. The median is again calculated for array 1, and this new intermediate median is stored in the second cell of array 2
  5. This process continues, progressively filling each array, until the final array is completely full of data
  6. Then, the median of the final array is calculated and returned

In this way, The Remedian can account for b^k sample points, while only using b*k space.

Crate features

This crate has a single feature, which is enabled by default

  • logging: Enables use of the log crate for warnings. If disabled, eprintln! is used instead

Testing

A sample file of 2000 randomly generated numbers can be found in test_data/2000_values.txt. The values are uniformly distributed from 1 to 1000, and are used in tests to ensure accuracy.

Dependencies