### 8 unstable releases (3 breaking)

0.4.0 | Jul 10, 2019 |
---|---|

0.3.0 | Dec 4, 2018 |

0.2.1 | Dec 4, 2018 |

0.2.0 | Jun 27, 2018 |

0.1.3 | Mar 16, 2018 |

#**132** in Data structures

**204** downloads per month

Used in **1** crate

**MIT/Apache**

19KB

156 lines

# histogram-sampler

Sample values from a distribution that is approximated by a histogram.

This crate solves a very specific problem, and it may not be what you want. It is probably easiest to explain visually. Imagine a histogram like this:

` ``0` `|``-``-``-``-``-``-``-``-``-``|` `100`
`10` `|``-``-``-``-``-``-``-``|` `80`
`20` `|``-``-``-``|` `40`
`30` `|``-``-``|` `30`
`40` `|``-``|` `20`
`50` `|``-``|` `18`
`60` `||` `10`
`70` `||` `7`

Now imagine that it was created by taking a sequence of values, counting how many there are of
each, rounding those counts to the nearest ten, and then plotting a histogram of the results.
This crate let's you take data like that, and produce a *generator* that will produce a *new*,
infinte sequence of values such that the distribution of values matches that of the sequence
that produced the histogram.

At this point, you may be wondering: why would I ever want such a thing?

Well, it turns out that you can model a bunch of useful things this way. For example, take the distribution of votes and comments on articles on a site like lobste.rs (which we actually have data for). Imagine that you want to make a workload generator that simulates the workload that such a site would see. In particular, you want to make sure you model some articles being more popular than others, some comments being more popular than others, etc. Think of what that data might look like... Many posts will have 0-10 upvotes, some will have 10-20, fewer will have 20-30, etc. Same for comments. Or upvotes on comments. Or child comments per comment. Or stories submitted by users. You get the idea. And all of these can be represented in the form given above where the sampled values would be story/comment/user ids!

Let's see an example based on some real data:

`//` original histogram data
`let` stories_per_votecount `=` `vec!``[`
`(``0``,` `16724``)``,` `(``10``,` `16393``)``,` `(``20``,` `4601``)``,` `(``30``,` `1707``)``,`
`(``40``,` `680``)``,` `(``50``,` `281``)``,` `(``60``,` `128``)``,` `(``70``,` `60``)``,` `(``80``,` `35``)``,`
`(``90``,` `16``)``,` `(``100``,` `4``)``,` `(``110``,` `4``)``,` `(``120``,` `10``)``,` `(``130``,` `1``)``,`
`(``140``,` `2``)``,` `(``160``,` `1``)``,` `(``210``,` `1``)``,` `(``250``,` `1``)``,` `(``290``,` `1``)``,`
`]``;`
`let` `mut` rng `=` `rand``::`thread_rng`(``)``;`
`let` vote_sampler `=` `Sampler``::`from_bins`(`stories_per_votecount`,` `10` `/*` bin width `*/``)``;`
`for` `_` `in` `0``..` `{`
`use` `rand``::``distributions``::`Distribution`;`
`println!``(``"`vote for `{}``"``,` vote_sampler`.``sample``(``&``mut` rng`)``)``;`
`}`

If you were to run that code (which currently runs forever), you'd see an endless sequence of
votes for various fictitious story ids in the range

(because that's how many
stories there are in the original histogram). If you were to assemble a histogram of vote
counts per story after about 370k votes, that histogram would look `[``0``,` `40650`)*almost* exactly like the
original input histogram! If you sample it earlier or later, it won't, because there will be
fewer/more votes in total, which effectively shifts the data towards lower or higher bins. But,
the distribution across story ids will be the same, so you can keep using this to simulate a
particular skewed workload continuing perpetually.

## Bias with small bin sizes

When your bin size is small (<=~10), the very first bin is actually very small. It only holds
values in the range

, since any larger value would round to 10. Unfortunately, this
also means that there's a relatively high likelihood of accidentally (we are picking randomly
after all) pushing one value that should be in the first bin into the second bin. All it would
take is for us to get that value once or twice more than the expected number of times. And it
turns out that that's actually decently likely. The net effect of this is that the first bin
(again, if the bin width is small) is likely to have too few values (usually by ~5 percentage
points), and the second bucket is likely to have too many values (by the same ~5 percentage
points).`[``0``,` `5`)

#### Dependencies

~395KB