#reservoir #stream #log #statistics #log-file #sample #input-file

app rs-tool

A command-line tool to perform reservoir sampling on a file or a stream

2 releases

0.1.1 Aug 4, 2024
0.1.0 Aug 4, 2024

#237 in Text processing

120 downloads per month

MIT license

19KB
406 lines

rs-tool: A Tool for Reservoir Sampling

rs-tool processes a log file or a stream of line-delimited records from stdin. It uses reservoir sampling to produce a sample of its input on a per-record or per-field basis. It prints its output to stdout in either tabular or JSON format.
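As an illustration of the underlying technique, here is a minimal sketch of classic reservoir sampling (Algorithm R) over line-delimited records. It is not taken from rs-tool's source: the names `XorShift64` and `reservoir_sample` are hypothetical, and the small built-in generator is only there to keep the example free of external crates.

```rust
/// Tiny xorshift64* generator so the sketch needs no external crates.
/// (Hypothetical helper; a real tool would more likely use the `rand` crate.)
struct XorShift64(u64);

impl XorShift64 {
    fn new(seed: u64) -> Self {
        // The xorshift state must never be zero.
        XorShift64(if seed == 0 { 0x9E37_79B9_7F4A_7C15 } else { seed })
    }

    fn next_u64(&mut self) -> u64 {
        let mut x = self.0;
        x ^= x >> 12;
        x ^= x << 25;
        x ^= x >> 27;
        self.0 = x;
        x.wrapping_mul(0x2545_F491_4F6C_DD1D)
    }

    /// Returns a value in 0..bound (bound must be > 0).
    /// The slight modulo bias is ignored in this sketch.
    fn below(&mut self, bound: u64) -> u64 {
        self.next_u64() % bound
    }
}

/// Algorithm R: keep the first `k` records, then let the record at
/// zero-based index `i` replace a random slot with probability k / (i + 1).
fn reservoir_sample<I>(records: I, k: usize, rng: &mut XorShift64) -> Vec<String>
where
    I: IntoIterator<Item = String>,
{
    let mut reservoir: Vec<String> = Vec::with_capacity(k);
    if k == 0 {
        return reservoir;
    }
    for (i, record) in records.into_iter().enumerate() {
        if i < k {
            reservoir.push(record);
        } else {
            // Draw j uniformly from 0..=i; the record survives only if j < k.
            let j = rng.below(i as u64 + 1) as usize;
            if j < k {
                reservoir[j] = record;
            }
        }
    }
    reservoir
}
```

Because the function accepts any iterator of `String`, the same sketch would work for a file or for stdin, for example by passing `std::io::stdin().lines().map_while(Result::ok)`. Memory use stays proportional to `k`, not to the input size.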

Given a suitable log file, you can use rs-tool to answer questions like:

  • What are the most common IP addresses that access my website?
  • Which users use the sudo command the most?
  • What are the busiest times of day for my service?
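Questions like these come down to counting how often each value of one field appears in the sample. The sketch below shows one way to do that tally. It assumes whitespace-separated fields selected by position, which is an assumption made for illustration and not a description of rs-tool's actual field handling; `top_fields` is a hypothetical name.

```rust
use std::collections::HashMap;

/// Counts the values found at whitespace-separated position `field`
/// (zero-based) across the sampled lines and returns the `n` most
/// frequent, ordered by descending count and then by value.
fn top_fields(sample: &[&str], field: usize, n: usize) -> Vec<(String, usize)> {
    let mut counts: HashMap<&str, usize> = HashMap::new();
    for line in sample {
        // Lines that are too short to contain the field are skipped.
        if let Some(value) = line.split_whitespace().nth(field) {
            *counts.entry(value).or_insert(0) += 1;
        }
    }

    let mut ranked: Vec<(String, usize)> = counts
        .into_iter()
        .map(|(value, count)| (value.to_string(), count))
        .collect();
    // Highest count first; ties are broken alphabetically for stable output.
    ranked.sort_by(|a, b| b.1.cmp(&a.1).then_with(|| a.0.cmp(&b.0)));
    ranked.truncate(n);
    ranked
}
```

Since the sample is uniform, the relative frequencies in the tally approximate those of the full log, so the most common addresses or users in the sample are likely to be the most common overall.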

When rs-tool reads its input from a file, it uses the Rayon parallelism library to construct and merge reservoirs in parallel.
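To show why reservoirs built from separate chunks can be combined, here is a sketch of one standard way to merge two reservoirs so that the result is still a uniform sample of the combined input. It is an assumption about the general approach, not rs-tool's code: `Reservoir`, `sample_chunk`, `merge`, and `XorShift64` are hypothetical names, and Rayon is left out to keep the block dependency-free. In a Rayon-based design, `sample_chunk` would plausibly run per chunk and `merge` would serve as the reduction step.

```rust
/// Tiny xorshift64* generator (hypothetical helper, avoids external crates).
struct XorShift64(u64);

impl XorShift64 {
    fn new(seed: u64) -> Self {
        XorShift64(if seed == 0 { 0x9E37_79B9_7F4A_7C15 } else { seed })
    }
    /// Returns a value in 0..bound (bound must be > 0); modulo bias ignored.
    fn below(&mut self, bound: u64) -> u64 {
        let mut x = self.0;
        x ^= x >> 12;
        x ^= x << 25;
        x ^= x >> 27;
        self.0 = x;
        x.wrapping_mul(0x2545_F491_4F6C_DD1D) % bound
    }
}

/// A sample plus the number of records it was drawn from.
/// Invariant assumed throughout: items.len() == min(k, seen).
struct Reservoir {
    items: Vec<String>,
    seen: u64,
}

/// Builds a reservoir of up to `k` records from one chunk (Algorithm R).
fn sample_chunk(chunk: &[String], k: usize, rng: &mut XorShift64) -> Reservoir {
    let mut items: Vec<String> = Vec::with_capacity(k);
    for (i, record) in chunk.iter().enumerate() {
        if i < k {
            items.push(record.clone());
        } else {
            let j = rng.below(i as u64 + 1) as usize;
            if j < k {
                items[j] = record.clone();
            }
        }
    }
    Reservoir { items, seen: chunk.len() as u64 }
}

/// Merges two reservoirs drawn from disjoint inputs. Each pick comes from
/// `a` with probability na / (na + nb), where na and nb count the records
/// each side still represents, and removes a random item from that side.
fn merge(mut a: Reservoir, mut b: Reservoir, k: usize, rng: &mut XorShift64) -> Reservoir {
    let total = a.seen + b.seen;
    let (mut na, mut nb) = (a.seen, b.seen);
    let mut merged = Vec::with_capacity(k);
    while merged.len() < k && na + nb > 0 {
        let from_a = rng.below(na + nb) < na;
        let (src, n) = if from_a {
            (&mut a.items, &mut na)
        } else {
            (&mut b.items, &mut nb)
        };
        let idx = rng.below(src.len() as u64) as usize;
        merged.push(src.swap_remove(idx));
        *n -= 1;
    }
    Reservoir { items: merged, seen: total }
}
```

Weighting each pick by the number of records a reservoir represents is what keeps a small chunk from being over-represented in the merged sample. Tracking `seen` alongside the items is the only extra state the merge needs.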

Inspired by Tim Bray's tf.

Dependencies

~7–16MB
~183K SLoC