app randlines

Similar to shuf(1), but probabilistic and minimalistic with respect to memory

4 releases

0.1.3 Nov 19, 2020
0.1.2 Nov 19, 2020
0.1.1 Nov 18, 2020
0.1.0 Nov 18, 2020

MIT license

6KB
69 lines

randlines

Prints a random subset of lines from a line-oriented file. Picks up where shuf gets killed.

Installation

$ cargo install randlines

Usage

$ randlines -h
randlines 0.1.1

Emit a random subset of lines from a file. This is a probabilistic program, you
will not get exactly `n` lines.

Typically, you can use shuf(1) which uses reservoir sampling and is very
efficient. However, if we want to extract 10M random lines from a file of 100M
lines, shuf(1) might be killed. However, randlines will not shuffle lines, just
skip over random number of lines.

USAGE:
    randlines [OPTIONS] [input]

FLAGS:
    -h, --help       Prints help information
    -V, --version    Prints version information

OPTIONS:
    -n <n>                          [default: 16]
    -s, --size-hint <size-hint>

ARGS:
    <input>
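
The help text above describes the approach only loosely, so the following is a minimal sketch of one way such a sampler could work, not the crate's actual implementation. It assumes each line is kept independently with probability `n / size_hint`, which streams the input in constant memory and preserves line order. The names `XorShift64` and `sample_lines` are hypothetical, and the tiny hand-rolled generator is there only so the example compiles without external crates.

```rust
use std::io::{self, BufRead, Write};

/// Tiny xorshift64* generator, used here only to keep the sketch
/// dependency-free. A real tool would more likely use an RNG crate.
struct XorShift64(u64);

impl XorShift64 {
    fn new(seed: u64) -> Self {
        // The xorshift state must never be zero.
        XorShift64(if seed == 0 { 0x9E37_79B9_7F4A_7C15 } else { seed })
    }

    /// Returns a uniform f64 in [0, 1).
    fn next_f64(&mut self) -> f64 {
        let mut x = self.0;
        x ^= x >> 12;
        x ^= x << 25;
        x ^= x >> 27;
        self.0 = x;
        let r = x.wrapping_mul(0x2545_F491_4F6C_DD1D);
        // Keep the top 53 bits so the result fits an f64 mantissa exactly.
        (r >> 11) as f64 / (1u64 << 53) as f64
    }
}

/// Copies each input line to `output` with probability `n / total`
/// (assumed sampling rule). Returns the number of lines emitted.
fn sample_lines<R: BufRead, W: Write>(
    input: R,
    mut output: W,
    n: u64,
    total: u64,
    rng: &mut XorShift64,
) -> io::Result<u64> {
    // With no usable size hint, fall back to keeping every line.
    let p = if total == 0 {
        1.0
    } else {
        (n as f64 / total as f64).min(1.0)
    };
    let mut emitted = 0;
    for line in input.lines() {
        let line = line?;
        if rng.next_f64() < p {
            writeln!(output, "{}", line)?;
            emitted += 1;
        }
    }
    Ok(emitted)
}

fn main() -> io::Result<()> {
    // Build 1000 synthetic lines and ask for roughly 100 of them.
    let data: String = (0..1000).map(|i| format!("line {}\n", i)).collect();
    let mut rng = XorShift64::new(42);
    let mut out = Vec::new();
    let emitted = sample_lines(data.as_bytes(), &mut out, 100, 1000, &mut rng)?;
    eprintln!("emitted {} of 1000 lines", emitted);
    Ok(())
}
```

Under this assumed rule the output size is binomially distributed, which is why the count is only approximately `n`. For the 10M-of-100M example, the keep probability would be 0.1, the expected count 10M, and the standard deviation sqrt(100M × 0.1 × 0.9) = 3000 lines.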

TODO

  • compress temporary output when reading from stdin

Dependencies

~5–14MB
~179K SLoC