11 releases
Version | Date |
---|---|
0.3.1 | Sep 20, 2022 |
0.3.0 | |
0.2.6 | Sep 13, 2022 |
0.1.2 | Sep 11, 2022 |
banzai
banzai is a pure Rust bzip2 encoder. It is currently alpha software: it has undergone only a limited amount of testing and should not be relied upon to perform well or to keep your data intact. That said, I do care about performance and reliability, and bug reports are warmly appreciated! In the long term I would like to get this library to a state where it can be relied upon in production software.
To use banzai as a command-line tool with a similar interface to bzip(1), install bnz through cargo.
This library runs in time linear in the size of the input and makes no use of unsafe. When it is more mature, these properties should make it a good choice for safety-critical applications.
In general, banzai achieves similar compression ratios to the reference implementation. However, as a general rule, the runtime tends to be approximately twice as long (notwithstanding cases when the reference implementation uses its fallback algorithm, which is substantially slower). I believe this is because the runtime is dominated by the Burrows-Wheeler Transform. Since bzip2 uses a 'wrap-around' version of the BWT, banzai is obliged to compute the suffix array of the input concatenated with itself. I intend to investigate ways in which the redundancy inherent to inputs of this form can be exploited to optimise suffix array construction.
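To make the 'wrap-around' point concrete, here is a minimal sketch of a cyclic BWT obtained by sorting the first n suffixes of the input concatenated with itself. It is not banzai's code: it uses a naive comparison sort rather than SA-IS, so it is far from linear-time, and the function name `cyclic_bwt` is made up for illustration.

```rust
/// Naive cyclic ("wrap-around") BWT. Illustration only: this is a
/// hypothetical helper, not part of banzai, and it sorts suffixes by
/// direct comparison instead of building a suffix array with SA-IS.
///
/// Returns the transformed bytes and the row at which the original
/// input appears among the sorted rotations.
fn cyclic_bwt(input: &[u8]) -> (Vec<u8>, usize) {
    let n = input.len();
    if n == 0 {
        return (Vec::new(), 0);
    }

    // The input concatenated with itself: the suffix starting at i < n
    // begins with the i-th rotation of the input.
    let doubled: Vec<u8> = input.iter().chain(input.iter()).copied().collect();

    // Sort only the first n suffix start positions. Two suffixes that
    // tie on their first n bytes are equal rotations, which contribute
    // the same output byte, so the tie-break does not affect the result.
    let mut starts: Vec<usize> = (0..n).collect();
    starts.sort_by(|&a, &b| doubled[a..].cmp(&doubled[b..]));

    // The last byte of rotation i is the byte just before position i,
    // wrapping around, which is doubled[i + n - 1].
    let bwt: Vec<u8> = starts.iter().map(|&i| doubled[i + n - 1]).collect();

    // Row of the unrotated input, needed to invert the transform.
    let origin = starts
        .iter()
        .position(|&i| i == 0)
        .expect("start position 0 is always present");

    (bwt, origin)
}
```

For the input `banana`, the sorted rotations are `abanan`, `anaban`, `ananab`, `banana`, `nabana`, `nanaba`, so the sketch returns the bytes `nnbaaa` with the original at row 3. Half of the doubled string's suffixes are never needed, which is the redundancy mentioned above.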
This library does not (currently) include a decompressor. Paolo Barbolini's bzip2-rs offers a pure Rust bzip2 decompressor, though I have not used it myself and cannot vouch for its quality.
Interface
fn encode<R, W>(reader: R, writer: io::BufWriter<W>, level: usize) -> io::Result<usize>
where
    R: io::BufRead,
    W: io::Write
Call encode with a reference to an input buffer and a BufWriter. The final parameter is level, an integer between 1 and 9 inclusive that sets the block size (block size is level * 100_000 bytes). The typical default is 9. encode returns the number of input bytes encoded.
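The relationship between level and block size can be sketched as a small helper. This is an illustration under assumptions: `block_size` is a hypothetical function, not part of banzai's API, and nothing here says how banzai itself treats a level outside the documented range.

```rust
use std::io;

/// Hypothetical helper (not part of banzai): checks that `level` is in
/// the documented range 1..=9 and returns the matching block size in
/// bytes, computed as level * 100_000.
fn block_size(level: usize) -> io::Result<usize> {
    if (1..=9).contains(&level) {
        Ok(level * 100_000)
    } else {
        // Rejecting out-of-range levels is this sketch's own choice.
        Err(io::Error::new(
            io::ErrorKind::InvalidInput,
            "level must be between 1 and 9 inclusive",
        ))
    }
}
```

With the typical default of 9 this gives a block size of 900_000 bytes. A caller could run a check like this on user input before passing the level as the final argument to encode.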
Acknowledgements
This is original libre software. However, implementation guidance was derived from several free-software sources.
The suffix array construction algorithm used in banzai is SA-IS, which was developed by Ge Nong, Sen Zhang, and Wai Hong Chan. Guidance for implementing SA-IS was derived from Yuta Mori's sais and burntsushi's suffix.
The implementation of Huffman coding used in banzai takes heavy inspiration from the reference implementation of bzip2, originally authored by Julian Seward and currently maintained by Micah Snyder.
Finally, the unofficial bzip2 Format Specification written by Joe Tsai was extremely helpful when it came to the specifics of the bzip2 binary format.