2 releases
0.1.1 | Dec 27, 2023 |
---|---|
0.1.0 | Dec 27, 2023 |
#2207 in Algorithms
14KB
143 lines
bufhash
Buffered hashing utilities for Rust.
Overview
See the documentation for details on the package and how to write a buffered hasher.
lib.rs
:
Buffered hashing facilities.
This crate provides facilities for writing hashers that take advantage of buffering
in various ways. Most notably, it provides the concept of partitioned hashers,
which are a class of hashers that split data into fixed-sized chunks before hashing.
This crate, and all of the facilities therein, are written in 100% Rust and are
designed to integrate with the existing Rust traits, such as std::hash::Hasher
.
Overview
Partitioned hashers are a class of hashers that work on fixed-size chunks of data
(for example, MurmurHash3). Fundamentally, the facilities provided under the
partitioned
namespace exist to solve the problem of accounting while
partitioning data during incremental hashing. Underneath the hood, partitioned
hashers keep track of chunking streaming inputs into the appropriate sizes and
storing excess data within an internal buffer to be used in the next
write()
call.
For example, consider the following invocation of a fictional hashing algorithm that works in chunks of four (4) bytes at a time:
let mut hasher = MyHasher::default();
hasher.write(b"Hi,");
hasher.write(b"there ");
hasher.write(b"world!");
println!("Result: 0x{:X}", hasher.finish());
In this example, the data being hashed is incrementally fed to the hasher over three
different write()
calls. Recalling that our hasher
works on four byte chunks, the first three characters in the byte string literal
(i.e., b"Hi,"
) are not sufficient to fill out a full block of data to be hashed.
However, we similarly do not want to treat these three characters as if they are the
end of the data stream, as more data may be coming (and, in fact, is in our
example. With the facilities provided by the Rust standard library, it is up to the
implementer to keep track of these various edge cases and ensure the hasher works
appropriately.
To ease implementation of hashers that fit the aforementioned profile, partitioned hashers take care of the necessary accounting behind the scenes and allow the implementer to simply focus on a fixed-size interface for hashing.
Implementing a Partitioned Hasher
Let's look at the core bufhash::partitioned::Hasher
trait to ascertain how writing partitioned hashers is different than writing an
implementation for a std::hash::Hasher
.
pub trait Hasher<const S: usize> {
fn write(&mut self, bytes: &[u8; S]);
fn finish(&self, bytes: &[u8]) -> u64;
}
You'll find the same two methods as provided in std::hash::Hash
with a few,
minor (but important!) differences:
- The trait has a generic parameter
const S: usize
, which defines the number of bytes per partition. In the case of our example above, we would set this to four (4) bytes. - The
write()
method takes a fixed-size array of bytes (&[u8; S]
). Thebytes
parameter is guaranteed to be exactly the size of the generic parameter, allowing the implementer to forgo size checking. - The
finish()
method also provides abytes
parameter as a byte slice. This parameter are the bytes left over in the internal buffer that need to be handled before the hasher is finished: it is guaranteed to have a size less than the generic parameter.
As an example, let's implement a partitioned hasher that (a) interprets the byte
stream as little-endian u64
s that must be added together and then (b) shifts the
result left by the number of bytes remaining in the buffer when finish()
is called
(not a particularly good hashing algorithm, but it will work for our demonstration
purposes).
#[derive(Debug, Default)]
pub struct Simple(u64);
impl bufhash::partitioned::Hasher<8> for Simple {
fn write(&mut self, bytes: &[u8; 8]) {
let data = u64::from_le_bytes(*bytes);
self.0 = self.0.wrapping_add(data);
}
fn finish(&self, bytes: &[u8]) -> u64 {
self.0 << bytes.len()
}
}
As you can see, this is much more straightforward to implement and maintain that handling the accounting yourself.
To use this new implementation with the std::hash::Hasher
interface, you can use
the PartitionedHasher
adapter like so:
use std::hash::Hasher as _;
#
let mut hasher = bufhash::PartitionedHasher::new(Simple::default());
hasher.write(b"Hello, world!");
println!("Result: {:#X}", hasher.finish());