tlsh-fixed

Rust port of Trend Micro Locality Sensitive Hashing

1 unstable release

0.1.1 Feb 24, 2023

#1387 in Algorithms

Download history 414/week @ 2024-06-17 563/week @ 2024-06-24 388/week @ 2024-07-01 456/week @ 2024-07-08 892/week @ 2024-07-15 594/week @ 2024-07-22 748/week @ 2024-07-29 1146/week @ 2024-08-05 515/week @ 2024-08-12 506/week @ 2024-08-19 890/week @ 2024-08-26 527/week @ 2024-09-02 1407/week @ 2024-09-09 914/week @ 2024-09-16 931/week @ 2024-09-23 847/week @ 2024-09-30

4,162 downloads per month
Used in 5 crates (4 directly)

BSD-3-Clause OR Apache-2.0

60KB
797 lines

TLSH-RS

Crates.io Documentation Build

This is a Rust port of Trend Micro Locality Sensitive Hash (TLSH) [github], [website] algorithm to compute a hash value of a byte stream. These generated hash values can then be used for similarity detection, data clustering or nearest neighbour search.

Overview

The algorithm to construct a TLSH digest is as follows (for more detail, see [1]):

  • Step 1: processes an input stream by using a sliding window of length 5 and populates the hash buckets. Each triplet is passed through a hash function (in this implementation, the hash function is the Pearson hashing).
  • Step 2: calculates the quartile points from the hash bucket obtained in step 1. This step might requires the sorting of the bucket array: q1: the lowest 25% of the array q2: the lowest 50% of the array q3: the lowest 75% of the array
  • Step 3: computes the digest header. The first three bytes of a hash is reserved for the header. The header of a TLSH hash consists of three parts:
    • The first byte is a checksum (with some modulo) of the byte string
    • The second byte is computed from the logarithm of the byte string's length (with some modulo)
    • The third byte is the result of q1_ratio <<< 4 | q2_ratio, where q1_ratio = (q1 * 100 / q3) MOD 16 q2_ratio = (q2 * 100 / q3) MOD 16
  • Step 4: constructs the digest body from the bucket array. Note: in this step, the reversing order in reading the bucket is assumed. This means, the last element is read first while the first is read last. Their value is converted into hex form and appended into the final hash value.

Examples

The example examples/tlsh_files.rs shows how we calculate hash values from files and measure their difference (distance). To run the example, use the following command in command line:

cargo run --release --example tlsh_files ../path/to/folder/with/files

References

J. Oliver, C. Cheng and Y. Chen (2013). "TLSH - A Locality Sensitive Hash" [pdf].

License

TLSH is provided for use under two licenses: Apache OR BSD. Users may opt to use either license depending on the license restictions of the systems with which they plan to integrate the TLSH code.

No runtime deps