Fast Near-duplicate Matching

Fast near-duplicate matching is a method for quickly finding spans in a document that are near-duplicates of a query, using the Rabin-Karp algorithm to filter candidate positions. It can be used to count the near-duplicates of a query in a pre-training corpus, which facilitates the analysis of memorization in Large Language Models (LLMs).

This repository contains a Rust implementation of the fast near-duplicate matching algorithm, together with benchmarks. The core functionality is provided as the neardup crate.

Method

Fast Near-duplicate Matching

  • Input: suffix $s$, document $d$, $n$-gram size $n$, and a similarity threshold
  • Output: whether $d$ contains a span that is a near-duplicate of $s$

Pseudo-code in Python

def fast_near_duplicate_matching(s: list[int], d: list[int], n: int, threshold: float) -> bool:
    l_s = len(s)
    l_d = len(d)
    # Hash every n-gram of the suffix s (as tuples, so they are hashable).
    H = set(ngram(s, n))
    # Scan every n-gram position in the document.
    for i in range(l_d - n + 1):
        if tuple(d[i:i + n]) in H:
            # Check each window of length l_s that covers the matching n-gram.
            for j in range(max(i - l_s + n, 0), min(i, l_d - l_s) + 1):
                t = d[j:j + l_s]
                if jaccard_w(s, t) >= threshold:
                    return True
    return False
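
The pseudo-code relies on two helpers that are not defined above. Here is a minimal sketch, assuming ngram enumerates the contiguous n-grams of a sequence and jaccard_w is the weighted (multiset) Jaccard similarity over token counts:

from collections import Counter

def ngram(x: list[int], n: int) -> set[tuple[int, ...]]:
    """All contiguous n-grams of x, as hashable tuples."""
    return {tuple(x[i:i + n]) for i in range(len(x) - n + 1)}

def jaccard_w(a: list[int], b: list[int]) -> float:
    """Weighted (multiset) Jaccard similarity between two token sequences."""
    ca, cb = Counter(a), Counter(b)
    inter = sum((ca & cb).values())  # multiset intersection size
    union = sum((ca | cb).values())  # multiset union size
    return inter / union if union else 0.0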

To build and look up the n-gram set, you can use a fast hash function such as fxhash, or a rolling hash.

When the n-gram size $n$ is small, fxhash is faster than the rolling hash. When $n$ is large, the rolling hash is faster, because it computes the hash value of the next $n$-gram in $O(1)$ time instead of rehashing all $n$ tokens.
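
For intuition, here is a minimal polynomial rolling-hash sketch in Python; the base and modulus are illustrative constants, not the crate's actual parameters:

def rolling_hashes(x: list[int], n: int, base: int = 1_000_003, mod: int = (1 << 61) - 1) -> list[int]:
    """Hashes of every n-gram of x; each update after the first window is O(1)."""
    if len(x) < n:
        return []
    h = 0
    for t in x[:n]:              # hash of the first window: O(n)
        h = (h * base + t) % mod
    out = [h]
    top = pow(base, n - 1, mod)  # weight of the outgoing token
    for i in range(n, len(x)):
        # Drop x[i - n], shift, and add x[i]: an O(1) update per step.
        h = ((h - x[i - n] * top) * base + x[i]) % mod
        out.append(h)
    return out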

How to run

You can count the near-duplicates of queries in a set of documents (two sample queries and 100 sample documents are provided in the sample_data folder) by running the following command:

$ cargo run --release -- --search-dir ./sample_data --query-path ./sample_data/query.jsonl --threshold 0.6 --n 10
[2024-08-07T10:59:40Z INFO  neardup] query_list_all: 2
[2024-08-07T10:59:40Z INFO  neardup] search_path_list len: 1
[2024-08-07T10:59:40Z INFO  neardup] path: "./sample_data/pythia-00000-00999.jsonl.gz" start loading token_ids_list
[2024-08-07T10:59:40Z INFO  neardup] loaded token_ids_list
[2024-08-07T10:59:40Z INFO  neardup] query: 0 count: 1
[2024-08-07T10:59:40Z INFO  neardup] query: 1 count: 1
[2024-08-07T10:59:40Z INFO  neardup] path idx: 0 finished
[2024-08-07T10:59:40Z INFO  neardup] count: [1, 1]
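
The CLI reports a per-query count rather than a boolean. A minimal sketch of one plausible counting rule, assuming each document containing a near-duplicate contributes one to the count (the crate's exact rule may differ, e.g. counting individual occurrences):

def count_near_duplicates(queries: list[list[int]], docs: list[list[int]],
                          n: int, threshold: float) -> list[int]:
    """Per-query count of documents that contain a near-duplicate span."""
    return [sum(fast_near_duplicate_matching(q, d, n, threshold) for d in docs)
            for q in queries]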

Count near-duplicates in the Pythia dataset

You can download the Pythia dataset from here. After downloading the dataset, you can convert it to the format that this program can read by running the following command:

$ python scripts/prepare_pythia.py --output_dir path/to/output/folder --pythia_data_path path/to/merged/folder/document

Then, you can count the near-duplicates of queries in the Pythia dataset by running the following command:

$ cargo run --release -- --search-dir path/to/output/folder --query-path path/to/output/folder/query.jsonl --threshold 0.6 --n 10

Benchmark

You can run the benchmark for the three methods:

  • Fast near-duplicate matching with fxhash
  • Fast near-duplicate matching with rolling hash
  • Naive near-duplicate matching (a sketch of this baseline follows below)

$ cargo bench
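
For reference, here is a sketch of the naive baseline, which evaluates the weighted Jaccard similarity for every window of length len(s) without the n-gram prefilter (using the jaccard_w helper sketched above):

def naive_near_duplicate_matching(s: list[int], d: list[int], threshold: float) -> bool:
    """Baseline: test every window of length len(s); no n-gram prefilter."""
    l_s = len(s)
    return any(jaccard_w(s, d[j:j + l_s]) >= threshold
               for j in range(len(d) - l_s + 1))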
