#index #dataset #sparse #embedding #query #retrieval #inverted

nightly bin+lib seismic

Seismic is designed for effective and efficient KNN retrieval over learned sparse embeddings

1 unstable release

0.1.0 Jul 4, 2024

#389 in Science

MIT license

730KB
2.5K SLoC

Rust 2K SLoC // 0.1% comments Python 289 SLoC // 0.1% comments Shell 38 SLoC // 0.4% comments Jupyter Notebooks 31 SLoC // 0.3% comments

Seismic

Seismic is designed for effective and efficient retrieval over learned sparse embeddings. Pleasantly, the design uses in a new way two familiar data structures: the inverted and the forward index. The approach organizes inverted lists into geometrically-cohesive blocks. Each block is equipped with a sketch, serving as a summary of the vectors contained in it. The summaries allow us to skip over a large number of blocks during retrieval and save substantial compute. When a summary indicates that a block must be examined, we use the forward index to retrieve exact embeddings of its documents and compute inner products.

The figure below gives an overview of the overall design.

The design of Seismic.

Experimental results show that single-threaded query processing using Seismic, reaches sub-millisecond per-query latency on various sparse embeddings of the MSMarco dataset while maintaining high recall. The results indicate that Seismic is one to two orders of magnitude faster than state-of-the-art inverted index-based solutions and further outperforms the winning (graph-based) submissions to the BigANN Challenge by a significant margin.

See the paper [1] for more details.

Experimental Results

We report here a comparison with other state-of-the-art indexes for sparse vectors. See the paper [1] for more detailed experimental evaluation.

Comparison of Seismic with sparse kANN state-of-the-art.

How to Replicate the Experiments

To run the experiments with Seismic, we need to compile the binary executables using:

RUSTFLAGS="-C target-cpu=native" cargo build --release

This command produces three executables: build_inverted_index, perf_inverted_index, and generate_groundtruth in the /target/release/ directory.

The build_inverted_index executable is used to construct an inverted index for a dataset. Both dataset and query files are stored in an internal binary format. Refer to the Python scripts section for a script to convert a dataset from JSON format. This process involves several parameters that regulate space/time trade-offs:

  • --n-postings: Regulates the size of the posting list, representing the average number of postings stored per posting list.
  • --summary-energy: Controls the size of the summaries, preserving a fraction of the overall energy for each summary.
  • --centroid-fraction: Determines the number of centroids built for each posting list, capped at a fraction of the posting list length.

For Splade on MSMarco, good choices are --n-postings 3500, --summary-energy 0.4, and --centroid-fraction 0.1.

The following command can be used to create a Seismic index serialized in the file splade.bin.3500.seismic:

./target/release/build_inverted_index -i splade.bin -o splade.bin.3500_0.4_0.1 --centroid-fraction 0.1 --summary-energy 0.4 --n-postings 3500 

To execute a set of queries, use the perf_inverted_index executable. Two parameters, query-cut and heap-factor, trade-off efficiency vs accuracy:

  • --query-cut: The search algorithm considers only the top query_cut components of the query.
  • --heap-factor: The search algorithm skips a block whose estimated dot product is greater than heap_factor times the smallest dot product of the top-k results in the current heap.

The following command exemplifies this:

./target/release/perf_inverted_index -i splade.bin.3500_0.4_0.1c -q splade_queries.bin -o results.tsv --query-cut 5 --heap-factor 0.7

The dataset of queries is in binary internal format. Refer again to the Python scripts section for a script to convert a dataset from JSON format.

The executable prints the average running time per query. Queries are executed in single-thread mode. To enable multithreading, modify the Rust code by replacing the iteration on the query from queries.iter().take(n_queries).enumerate() to queries.par_iter().take(n_queries).enumerate().

The results are written in the file results.tsv. For each query, there are k lines, one for each of its results. Each line follows this format:

query_id\tdocument_id\tresult_rank\tdot_product

Here, query_id is a progressive identifier for the query, document_id is the identifier of the document in the indexed dataset, and result_rank indicates their rank in the ordering by their dot_product with the query.

To evaluate the accuracy of the retrieved results against an already computed ground truth, use the Python script scripts/accuracy.py:

python3 scripts/accuracy.py groundtruth.tsv results.tsv

This will output the recall percentage.

The ground truth for a dataset can be computed with generate_groundtruth as follows:

./target/release/generate_groundtruth -i splade.bin -q splade_queries.bin -o groundtruth.tsv

The exact top-10 results of each query are written in the file groundtruth.tsv with the format described above.

Even if multithreading is enabled here, the execution may take a considerable amount of time due to the brute-force exact query algorithm that scans the entire dataset for each query.

Seismic Parameters

The table below reports the building parameters we used for the different datasets.

Dataset --n-postings --centroid-fraction --summary-energy
MSMARCO-Splade 4000 0.1 0.4
MSMARCO-Effsplade 4000 0.1 0.4
MSMARCO-UniCOIL-T5 4000 0.1 0.4
NQ-Splade 3500 0.15 0.5

The table below reports the query parameters we used for the different datasets and various levels of accuracy.

Results may change slightly if you re-create the indexes, due to the random selection of the centroids of Seismic.

Splade on MSMARCO

$hf$ $q_c$ Time [ $\mu s$ ] Accuracy@10
0.9 3 187 90.49
0.9 4 206 92.27
0.9 4 206 92.27
0.9 5 222 93.13
0.9 8 269 94.07
0.8 5 303 95.69
0.8 6 348 96.11
0.7 6 531 97.17

E-Splade on MSMARCO

$hf$ $q_c$ Time [ $\mu s $] Accuracy@10
1 3 222 90.99
1 4 222 90.99
1 4 239 93.26
1 4 239 93.26
1 6 256 94.17
0.9 4 376 95.95
0.9 5 383 96.53
0.8 5 581 97.47

Unicoil-T5 on MSMARCO

$hf$ $q_c$ Time [ $\mu s$ ] Accuracy@10
1 3 115 90.04
1 4 123 91.33
1 6 133 92.07
0.9 3 168 94.03
0.9 3 168 94.03
0.9 4 180 95.13
0.8 4 268 96.76
0.8 5 280 97.19

NQ on MSMARCO

$hf$ $q_c$ Time [ $\mu s$ ] Accuracy@10
1 3 195 91.25
1 4 195 91.25
1 6 211 92.23
0.9 3 240 93.24
0.9 3 266 95.17
0.9 4 266 95.17
0.8 4 286 96.26
0.8 5 362 97.18

We provide a script to explore the search parameters given a trained index; the script is scripts/grid_search_only_accuracy.sh. You can use it as follows:

index_path=""
results_file_path=""
queries_path=""
gt_path=""

bash scripts/grid_search_only_accuracy.sh $index_path $results_file_path $queries_path $gt_path

The script writes the result of the grid search in results_file_path. To get the fastest configuration at each accuracy cut, simply run

python scripts/extract_results.py --file-path $results_file_path

Python Scripts

Download the Datasets

The embeddings in jsonl format used in our experiments can be downloaded from this HugginFace repository, together with the queries representations.

As an example, the Splade embeddings for MsMarco can be downloaded and extracted by runnning the following commands.

wget https://huggingface.co/datasets/tuskanny/seismic-msmarco-splade/resolve/main/documents.tar.gz?download=true -O documents.tar.gz 

tar -xvzf documents.tar.gz

or by using the Huggingface dataset download tool.

Convert the Data

Documents and queries should have the following format. Each line should be a JSON-formatted string with the following fields:

  • id: must represent the ID of the document as an integer.
  • content: the original content of the document, as a string. This field is optional.
  • vector: a dictionary where each key represents a token, and its corresponding value is the score, e.g., {"dog": 2.45}.

This is the standard output format of several libraries to train sparse models, such as learned-sparse-retrieval.

The script convert_json_to_inner_format.py allows converting files formatted accordingly into the seismic inner format.

python scripts/convert_json_to_inner_format.py --document-path /path/to/document.jsonl --queries-path /path/to/queries.jsonl --output-dir /path/to/output 

This will generate a data directory at the /path/to/output path, with documents.bin and queries.bin binary files inside.

If you download the NQ dataset from the HuggingFace repo, you need to specify --input-format nq as it uses a slightly different format.

Using the Rust Code

To incorporate the Seismic library into your Rust project, navigate to your project directory and run the following Cargo command:

cargo add seismic

This command adds the Seismic library to your project.

Creating a Toy Dataset

Let's create a toy dataset comprising vectors with f32 values. Next, we'll convert this dataset to use half-precision floating points (half::f16). Finally, we'll check the number of vectors, the dimensionality, and the number of non-zero components of the dataset.

use seismic::SparseDataset;
use half::f16;

let data = vec![
                (vec![0, 2, 4],    vec![1.0, 2.0, 3.0]),
                (vec![1, 3],       vec![4.0, 5.0]),
                (vec![0, 1, 2, 3], vec![1.0, 2.0, 3.0, 4.0])
                ];

let dataset: SparseDataset<f16> = data.into_iter().collect::<SparseDataset<f32>>().into();;

assert_eq!(dataset.len(), 3);  // Number of vectors  
assert_eq!(dataset.dim(), 5);  // Number of components
assert_eq!(dataset.nnz(), 9);  // Number of non zero components

The following code shows how to read a dataset in the internal binary format with f32 values and quantize those values to f16.

let dataset = SparseDataset::<f32>::read_bin_file(&input_filename).unwrap().quantize_f16();

Building and Querying an Index

Let's build an index using the above toy dataset and search for a query.

use seismic::inverted_index::{Configuration,InvertedIndex};
use seismic::SparseDataset;

use half::f16;

let data = vec![
                (vec![0, 2, 4],    vec![1.0, 2.0, 3.0]),
                (vec![1, 3],       vec![4.0, 5.0]),
                (vec![0, 1, 2, 3], vec![1.0, 2.0, 3.0, 4.0])
                ];

let dataset: SparseDataset<f16> = data.into_iter().collect::<SparseDataset<f32>>().into();;

let inverted_index = InvertedIndex::build(dataset, Configuration::default());

let result = inverted_index.search(&vec![0, 1], &vec![1.0, 2.0], 1, 5, 0.7);

assert_eq!(result[0].0, 8.0);
assert_eq!(result[0].1, 1);

There are building configuration parameters to experiment with. Take a look at build_inverted_index.rs code for an example.

The most important ones are

  • n_postings in PruningStrategy::GlobalThreshold: Regulates the size of the posting list, representing the average number of postings stored per posting list.
  • summary_energy in SummarizationStrategy::EnergyPerserving: Controls the size of the summaries, preserving a fraction of the overall energy for each summary.
  • centroid_fraction in BlockingStrategy::RandomKmeans: Determines the number of centroids built for each posting list, capped at a fraction of the posting list length.

Refer to Seismic parameters for recommended values for different datasets.

Take a look at build_inverted_index.rs and perf_inverted_index.rs for examples to serialize/deserialize an index on a file.

The signature of the search method is

pub fn search(
        &self,
        query_components: &[u16],
        query_values: &[f32],
        k: usize,
        query_cut: usize,
        heap_factor: f32,
    ) -> Vec<(f32, usize)>

It accepts a sparse vector for the query (query_components and query_values), k for top-k results, and parameters query_cut and heap_factor for trade-off between accuracy and query time.

  • query_cut: The search algorithm considers only the top query_cut components of the query.
  • heap_factor: The search algorithm skips a block whose estimated dot product is greater than heap_factor times the smallest dot product of the top-k results in the current heap.

Refer to Seismic parameters for their influence on recall and query time on the different datasets.

Using the Python Interface

We have also included a Python interface for convenience. It is fairly straightforward to build the Python interface with maturin as follows:

pip install maturin
RUSTFLAGS="-C target-cpu=native" maturin build -r
for whl in target/wheels/*.whl; do pip3 install $whl; done

Confirm that the installation was successful by importing seismic, building an index, and querying it:

from seismic import PySeismicIndex

# We assume that the sparse dataset is in the internal format.
index = PySeismicIndex.build(
    input_file,
    n_postings=3500,
    centroid_fraction=0.1,
    truncated_kmeans_training=False,
    truncation_size=16,
    min_cluster_size=2,
    summary_energy=0.4)
    
# You can serialize and store the index in a file.
index.save(index_path)

# You may later load the index to query it.
index = PySeismicIndex.load(index_path)

# Search can be done either for a single query. 
results: List[Tuple[float, int]] = index.search(
    query_components=np.array([...], dtype=np.int32),
    query_values=np.array([...], dtype=np.float32),
    k, query_cut, heap_factor)

# You may also (concurrently) search the index with a batch of
# queries. Assuming the queries are stored in the internal format,
# you can invoke the following function:

results: List[List[Tuple[float, int]]] = index.batch_search(
    query_path, k, query_cut, heap_factor, num_threads)

Bibliography

  1. Sebastian Bruch, Franco Maria Nardini, Cosimo Rulli, Rossano Venturini. "Efficient Inverted Indexes for Approximate Retrieval over Learned Sparse Representations." In ACM SIGIR. 2024.

Citation License

The source code in this repository is subject to the following citation license:

By downloading and using this software, you agree to cite the under-noted paper in any kind of material you produce where it was used to conduct a search or experimentation, whether be it a research paper, dissertation, article, poster, presentation, or documentation. By using this software, you have agreed to the citation license.

@inproceedings{Seismic,
  author = {Sebastian Bruch and Franco Maria Nardini and Cosimo Rulli and Rossano Venturini},
  title = {Efficient Inverted Indexes for Approximate Retrieval over Learned Sparse Representations},
  booktitle = {The 47th International ACM SIGIR Conference on Research and Development in Information Retrieval ({SIGIR})},
  publisher = {ACM},
  year = {2024}
}

Dependencies

~10–19MB
~256K SLoC