1 unstable release
0.1.0 | Jul 4, 2024 |
---|
#381 in Science
730KB
2.5K
SLoC
Seismic
Seismic is designed for effective and efficient retrieval over learned sparse embeddings. Pleasantly, the design uses in a new way two familiar data structures: the inverted and the forward index. The approach organizes inverted lists into geometrically-cohesive blocks. Each block is equipped with a sketch, serving as a summary of the vectors contained in it. The summaries allow us to skip over a large number of blocks during retrieval and save substantial compute. When a summary indicates that a block must be examined, we use the forward index to retrieve exact embeddings of its documents and compute inner products.
The figure below gives an overview of the overall design.
Experimental results show that single-threaded query processing using Seismic, reaches sub-millisecond per-query latency on various sparse embeddings of the MSMarco dataset while maintaining high recall. The results indicate that Seismic is one to two orders of magnitude faster than state-of-the-art inverted index-based solutions and further outperforms the winning (graph-based) submissions to the BigANN Challenge by a significant margin.
See the paper [1] for more details.
Experimental Results
We report here a comparison with other state-of-the-art indexes for sparse vectors. See the paper [1] for more detailed experimental evaluation.
How to Replicate the Experiments
To run the experiments with Seismic, we need to compile the binary executables using:
RUSTFLAGS="-C target-cpu=native" cargo build --release
This command produces three executables: build_inverted_index
, perf_inverted_index
, and generate_groundtruth
in the /target/release/
directory.
The build_inverted_index
executable is used to construct an inverted index for a dataset. Both dataset and query files are stored in an internal binary format. Refer to the Python scripts section for a script to convert a dataset from JSON format. This process involves several parameters that regulate space/time trade-offs:
--n-postings
: Regulates the size of the posting list, representing the average number of postings stored per posting list.--summary-energy
: Controls the size of the summaries, preserving a fraction of the overall energy for each summary.--centroid-fraction
: Determines the number of centroids built for each posting list, capped at a fraction of the posting list length.
For Splade on MSMarco, good choices are --n-postings 3500
, --summary-energy 0.4
, and --centroid-fraction 0.1
.
The following command can be used to create a Seismic index serialized in the file splade.bin.3500.seismic
:
./target/release/build_inverted_index -i splade.bin -o splade.bin.3500_0.4_0.1 --centroid-fraction 0.1 --summary-energy 0.4 --n-postings 3500
To execute a set of queries, use the perf_inverted_index
executable. Two parameters, query-cut
and heap-factor
, trade-off efficiency vs accuracy:
--query-cut
: The search algorithm considers only the topquery_cut
components of the query.--heap-factor
: The search algorithm skips a block whose estimated dot product is greater thanheap_factor
times the smallest dot product of the top-k results in the current heap.
The following command exemplifies this:
./target/release/perf_inverted_index -i splade.bin.3500_0.4_0.1c -q splade_queries.bin -o results.tsv --query-cut 5 --heap-factor 0.7
The dataset of queries is in binary internal format. Refer again to the Python scripts section for a script to convert a dataset from JSON format.
The executable prints the average running time per query. Queries are executed in single-thread mode. To enable multithreading, modify the Rust code by replacing the iteration on the query from queries.iter().take(n_queries).enumerate()
to queries.par_iter().take(n_queries).enumerate()
.
The results are written in the file results.tsv
. For each query, there are k
lines, one for each of its results. Each line follows this format:
query_id\tdocument_id\tresult_rank\tdot_product
Here, query_id
is a progressive identifier for the query, document_id
is the identifier of the document in the indexed dataset, and result_rank
indicates their rank in the ordering by their dot_product
with the query.
To evaluate the accuracy of the retrieved results against an already computed ground truth, use the Python script scripts/accuracy.py
:
python3 scripts/accuracy.py groundtruth.tsv results.tsv
This will output the recall percentage.
The ground truth for a dataset can be computed with generate_groundtruth
as follows:
./target/release/generate_groundtruth -i splade.bin -q splade_queries.bin -o groundtruth.tsv
The exact top-10 results of each query are written in the file groundtruth.tsv
with the format described above.
Even if multithreading is enabled here, the execution may take a considerable amount of time due to the brute-force exact query algorithm that scans the entire dataset for each query.
Seismic Parameters
The table below reports the building parameters we used for the different datasets.
Dataset | --n-postings |
--centroid-fraction |
--summary-energy |
---|---|---|---|
MSMARCO-Splade | 4000 | 0.1 | 0.4 |
MSMARCO-Effsplade | 4000 | 0.1 | 0.4 |
MSMARCO-UniCOIL-T5 | 4000 | 0.1 | 0.4 |
NQ-Splade | 3500 | 0.15 | 0.5 |
The table below reports the query parameters we used for the different datasets and various levels of accuracy.
Results may change slightly if you re-create the indexes, due to the random selection of the centroids of Seismic.
Splade on MSMARCO
$hf$ | $q_c$ | Time [ $\mu s$ ] | Accuracy@10 |
---|---|---|---|
0.9 | 3 | 187 | 90.49 |
0.9 | 4 | 206 | 92.27 |
0.9 | 4 | 206 | 92.27 |
0.9 | 5 | 222 | 93.13 |
0.9 | 8 | 269 | 94.07 |
0.8 | 5 | 303 | 95.69 |
0.8 | 6 | 348 | 96.11 |
0.7 | 6 | 531 | 97.17 |
E-Splade on MSMARCO
$hf$ | $q_c$ | Time [ $\mu s $] | Accuracy@10 |
---|---|---|---|
1 | 3 | 222 | 90.99 |
1 | 4 | 222 | 90.99 |
1 | 4 | 239 | 93.26 |
1 | 4 | 239 | 93.26 |
1 | 6 | 256 | 94.17 |
0.9 | 4 | 376 | 95.95 |
0.9 | 5 | 383 | 96.53 |
0.8 | 5 | 581 | 97.47 |
Unicoil-T5 on MSMARCO
$hf$ | $q_c$ | Time [ $\mu s$ ] | Accuracy@10 |
---|---|---|---|
1 | 3 | 115 | 90.04 |
1 | 4 | 123 | 91.33 |
1 | 6 | 133 | 92.07 |
0.9 | 3 | 168 | 94.03 |
0.9 | 3 | 168 | 94.03 |
0.9 | 4 | 180 | 95.13 |
0.8 | 4 | 268 | 96.76 |
0.8 | 5 | 280 | 97.19 |
NQ on MSMARCO
$hf$ | $q_c$ | Time [ $\mu s$ ] | Accuracy@10 |
---|---|---|---|
1 | 3 | 195 | 91.25 |
1 | 4 | 195 | 91.25 |
1 | 6 | 211 | 92.23 |
0.9 | 3 | 240 | 93.24 |
0.9 | 3 | 266 | 95.17 |
0.9 | 4 | 266 | 95.17 |
0.8 | 4 | 286 | 96.26 |
0.8 | 5 | 362 | 97.18 |
We provide a script to explore the search parameters given a trained index; the script is scripts/grid_search_only_accuracy.sh
. You can use it as follows:
index_path=""
results_file_path=""
queries_path=""
gt_path=""
bash scripts/grid_search_only_accuracy.sh $index_path $results_file_path $queries_path $gt_path
The script writes the result of the grid search in results_file_path
. To get the fastest configuration at each accuracy cut, simply run
python scripts/extract_results.py --file-path $results_file_path
Python Scripts
Download the Datasets
The embeddings in jsonl
format used in our experiments can be downloaded from this HugginFace repository, together with the queries representations.
As an example, the Splade embeddings for MsMarco can be downloaded and extracted by runnning the following commands.
wget https://huggingface.co/datasets/tuskanny/seismic-msmarco-splade/resolve/main/documents.tar.gz?download=true -O documents.tar.gz
tar -xvzf documents.tar.gz
or by using the Huggingface dataset download tool.
Convert the Data
Documents and queries should have the following format. Each line should be a JSON-formatted string with the following fields:
id
: must represent the ID of the document as an integer.content
: the original content of the document, as a string. This field is optional.vector
: a dictionary where each key represents a token, and its corresponding value is the score, e.g.,{"dog": 2.45}
.
This is the standard output format of several libraries to train sparse models, such as learned-sparse-retrieval
.
The script convert_json_to_inner_format.py
allows converting files formatted accordingly into the seismic
inner format.
python scripts/convert_json_to_inner_format.py --document-path /path/to/document.jsonl --queries-path /path/to/queries.jsonl --output-dir /path/to/output
This will generate a data
directory at the /path/to/output
path, with documents.bin
and queries.bin
binary files inside.
If you download the NQ dataset from the HuggingFace repo, you need to specify --input-format nq
as it uses a slightly different format.
Using the Rust Code
To incorporate the Seismic library into your Rust project, navigate to your project directory and run the following Cargo command:
cargo add seismic
This command adds the Seismic library to your project.
Creating a Toy Dataset
Let's create a toy dataset comprising vectors with f32
values. Next, we'll convert this dataset to use half-precision floating points (half::f16
). Finally, we'll check the number of vectors, the dimensionality, and the number of non-zero components of the dataset.
use seismic::SparseDataset;
use half::f16;
let data = vec![
(vec![0, 2, 4], vec![1.0, 2.0, 3.0]),
(vec![1, 3], vec![4.0, 5.0]),
(vec![0, 1, 2, 3], vec![1.0, 2.0, 3.0, 4.0])
];
let dataset: SparseDataset<f16> = data.into_iter().collect::<SparseDataset<f32>>().into();;
assert_eq!(dataset.len(), 3); // Number of vectors
assert_eq!(dataset.dim(), 5); // Number of components
assert_eq!(dataset.nnz(), 9); // Number of non zero components
The following code shows how to read a dataset in the internal binary format with f32
values and quantize those values to f16
.
let dataset = SparseDataset::<f32>::read_bin_file(&input_filename).unwrap().quantize_f16();
Building and Querying an Index
Let's build an index using the above toy dataset and search for a query.
use seismic::inverted_index::{Configuration,InvertedIndex};
use seismic::SparseDataset;
use half::f16;
let data = vec![
(vec![0, 2, 4], vec![1.0, 2.0, 3.0]),
(vec![1, 3], vec![4.0, 5.0]),
(vec![0, 1, 2, 3], vec![1.0, 2.0, 3.0, 4.0])
];
let dataset: SparseDataset<f16> = data.into_iter().collect::<SparseDataset<f32>>().into();;
let inverted_index = InvertedIndex::build(dataset, Configuration::default());
let result = inverted_index.search(&vec![0, 1], &vec![1.0, 2.0], 1, 5, 0.7);
assert_eq!(result[0].0, 8.0);
assert_eq!(result[0].1, 1);
There are building configuration parameters to experiment with. Take a look at build_inverted_index.rs code for an example.
The most important ones are
n_postings
inPruningStrategy::GlobalThreshold
: Regulates the size of the posting list, representing the average number of postings stored per posting list.summary_energy
inSummarizationStrategy::EnergyPerserving
: Controls the size of the summaries, preserving a fraction of the overall energy for each summary.centroid_fraction
inBlockingStrategy::RandomKmeans
: Determines the number of centroids built for each posting list, capped at a fraction of the posting list length.
Refer to Seismic parameters for recommended values for different datasets.
Take a look at build_inverted_index.rs and perf_inverted_index.rs for examples to serialize/deserialize an index on a file.
The signature of the search
method is
pub fn search(
&self,
query_components: &[u16],
query_values: &[f32],
k: usize,
query_cut: usize,
heap_factor: f32,
) -> Vec<(f32, usize)>
It accepts a sparse vector for the query (query_components
and query_values
), k
for top-k
results, and parameters query_cut
and heap_factor
for trade-off between accuracy and query time.
query_cut
: The search algorithm considers only the topquery_cut
components of the query.heap_factor
: The search algorithm skips a block whose estimated dot product is greater thanheap_factor
times the smallest dot product of the top-k results in the current heap.
Refer to Seismic parameters for their influence on recall and query time on the different datasets.
Using the Python Interface
We have also included a Python interface for convenience.
It is fairly straightforward to build the Python interface with maturin
as follows:
pip install maturin
RUSTFLAGS="-C target-cpu=native" maturin build -r
for whl in target/wheels/*.whl; do pip3 install $whl; done
Confirm that the installation was successful by importing seismic
,
building an index, and querying it:
from seismic import PySeismicIndex
# We assume that the sparse dataset is in the internal format.
index = PySeismicIndex.build(
input_file,
n_postings=3500,
centroid_fraction=0.1,
truncated_kmeans_training=False,
truncation_size=16,
min_cluster_size=2,
summary_energy=0.4)
# You can serialize and store the index in a file.
index.save(index_path)
# You may later load the index to query it.
index = PySeismicIndex.load(index_path)
# Search can be done either for a single query.
results: List[Tuple[float, int]] = index.search(
query_components=np.array([...], dtype=np.int32),
query_values=np.array([...], dtype=np.float32),
k, query_cut, heap_factor)
# You may also (concurrently) search the index with a batch of
# queries. Assuming the queries are stored in the internal format,
# you can invoke the following function:
results: List[List[Tuple[float, int]]] = index.batch_search(
query_path, k, query_cut, heap_factor, num_threads)
Bibliography
- Sebastian Bruch, Franco Maria Nardini, Cosimo Rulli, Rossano Venturini. "Efficient Inverted Indexes for Approximate Retrieval over Learned Sparse Representations." In ACM SIGIR. 2024.
Citation License
The source code in this repository is subject to the following citation license:
By downloading and using this software, you agree to cite the under-noted paper in any kind of material you produce where it was used to conduct a search or experimentation, whether be it a research paper, dissertation, article, poster, presentation, or documentation. By using this software, you have agreed to the citation license.
@inproceedings{Seismic,
author = {Sebastian Bruch and Franco Maria Nardini and Cosimo Rulli and Rossano Venturini},
title = {Efficient Inverted Indexes for Approximate Retrieval over Learned Sparse Representations},
booktitle = {The 47th International ACM SIGIR Conference on Research and Development in Information Retrieval ({SIGIR})},
publisher = {ACM},
year = {2024}
}
Dependencies
~11–19MB
~273K SLoC