#full-text-search #search-engine #search #text-search #read-write

tantivy-object-store

A tantivy Directory implementation against object stores (S3, GCS, etc.)

1 unstable release

0.1.0 Aug 14, 2023

#1185 in Filesystem

Apache-2.0

27KB
456 lines

Tantivy Object Store

This repo contains an implementation of a tantivy::directory::Directory using an object_store::ObjectStore.

This implementation supports both read and write, but does not support locking or file watch. The index building process is responsible for making sure that there are no concurrent index writers.

A few notable behavior differences from tantivy's directory implementations:

Versioning

Tantivy uses a file called meta.json which is a list of all the files that make up the index, effectively keeping track of a snapshot of the index. However, vanilla tantivy doesn't support versioning, meaning every time we update the index, meta.json is overwritten. This PR allows the caller to set a read_version and write_version. These version numbers are appended to the end of the file name when caller attempts to atomic_read or atomic_write.

Copy on Write (CoW)

When creating a ObjectStoreDirectory, user may set read_version and write_version. read_version is used when user calls atomic_read. Instead of reading meta.json, we will try to read meta.json.{read_version}. Same when user calls atomic_write, we will try to write meta.json.{write_version}

NOTE: The write_version take precedence over read_version. This means, after first write, atomic_read will read from meta.json.{write_version} NOT meta.json.{read_version}. This is needed because tantivy modifies meta.json file quite a few times during indexing, the CoW impl here needs have read-after-write consistency.

Index Reloading

This implementation does not support reloading. If a watch callback is registered, the callback will never be called. User needs to handle reloading via other mechanisms for now.

It maybe possible to use something like object store's native change notification to trigger reload, but that's for future work.

Deletion

Since tantivy attempts to garbage collect and merge index files during indexing, we had to change delete operation to noop. This is because we don't want tantivy to garbage collect files from past versions, as those files maybe in use by other readers. We will implement a garbage collection processes separately.

Threading

This implementation contains a tokio::Runtime for running the IO jobs. This means, when calling functions from this implementation from inside another tokio runtime the caller should always use tokio::task::spawn_blocking so the task can be scheduled on a thread without tokio runtime. (This is needed because nesting tokio runtimes causes panic)

Concurrency Safety

This implementation is concurrency safe within a single instance, as the atomic_read|write mechanism is a lock object in the returned trait object.

Performance

TBD

Dependencies

~29–67MB
~1M SLoC