#vector-database #nlp #deep-learning #machine-learning #transformer

valentinus

Next generation vector database built with LMDB bindings

11 unstable releases (3 breaking)

0.4.1 Aug 6, 2024
0.4.0 Aug 3, 2024
0.3.2 Aug 2, 2024
0.3.1 Jul 30, 2024
0.1.2 Jul 13, 2024

#260 in Machine learning


391 downloads per month

Apache-2.0

140KB
997 lines


dependencies

  • bincode/serde - serialize/deserialize
  • lmdb-rs - database bindings
  • ndarray - numpy equivalent
  • ort/onnx - embeddings

getting started

git clone https://github.com/kn0sys/valentinus && cd valentinus

optional environment variables

var | usage | default
LMDB_USER | working directory of the user for database | $USER
LMDB_MAP_SIZE | sets max environment size, i.e. size in memory/disk of all data | 20% of available memory
ONNX_PARALLEL_THREADS | parallel execution mode for this session | 1
VALENTINUS_CUSTOM_DIM | embeddings dimensions for custom models | all-mini-lm-6 -> 384
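
The optional variables listed above follow a common "read, parse, fall back to a default" pattern. The following is a minimal sketch of that pattern using only the Rust standard library. It is not the crate's own code: `env_or`, `Settings`, and `load_settings` are hypothetical names introduced for illustration, and LMDB_MAP_SIZE is left out because its default depends on available memory, which the standard library cannot query.

```rust
use std::env;

/// Read an environment variable and parse it, falling back to `default`
/// when the variable is unset or cannot be parsed.
/// (Hypothetical helper, not part of valentinus.)
fn env_or<T: std::str::FromStr>(key: &str, default: T) -> T {
    env::var(key)
        .ok()
        .and_then(|raw| raw.trim().parse::<T>().ok())
        .unwrap_or(default)
}

/// Hypothetical settings struct mirroring three of the documented variables.
#[derive(Debug, PartialEq)]
struct Settings {
    lmdb_user: String,
    onnx_parallel_threads: usize,
    custom_dim: usize,
}

/// Collect the settings, applying the documented defaults.
fn load_settings() -> Settings {
    Settings {
        // LMDB_USER falls back to $USER; empty string if neither is set.
        lmdb_user: env::var("LMDB_USER")
            .or_else(|_| env::var("USER"))
            .unwrap_or_default(),
        // Documented default: 1.
        onnx_parallel_threads: env_or("ONNX_PARALLEL_THREADS", 1),
        // Documented default for all-mini-lm-6: 384.
        custom_dim: env_or("VALENTINUS_CUSTOM_DIM", 384),
    }
}
```

Calling `load_settings()` with none of the variables set would yield one thread and a dimension of 384. Treating an unparsable value as "unset" is a design choice of this sketch; the crate may handle that case differently.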

tests

  • Note: all tests currently require the all-Mini-LM-L6-v2_onnx directory
  • Get model.onnx and tokenizer.json from Hugging Face, or build them yourself
mkdir all-Mini-LM-L6-v2_onnx
cd all-Mini-LM-L6-v2_onnx && wget https://huggingface.co/sentence-transformers/all-MiniLM-L6-v2/resolve/main/config.json
wget https://huggingface.co/sentence-transformers/all-MiniLM-L6-v2/resolve/main/onnx/model.onnx
wget https://huggingface.co/sentence-transformers/all-MiniLM-L6-v2/resolve/main/special_tokens_map.json
wget https://huggingface.co/sentence-transformers/all-MiniLM-L6-v2/resolve/main/tokenizer_config.json
wget https://huggingface.co/sentence-transformers/all-MiniLM-L6-v2/resolve/main/tokenizer.json
wget https://huggingface.co/sentence-transformers/all-MiniLM-L6-v2/resolve/main/vocab.txt
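
Since the tests depend on the model directory, it can help to verify the download before running them. The following is a small sketch, assuming only the two files named in the note above (model.onnx and tokenizer.json) are strictly required. `REQUIRED_FILES` and `missing_model_files` are hypothetical names and not part of valentinus.

```rust
use std::path::{Path, PathBuf};

/// Files the note above names as required for the tests.
const REQUIRED_FILES: [&str; 2] = ["model.onnx", "tokenizer.json"];

/// Return the paths of required files that are missing from `model_dir`.
/// An empty vector means the directory looks ready for the tests.
fn missing_model_files(model_dir: &Path) -> Vec<PathBuf> {
    REQUIRED_FILES
        .iter()
        .map(|name| model_dir.join(name))
        .filter(|path| !path.is_file())
        .collect()
}
```

For example, `missing_model_files(Path::new("all-Mini-LM-L6-v2_onnx"))` returns an empty list once both files have been downloaded into that directory.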

examples

see examples

reference

inspired by this chromadb python tutorial
