bin+lib dawnsearch

An open source distributed web search engine that searches by meaning

3 releases (breaking)

0.2.0 Aug 6, 2023
0.1.0 Aug 6, 2023
0.0.0 Aug 5, 2023

#2 in #crawl

50 downloads per month

AGPL-3.0-or-later

160KB
3.5K SLoC

DawnSearch

DawnSearch is an open source distributed web search engine that searches by meaning rather than by exact keywords. It embeds text with the all-MiniLM-L6-v2 sentence-embedding model, uses USearch for vector search, and can index the Common Crawl data. DawnSearch is written in Rust.
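
To make "searching by meaning" concrete: documents and queries are mapped to vectors, and the documents whose vectors lie closest to the query vector are returned. The following is a minimal brute-force sketch of that idea, not DawnSearch's implementation. It assumes tiny hand-written vectors in place of real all-MiniLM-L6-v2 embeddings and a linear scan in place of USearch; the function names `cosine_similarity` and `top_k` are illustrative only.

```rust
/// Cosine similarity between two equal-length vectors.
/// Returns 0.0 if either vector has zero norm.
fn cosine_similarity(a: &[f32], b: &[f32]) -> f32 {
    assert_eq!(a.len(), b.len(), "vectors must have the same dimension");
    let dot: f32 = a.iter().zip(b).map(|(x, y)| x * y).sum();
    let norm_a: f32 = a.iter().map(|x| x * x).sum::<f32>().sqrt();
    let norm_b: f32 = b.iter().map(|x| x * x).sum::<f32>().sqrt();
    if norm_a == 0.0 || norm_b == 0.0 {
        return 0.0;
    }
    dot / (norm_a * norm_b)
}

/// Returns the indices of the `k` documents most similar to `query`,
/// best match first. This is a linear scan over all documents; a real
/// vector index avoids comparing against every document.
fn top_k(query: &[f32], docs: &[Vec<f32>], k: usize) -> Vec<usize> {
    let mut scored: Vec<(usize, f32)> = docs
        .iter()
        .enumerate()
        .map(|(i, d)| (i, cosine_similarity(query, d)))
        .collect();
    // Sort by similarity, highest first.
    scored.sort_by(|a, b| b.1.total_cmp(&a.1));
    scored.into_iter().take(k).map(|(i, _)| i).collect()
}
```

Calling `top_k` with a query vector and a slice of document vectors yields document indices ordered from most to least similar.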

A public instance is available at dawnsearch.org.

Project Status

DawnSearch currently functions as a distributed semantic vector search. When you start an instance, it registers with the tracker. The instance can then participate in the network by searching. Optionally, it can also index the Common Crawl dataset and answer queries.

Main items still to do:

  1. Better error handling. There are still many .unwrap() calls in the code.
  2. Robustness against malfunctioning or malicious instances.
  3. Packet encryption to prevent eavesdropping.
  4. Distribution of indexed pages to semantically close instances to increase search efficiency. Currently, searches are sent to all instances.
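
Since searches currently go to all instances, the querying side has to combine several result lists into one. The sketch below shows one plausible way to do that merge. It is an assumption for illustration, not DawnSearch's actual protocol: the `Hit` struct, its fields, and the rule "smaller distance is better" are all hypothetical.

```rust
use std::collections::HashSet;

/// A single search hit as an instance might report it (hypothetical shape).
#[derive(Debug, Clone, PartialEq)]
struct Hit {
    url: String,
    distance: f32, // assumed: smaller means semantically closer
}

/// Merges per-instance result lists into one global top-`k` list,
/// keeping only the best (smallest-distance) hit for each URL.
fn merge_results(per_instance: Vec<Vec<Hit>>, k: usize) -> Vec<Hit> {
    // Flatten all instance responses into one list.
    let mut all: Vec<Hit> = per_instance.into_iter().flatten().collect();
    // Closest hits first.
    all.sort_by(|a, b| a.distance.total_cmp(&b.distance));
    // After sorting, the first occurrence of a URL is its best hit.
    let mut seen = HashSet::new();
    all.retain(|h| seen.insert(h.url.clone()));
    all.truncate(k);
    all
}
```

The cost of this approach grows with the number of instances, which is why routing queries only to semantically close instances is listed as future work.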

Quick start

This will build and run an 'access terminal' DawnSearch instance on a recent Ubuntu, without GPU acceleration. See Modes for examples of other configurations.

sudo apt-get update && sudo apt-get install -y build-essential libssl-dev pkg-config python3-pip

# Install rust if you don't have it already:
curl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs | sh

pip3 install torch==2.0.0 --index-url https://download.pytorch.org/whl/cpu

Now we need to make sure the build system can find PyTorch. We search for the package:

pip3 show torch

This prints the following:

Name: torch
Version: 2.0.0
Summary: Tensors and Dynamic neural networks in Python with strong GPU acceleration
Home-page: https://pytorch.org/
Author: PyTorch Team
Author-email: packages@pytorch.org
License: BSD-3
Location: /home/ubuntu/.local/lib/python3.10/site-packages
Requires: filelock, jinja2, networkx, sympy, typing-extensions
Required-by: 

Add the following two lines to your .bashrc, using the path shown under 'Location' with '/torch' appended:

export LIBTORCH=/home/ubuntu/.local/lib/python3.10/site-packages/torch
export LD_LIBRARY_PATH=${LIBTORCH}/lib:$LD_LIBRARY_PATH
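
The manual step of copying 'Location' and appending '/torch' can also be expressed as a small helper. This is only an illustrative sketch: the function name `libtorch_from_pip_show` is hypothetical and not part of DawnSearch, and it assumes the 'pip3 show torch' output format shown above.

```rust
/// Extracts the `Location:` value from `pip3 show torch` output and
/// appends `/torch`, giving the value to use for LIBTORCH.
/// Returns `None` if the output has no `Location:` line.
fn libtorch_from_pip_show(output: &str) -> Option<String> {
    output
        .lines()
        .find_map(|line| line.strip_prefix("Location:"))
        .map(|path| format!("{}/torch", path.trim().trim_end_matches('/')))
}
```

For the output shown above, this yields /home/ubuntu/.local/lib/python3.10/site-packages/torch, which matches the LIBTORCH value exported in .bashrc.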

We can now load the new environment variables and build:

source ~/.bashrc
mv DawnSearch.toml.example DawnSearch.toml
cargo run --release

Now, go to http://localhost:8080 to access your own DawnSearch instance. You will be able to perform searches, but you will not contribute to the network yet. Take a look at Modes to see how you can do so.

If you want to upgrade to GPU acceleration, try this:

pip3 install torch==2.0.0
cargo clean
cargo run --release

Alternatively, follow the steps as documented for the tch crate.

Note that on an M1/M2 Mac, 'cargo install' does NOT work. 'cargo build' does though!

Feel free to open an issue if you encounter problems!

Configuration

You can configure DawnSearch through DawnSearch.toml or through environment variables like DAWNSEARCH_INDEX_CC.
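
As a rough illustration of how a file setting and an environment variable can be combined, here is a minimal sketch. It rests on several assumptions that the text above does not confirm: that DAWNSEARCH_INDEX_CC is a boolean, that the environment variable takes precedence over DawnSearch.toml, and that "true"/"1" and "false"/"0" are the accepted spellings. The function names are hypothetical.

```rust
/// Resolves a boolean setting: an environment value, if present and
/// recognized, overrides the value from the config file (assumed precedence).
fn resolve_bool(env_value: Option<&str>, file_value: bool) -> bool {
    match env_value.map(|v| v.trim().to_ascii_lowercase()) {
        Some(v) if v == "true" || v == "1" => true,
        Some(v) if v == "false" || v == "0" => false,
        // Unset or unrecognized: fall back to the file value.
        _ => file_value,
    }
}

/// Reads DAWNSEARCH_INDEX_CC from the process environment, falling back to
/// the value loaded from DawnSearch.toml (passed in here as `file_value`).
fn index_cc_enabled(file_value: bool) -> bool {
    let env = std::env::var("DAWNSEARCH_INDEX_CC").ok();
    resolve_bool(env.as_deref(), file_value)
}
```

Keeping the parsing logic in `resolve_bool` separate from the environment lookup makes the precedence rule easy to test without touching the process environment.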

Contributing

Please open issues, or create pull requests! Please open an issue before you start working on a big enhancement or refactor.
