#full-text-search #text-search #cloudflare-workers #search #bitmap #text #full

bin+lib edgesearch

Serverless full-text search with Cloudflare Workers, WebAssembly, and Roaring Bitmaps

15 releases (4 breaking)

0.4.1 Jan 19, 2021
0.4.0 Jun 12, 2020
0.3.2 Jun 11, 2020
0.2.0 May 24, 2020
0.0.2 Feb 29, 2020

#8 in #full

MIT license

780KB
15K SLoC

C 14K SLoC // 0.2% comments Rust 639 SLoC // 0.0% comments JavaScript 407 SLoC // 0.1% comments

Edgesearch

Build a full text search API using Cloudflare Workers and WebAssembly.

Features

Demos

Check out the demo folder for live demos with source code.

How it works

Edgesearch builds a reverse index by mapping terms to a compressed bit set (using Roaring Bitmaps) of IDs of documents containing the term, and creates a custom worker script and data to upload to Cloudflare Workers.

Data

An array of term-documents pairs sorted by term is built, where term is a string and documents is a compressed bit set.

This array is then split into chunks of up to 25 MiB, as each Cloudflare Workers KV entry can hold a value up to 25 MiB in size.

To find the documents bit set associated with a term, a binary search is done to find the appropriate chunk, and then the pair within the chunk.

The same structure and process is used to store and retrieve document contents.

Packing multiple bit sets/documents reduces read/write costs and deploy times, and improves caching and execution speed due to fewer fetches.

Searching

Search terms have an associated mode. There are three modes that match documents in different ways:

Mode Document match condition
Require Has all terms with this mode.
Contain Has at least one term with this mode.
Exclude Has none of the terms with this mode.

For example, a document with terms a, b, c, d, and e would match the query require (d, a) contain (g, b, f) exclude (h, i).

The results are generated by doing bitwise operations across multiple bit sets. The general computation could be summarised as:

result = (req_a & req_b & req_c & ...) & (con_a | con_b | con_c | ...) & ~(exc_a | exc_b | exc_c | ...)

Cloudflare

There are some nice advantages when only using Cloudflare Workers:

  • Faster than a VM or container with less cold starts, as code is run on a V8 Isolate.
  • Naturally distributed to the edge for very low latency.
  • Takes advantage of Cloudflare for SSL, caching, and distribution.
  • No need to worry about scaling, networking, or servers.

WebAssembly

The C implementation of Roaring Bitmaps is compiled to WebAssembly. A basic implementation of essential C standard library functionality is implemented to make compilation possible.

Usage

Get the CLI

Precompiled binaries are available for x86-64:

Linux | macOS | Windows

Build the worker

The data needs to be formatted into two files:

  • Documents: contents of all documents, delimited by NULL (ASCII 0), including at the end.
  • Document terms: terms for each corresponding document. Each term and document must end with NULL (ASCII 0).

This format allows for simple reading and writing without libraries, parsers, or loading all the data into memory. Terms are separate from documents for easy switching between or testing of different documents-terms mappings.

The relation between a document's terms and content is irrelevant to Edgesearch and terms do not necessarily have to be words from the document.

A document must be a JSON serialised value, such as "hello", 123, or {"prop1": 1, "prop2": {}}.

For example:

File Contents
documents {"title":"Stupid Love","artist":"Lady Gaga","year":2020} \0
{"title":"Don't Start Now","artist":"Dua Lipa","year":2020} \0
...
document-terms title_stupid \0 title_love \0 artist_lady \0 artist_gaga \0 year_2020 \0 \0
title_dont \0 title_start \0 title_now \0 artist_dua \0 artist_lipa \0 year_2020 \0 \0
...

An folder needs to be provided for Edgesearch to write temporary and built code and data files. It's advised to provide a folder for the exclusive use of Edgesearch with no other contents.

edgesearch \
  --documents documents \
  --document-terms document-terms \
  --maximum-query-results 20 \
  --output-dir /path/to/edgesearch/build/output/dir/

Deploy the worker

edgesearch-deploy-cloudflare handles deploying to Cloudflare.

This will upload the worker script and associated WASM to Cloudflare Workers, and write every key to Cloudflare Workers KV:

npx edgesearch-deploy-cloudflare \
  --account-id CF_ACCOUNT_ID \
  --account-email me@email.com \
  --global-api-key CF_GLOBAL_API_KEY \
  --name my-edgesearch \
  --output-dir /path/to/edgesearch/build/output/dir/ \
  --namespace CF_KV_NAMESPACE_ID \
  --upload-data

Testing locally

edgesearch-test-server loads a built worker to run locally.

This will create a local server on port 8080:

npx edgesearch-test-server \
  --output-dir /path/to/edgesearch/build/output/dir/ \
  --port 8080

The client can be used with a local test server; provide the origin (e.g. http://localhost:8080) to the constructor (see below).

Calling the API

A JavaScript client for the browser and Node.js is available for using a deployed Edgesearch worker:

import * as Edgesearch from 'edgesearch-client';

type Document = {
  title: string;
  artist: string;
  year: number;
};

const client = new Edgesearch.Client<Document>('https://my-edgesearch.me.workers.dev');
const query = new Edgesearch.Query();
query.add(Edgesearch.Mode.REQUIRE, 'world');
query.add(Edgesearch.Mode.CONTAIN, 'hello', 'welcome', 'greetings');
query.add(Edgesearch.Mode.EXCLUDE, 'bye', 'goodbye');
let response = await client.search(query);
query.setContinuation(response.continuation);
response = await client.search(query);

Performance

Searches that retrieve entries not cached at edge locations will be slow. To reduce cache misses, ensure that there is consistent traffic.

Dependencies

~4.5–7.5MB
~136K SLoC