Purlu

Purlu is a full-text search engine.

Introduction

Purlu is designed for collections with a relatively small number of documents (let's say fewer than five thousand).

What's more, it's designed to run on relatively low-resource machines (like an office PC with 4 gigabytes of RAM and a Pentium that's been turned into a nice server).

However, the idea is to offer plenty of cool features. Purlu isn't simple, but it tries to be light.

Purlu can be used via its CLI, which exposes its search features through a JSON HTTP API (there is a container image in the packages of this repository).

Or by embedding its (Rust) library directly into a project, as you might do with SQLite.

Text analysis

Purlu works by matching query terms with document terms.

So, for example, the query bunny may match with the document A very cute bunny!.

But how does Purlu obtain these terms? Thanks to the text analyzer, which is applied to both queries and documents.

For example, if we run the analyzer on this text:

"Be gay, do crime!"

We may get back (depending on the analyzer's configuration):

["be", "gay", "do", "crime"]

Purlu's analyzer can optionally:

  • tokenize the text, i.e. segment it into words;
  • lowercase the text;
  • stem the text (stemming also lowercases it).

Disabling all these features can be useful. For example, we may want to store external identifiers as is.

I think the Snowball project's explanation in What is Stemming? is great, so I'll just quote it here:

Stemming maps different forms of the same word to a common "stem" - for example, the English stemmer maps connection, connections, connective, connected, and connecting to connect. So a search for connected would also find documents which only have the other forms.

This stem form is often a word itself, but this is not always the case as this is not a requirement for text search systems, which are the intended field of use. We also aim to conflate words with the same meaning, rather than all words with a common linguistic root (so awe and awful don't have the same stem), and over-stemming is more problematic than under-stemming so we tend not to stem in cases that are hard to resolve. If you want to always reduce words to a root form and/or get a root form which is itself a word then Snowball's stemming algorithms likely aren't the right answer.
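
To make this more concrete, here is a minimal sketch of such an analysis pipeline in Rust, with all three features enabled. It uses the rust-stemmers crate (an implementation of the Snowball algorithms) purely for illustration; this is not Purlu's actual code, and Purlu's internals may differ:

use rust_stemmers::{Algorithm, Stemmer};

fn analyze(text: &str) -> Vec<String> {
    let stemmer = Stemmer::create(Algorithm::English);
    text.split(|c: char| !c.is_alphanumeric()) // tokenize: segment into words
        .filter(|token| !token.is_empty())
        .map(|token| token.to_lowercase()) // lowercase
        .map(|token| stemmer.stem(&token).into_owned()) // stem: "connections" -> "connect"
        .collect()
}

fn main() {
    // Prints ["be", "gay", "do", "crime"]
    println!("{:?}", analyze("Be gay, do crime!"));
    // Prints ["connect", "connect", "connect"]
    println!("{:?}", analyze("connection connected connecting"));
}

Because the stemmer conflates inflected forms, the analyzed query terms and document terms line up even when the surface forms differ.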

Scoring

Purlu assigns a relevance score to documents that match a query, and then sorts the results by this score in descending order.

It uses the BM25F ranking function, as described in the article Integrating the Probabilistic Model BM25/BM25F into Lucene.

It's actually an extension of BM25 that supports documents with multiple fields (like a title and a description, for example).

This ranking function is quite cool:

  • The more frequent a query term is in a document, the higher its score.
  • But not too much either: this increase is not linear, to avoid a document with a term repeated many times becoming too important.
  • The frequency of terms in the entire collection is taken into account. If a term is rare in the collection, its importance in the score will be higher, and if a term is frequent in the collection, its importance in the score will be lower.
  • Document length is also taken into account. So, if a document is long, the importance of its terms in the score will be lower. This compensates for the fact that a long document is likely to have more terms that match the query.

Thanks to its multi-field support, Purlu can take weights to be assigned to each field (known as boosts) as parameters for queries.

These weights can be used to give more importance to a title than a description, for example.
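
For intuition, here is a rough Rust sketch of the BM25F contribution of a single query term, following the formulation in that article. The constants k1 = 1.2 and b = 0.75 are common defaults and, like the use of a single b shared by all fields, are assumptions of this sketch rather than Purlu's actual parameters:

/// Per-field statistics for a single query term in a single document.
struct FieldStats {
    term_freq: f64,     // occurrences of the term in this field
    field_len: f64,     // number of terms in this field
    avg_field_len: f64, // average length of this field across the collection
    boost: f64,         // per-field weight supplied with the query
}

/// Contribution of one query term to one document's score.
fn bm25f_term_score(fields: &[FieldStats], total_docs: f64, docs_with_term: f64) -> f64 {
    let k1 = 1.2; // saturation: repeating a term has diminishing returns
    let b = 0.75; // strength of the length normalization

    // Combine per-field term frequencies, normalized by field length and boosted.
    let weight: f64 = fields
        .iter()
        .map(|f| f.boost * f.term_freq / ((1.0 - b) + b * f.field_len / f.avg_field_len))
        .sum();

    // Inverse document frequency: rare terms count more, frequent terms count less.
    let idf = ((total_docs - docs_with_term + 0.5) / (docs_with_term + 0.5)).ln();

    // Summing this value over all query terms gives the document's score.
    idf * weight / (k1 + weight)
}

fn main() {
    // A document where the term appears twice in the title and once in the description,
    // with the title boosted, in a collection of 1000 documents where 10 contain the term.
    let fields = [
        FieldStats { term_freq: 2.0, field_len: 4.0, avg_field_len: 5.0, boost: 2.0 },
        FieldStats { term_freq: 1.0, field_len: 30.0, avg_field_len: 40.0, boost: 1.0 },
    ];
    println!("{}", bm25f_term_score(&fields, 1000.0, 10.0));
}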

Indexes

Purlu indexes are immutable, stored in memory and not persisted on disk.

This means that if you want to add a document to an index, or delete one, you'll have to recreate the index.

It also means that if Purlu is restarted, the indexes will disappear.

This is possible because Purlu is designed for small datasets.

In this case, it may be desirable anyway to re-index each time you wish to synchronize a collection of documents (stored in PostgreSQL, for example) with Purlu.

This can help avoid bugs where, for example, a change has been missed and days go by before a re-indexation takes place and the change is finally taken into account.

Another important thing is that Purlu indexes need a schema.

This schema actually consists of a list of fields. For each one, you can specify:

  • Whether it will be indexed or not. If it is indexed, its terms will be used in the search.
  • Whether it will be stored or not. If it is stored, its values will be returned with the results.

Queries

We already know that queries take a text as a parameter, and that this text will be analyzed.

We also know that queries can assign weights to fields.

Well, queries also support pagination using offset and limit.

The movies example

This repository contains a movies folder.

Inside, there is a movies.json file. It's the Movies Dataset from Meilisearch.

It weighs 19MB and contains 31968 movie descriptions.

There is also an index.py script. It will index the title and overview fields.

It expects a Purlu server to be available at localhost:1312.

Finally, there is a search.html document that lets you search the indexed dataset with a simple interface.

On my laptop (which is a potato), the query naruto finishes in less than a millisecond.
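
As an illustration, this is roughly what that query looks like when sent programmatically to the HTTP API described below. The index name movies is an assumption (it depends on how index.py named the index), and the snippet relies on the reqwest crate (with the blocking and json features) and serde_json, which are not part of this repository:

use serde_json::json;

fn main() -> Result<(), Box<dyn std::error::Error>> {
    // Search the (assumed) "movies" index on the server at localhost:1312.
    let body = json!({
        "query": "naruto",
        "analyzer": { "tokenize": true, "stemmer_language": "english" },
        "boosts": { "title": 2.0 },
        "offset": 0,
        "limit": 10
    });

    let response: serde_json::Value = reqwest::blocking::Client::new()
        .post("http://localhost:1312/indexes/movies/search")
        .json(&body)
        .send()?
        .json()?;

    // The response contains a count and a list of hits, as described below.
    println!("{} matching documents", response["count"]);
    Ok(())
}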

HTTP JSON API

analyzer objects

Schema fields and queries optionally take an analyzer object.

This object is used to configure text analysis.

It can contain three fields, all optional and disabled by default (the same applies if the object is not defined at all):

  • a tokenize boolean: if true, the text will be segmented into words;
  • a lowercase boolean: if true, the text will be lowercased;
  • a stemmer_language string: if defined, the text will be lowercased (overriding the lowercase boolean) and stemmed.

Here is the list of available stemmer_language values:

  • arabic
  • danish
  • dutch
  • english
  • french
  • german
  • greek
  • hungarian
  • italian
  • norwegian
  • portuguese
  • romanian
  • russian
  • spanish
  • swedish
  • tamil
  • turkish

Here is an example analyzer object:

{
    "lowercase": true
}

POST /indexes/:index_id

This route will create an index, and if an index with the same identifier already exists, replace it.

Purlu expects the request body to be a JSON object containing:

  • a schema, specifically a fields list, whose items are objects made up of:
    • a name string;
    • optionally an analyzer object;
    • optionally an indexed boolean, which by default is false;
    • optionally a stored boolean, which by default is false;
  • and a documents list.

Here is an example request:

{
    "fields": [
        {
            "name": "id",
            "stored": true
        },
        {
            "name": "title",
            "analyzer": {
                "tokenize": true,
                "stemmer_language": "english"
            },
            "indexed": true,
            "stored": true
        },
        {
            "name": "description",
            "analyzer": {
                "tokenize": true,
                "stemmer_language": "english"
            },
            "indexed": true
        }
    ],
    "documents": [
        {
            "id": "12",
            "title": "On cute rabbits",
            "description": "Cute rabbits are so cute!"
        }
    ]
}

There is no reserved field name.

You may omit any field in the documents.

If a document has fields with a name not declared in the schema (the fields list), these fields will be ignored.

Document fields can only be strings. If you wish to store a number, for example an identifier from your main database, you'll need to send it as a string.

DELETE /indexes/:index_id

This route will delete an index if it exists and do nothing otherwise.

POST /indexes/:index_id/search

This route will search an index.

Purlu expects the request body to be a JSON object containing:

  • a query string;
  • optionally an analyzer object;
  • optionally a boosts object mapping field names (those not declared in the schema will be ignored) to a weight (by default 1.0);
  • and optionally an offset (by default 0) and a limit (by default no limit).

Here is an example request:

{
    "query": "Where are all the cute rabbits?",
    "analyzer": {
        "tokenize": true,
        "stemmer_language": "english"
    },
    "boosts": {
        "title": 2.0
    }
}

And an example response:

{
    "count": 1,
    "hits": [
        {
            "score": 1.23456789,
            "values": {
                "id": "12",
                "title": "On cute rabbits"
            }
        }
    ]
}

Only stored fields will be present in values objects.
