Purlu
Purlu is a full-text search engine.
Introduction
Purlu is designed for collections with a relatively small number of documents (let's say fewer than five thousand).
What's more, it's designed to run on relatively low-resource machines (like an office PC with 4 gigabytes of RAM and a Pentium that's been turned into a nice server).
However, the idea is to offer plenty of cool features. Purlu isn't simple, but it tries to be light.
Purlu can be used via its CLI, which exposes its search features through a JSON HTTP API (there is a container image in the packages of this repository), or by embedding its Rust library directly into a project, as you might do with SQLite.
Text analysis
Purlu works by matching query terms with document terms.
So, for example, the query `bunny` may match with the document `A very cute bunny!`.
But how does Purlu obtain these terms? Thanks to the text analyzer, which is applied to both queries and documents.
For example, if we run the analyzer on this text:
`"Be gay, do crime!"`
we may get back (depending on the analyzer's configuration):
`["be", "gay", "do", "crime"]`
Purlu's analyzer can optionally:
- segment text into words using unicode-segmentation;
- lowercase text;
- and apply stemming using rust-stemmers.
Disabling all these features can be useful. For example, we may want to store external identifiers as is.
I think the explanation of the Snowball project on What is Stemming? is great, so I'll just quote it here:
Stemming maps different forms of the same word to a common "stem" - for example, the English stemmer maps connection, connections, connective, connected, and connecting to connect. So a search for connected would also find documents which only have the other forms.
This stem form is often a word itself, but this is not always the case as this is not a requirement for text search systems, which are the intended field of use. We also aim to conflate words with the same meaning, rather than all words with a common linguistic root (so awe and awful don't have the same stem), and over-stemming is more problematic than under-stemming so we tend not to stem in cases that are hard to resolve. If you want to always reduce words to a root form and/or get a root form which is itself a word then Snowball's stemming algorithms likely aren't the right answer.
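To make this concrete, here is a minimal Rust sketch of a segment/lowercase/stem pipeline, using the unicode-segmentation and rust-stemmers crates mentioned above. It only illustrates the idea and is not Purlu's actual implementation:

```rust
use rust_stemmers::{Algorithm, Stemmer};
use unicode_segmentation::UnicodeSegmentation;

/// Minimal analyzer sketch: segment into words, lowercase, stem.
fn analyze(text: &str) -> Vec<String> {
    let stemmer = Stemmer::create(Algorithm::English);
    text.unicode_words()                              // segmentation (unicode-segmentation)
        .map(|word| word.to_lowercase())              // lowercasing
        .map(|word| stemmer.stem(&word).into_owned()) // stemming (rust-stemmers)
        .collect()
}

fn main() {
    // Prints ["be", "gay", "do", "crime"], as in the example above.
    println!("{:?}", analyze("Be gay, do crime!"));
}
```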
Scoring
Purlu assigns a relevance score to documents that match a query, and then sorts the results by this score in descending order.
It uses the BM25F ranking function, as described in the article Integrating the Probabilistic Model BM25/BM25F into Lucene.
It's actually an extension of BM25 that supports documents with multiple fields (like a title and a description, for example).
This ranking function is quite cool:
- The more frequent a query term is in a document, the higher its score.
- But only up to a point: the increase is not linear, so a document that repeats a term many times doesn't become disproportionately important.
- The frequency of terms in the entire collection is taken into account. If a term is rare in the collection, its importance in the score will be higher, and if a term is frequent in the collection, its importance in the score will be lower.
- Document length is also taken into account. So, if a document is long, the importance of its terms in the score will be lower. This compensates for the fact that a long document is likely to have more terms that match the query.
Thanks to its multi-field support, Purlu can take weights to be assigned to each field (known as `boosts`) as query parameters.
These weights can be used to give more importance to a title than to a description, for example.
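To illustrate these properties, here is a sketch of the classic single-field BM25 score for one query term. BM25F generalizes this to multiple weighted fields as described in the article cited above; the constants and details below are illustrative, not Purlu's actual code:

```rust
/// Illustrative single-field BM25 score for one query term.
/// BM25F extends the same idea to several fields with boosts.
fn bm25_term_score(
    tf: f64,      // term frequency in the document
    df: f64,      // number of documents containing the term
    n_docs: f64,  // total number of documents in the collection
    doc_len: f64, // length of this document, in terms
    avg_len: f64, // average document length in the collection
) -> f64 {
    let k1 = 1.2; // how quickly repeated terms stop adding score
    let b = 0.75; // strength of the document-length normalization
    // Rare terms in the collection weigh more than frequent ones.
    let idf = ((n_docs - df + 0.5) / (df + 0.5) + 1.0).ln();
    // Long documents are penalized through this normalization factor.
    let norm = 1.0 - b + b * (doc_len / avg_len);
    // Grows with tf, but sub-linearly: it saturates towards k1 + 1.
    idf * (tf * (k1 + 1.0)) / (tf + k1 * norm)
}

fn main() {
    // A term appearing 3 times in a 100-term document, in a collection of
    // 1000 documents (50 of which contain the term, average length 80).
    println!("{}", bm25_term_score(3.0, 50.0, 1000.0, 100.0, 80.0));
}
```

A document's score for a query is then the sum of this quantity over the query terms, and the results are sorted by it in descending order.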
Indexes
Purlu indexes are immutable, stored in memory and not persisted on disk.
This means that if you want to add a document to an index, or delete one, you'll have to recreate the index.
It also means that if Purlu is restarted, the indexes will disappear.
This is possible because Purlu is designed for small datasets.
In this case, it may be desirable anyway to re-index each time you wish to synchronize a collection of documents (stored in PostgreSQL, for example) with Purlu, as shown in the sketch below.
This helps avoid bugs where, for example, a change is missed and days pass before a re-indexation takes place and the change is taken into account.
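To give an idea of what such a synchronization job could look like, here is a rough Rust sketch that rebuilds the whole index through the HTTP API documented below. The reqwest and serde_json crates, the articles index identifier and the documents themselves are just assumptions for the example:

```rust
use serde_json::json;

fn main() -> Result<(), Box<dyn std::error::Error>> {
    // In a real job these documents would be fetched from the primary
    // database (PostgreSQL, for example) just before re-indexing.
    let body = json!({
        "fields": [
            { "name": "id", "stored": true },
            {
                "name": "title",
                "analyzer": { "tokenize": true, "lowercase": true },
                "indexed": true,
                "stored": true
            }
        ],
        "documents": [
            { "id": "1", "title": "On cute rabbits" }
        ]
    });

    // POST /indexes/:index_id recreates the index wholesale, replacing
    // any existing index with the same identifier ("articles" is made up).
    reqwest::blocking::Client::new()
        .post("http://localhost:1312/indexes/articles")
        .json(&body)
        .send()?
        .error_for_status()?;
    Ok(())
}
```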
Another important thing is that Purlu indexes need a schema.
This schema actually consists of a list of fields. For each one, you can specify:
- whether it will be indexed: if so, its terms will be used in the search;
- and whether it will be stored: if so, its values will be returned with the results.
Queries
We already know that queries take a text parameter that will be analyzed.
We also know that queries can assign weights to fields.
Well, queries also support pagination using `offset` and `limit`.
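For example, a search request body (the full route is documented below) asking for the second page of ten results could look like this:

```json
{
  "query": "cute rabbits",
  "analyzer": { "tokenize": true, "lowercase": true },
  "offset": 10,
  "limit": 10
}
```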
The movies example
This repository contains a `movies` folder.
Inside, there is a `movies.json` file. It's the Movies Dataset from Meilisearch. It weighs 19MB and contains 31968 movie descriptions.
There is also an `index.py` script. It will index the `title` and `overview` fields. It expects a Purlu server to be available at `localhost:1312`.
Finally, there is a `search.html` document that provides a simple interface for searching the indexed dataset.
On my laptop (which is a potato), the query `naruto` finishes in less than a millisecond.
HTTP JSON API
`analyzer` objects
Schema fields and queries optionally take an `analyzer` object.
This object is used to configure text analysis.
It can contain three fields, all optional and all disabled by default (which is also the case when the object is not defined):
- a `tokenize` boolean; if `true`, the text will be segmented into words;
- a `lowercase` boolean; if `true`, the text will be lowercased;
- a `stemmer_language` string; if defined, the text will be lowercased (overriding the `lowercase` boolean) and stemmed.
Here is the list of available `stemmer_language` values: `arabic`, `danish`, `dutch`, `english`, `french`, `german`, `greek`, `hungarian`, `italian`, `norwegian`, `portuguese`, `romanian`, `russian`, `spanish`, `swedish`, `tamil`, `turkish`.
Here is an example `analyzer` object:
{
"lowercase": true
}
POST /indexes/:index_id
This route will create an index, replacing any existing index with the same identifier.
Purlu expects the request body to be a JSON object containing:
- a schema, specifically a list of `fields`, which are objects made up of:
  - a `name` string;
  - optionally an `analyzer` object;
  - optionally an `indexed` boolean, which is `false` by default;
  - optionally a `stored` boolean, which is `false` by default;
- and a `documents` list.
Here is an example request:
{
"fields": [
{
"name": "id",
"stored": true
},
{
"name": "title",
"analyzer": {
"tokenize": true,
"stemmer_language": "english"
},
"indexed": true,
"stored": true
},
{
"name": "description",
"analyzer": {
"tokenize": true,
"stemmer_language": "english"
},
"indexed": true
}
],
"documents": [
{
"id": "12",
"title": "On cute rabbits",
"description": "Cute rabbits are so cute!"
}
]
}
There is no reserved field name.
You may omit any field in the documents.
If a document has fields with a name not declared in the schema (the `fields` list), these fields will be ignored.
Document fields can only be strings. If you wish to store a number, for example an identifier from your main database, you'll need to send it as a string.
DELETE /indexes/:index_id
This route will delete an index if it exists and do nothing otherwise.
POST /indexes/:index_id/search
This route will search an index.
Purlu expects the request body to be a JSON object containing:
- a `query` string;
- optionally an `analyzer` object;
- optionally a `boosts` object mapping field names (those not declared in the schema will be ignored) to a weight (`1.0` by default);
- and optionally an `offset` (`0` by default) and a `limit` (no limit by default).
Here is an example request:
{
"query": "Where are all the cute rabbits?",
"analyzer": {
"tokenize": true,
"stemmer_language": "english"
},
"boosts": {
"title": 2.0
}
}
And an example response:
{
"count": 1,
"hits": [
{
"score": 1.23456789,
"values": {
"id": "12",
"title": "On cute rabbits"
}
}
]
}
Only `stored` fields will be present in `values` objects.