# Sift

A search DSL for agents. Compose, parallelize, and fuse code searches.
Sift is a tiny Lisp that composes search backends (ripgrep, BM25, embeddings) into parallel pipelines. Instead of an agent making five sequential grep calls and merging results in application code, it writes one expression and gets back ranked, deduplicated, blended results — automatically parallelized across backends.
The CLI command is `ag`.

```sh
# Find callers of eval, excluding its definition
ag '(- (rg "eval\\(") (rg "pub fn eval"))'

# Blend three search methods, take top 10
ag '(top 10 (mix (sem "auth flow") (rg "authenticate") (lex "authentication")))'

# Sequential pipeline: find files with structs, then search those for impls
ag '(pipe (rg "pub struct") (rg "impl"))'
```
## Why
Agents grep. Sometimes they grep well, sometimes they miss things. The gap between "run ripgrep" and "actually find what I need" is filled with orchestration code — multiple tool calls, deduplication, re-ranking, set operations. Sift collapses that into a single expression.
- **Parallel by default** — `(mix (rg "x") (sem "x"))` runs both backends concurrently. Total time = slowest backend, not the sum.
- **Set algebra on results** — intersection, union, difference. Find lines matching A AND B. Find callers minus definitions.
- **Ranked fusion** — Reciprocal Rank Fusion blends results from different backends into a single ranked list.
- **Three backends** — `rg` (exact grep), `lex` (BM25 via tantivy), `sem` (embeddings via ONNX). Feature-gated so you only compile what you need.
- **Sequential pipelines** — `(pipe source target)` runs source first, then scopes target to matching files.
- **Auto mode** — `ag "query"` without parens auto-wraps as a ripgrep search.
- **Single binary** — <2MB default, ~10MB with all features. No runtime, no config.
## Install

```sh
# From crates.io (default: rg backend only, <2MB)
cargo install sift-search

# With BM25 indexing
cargo install sift-search --features lex

# With semantic embeddings
cargo install sift-search --features sem

# Everything
cargo install sift-search --features full

# From source
cargo install --path .
```

Requires ripgrep: `brew install ripgrep`
## Quick Start

```sh
# Simple grep
ag '(rg "TODO")'
ag -g "TODO"          # shorthand
ag "TODO"             # auto mode (wraps as rg)

# Intersection: lines with both patterns
ag '(& (rg "async") (rg "tokio"))'

# Difference: callers minus definition
ag '(- (rg "eval\\(") (rg "pub fn eval"))'

# Sequential pipeline: narrow then refine
ag '(pipe (rg "pub struct") (rg "impl"))'

# Top 5 results
ag '(top 5 (rg "unsafe" :lang "rust"))'

# Blend with RRF
ag '(mix (rg "error") (rg "panic"))'

# BM25 search (requires --features lex)
ag '(lex "connection pool")'

# Semantic search (requires --features sem)
ag '(sem "error handling and recovery")'

# Output modes
ag --files '(rg "TODO")'    # paths only
ag --json '(rg "TODO")'     # machine-readable
ag --scores '(rg "TODO")'   # with relevance scores

# Index management
ag --index          # build lex/sem indexes
ag --index-status   # show index info
ag --index-clean    # remove indexes
```
## Docs

- Language Reference — full syntax, all forms, execution model
- Cheatsheet — dense single-page reference (ideal for LLM context)
- Examples — runnable `.sq` files covering every feature
## Examples

Every example searches this repo and is tested in CI.

| File | Feature | What it does |
|---|---|---|
| `basics.sq` | `rg` | Simple pattern search |
| `set-operations.sq` | `-` | Difference: callers minus definitions |
| `intersection.sq` | `&` | Lines matching ALL patterns |
| `union.sq` | `\|` | Lines matching ANY pattern |
| `ranking.sq` | `mix`, `top` | RRF blend + top-k |
| `weighted-mix.sq` | `mix [w ...]` | Weighted RRF blend |
| `threshold.sq` | `>` | Score threshold filter |
| `filters.sq` | `:lang`, `:x` | File type + exclude filters |
| `let-bindings.sq` | `let` | Named intermediate results |
| `pipe.sq` | `pipe` | Sequential pipeline search |
| `agent-patterns.sq` | combined | Multi-step agent strategies |
| `output-modes.sq` | output | files, scores, json rendering |

Run any example: `ag -f examples/basics.sq`
## Architecture

```sh
ag '(& (rg "async") (rg "fn"))'
```

```text
        &
       / \
     rg   rg      <- parallel tokio tasks
       \ /
    intersect     <- fuse results
        |
      output
```
Eight modules, one binary:

| Module | Purpose |
|---|---|
| `core` | `Hit`, `ResultSet`, `Score`, `Expr` AST, errors |
| `parse` | S-expression tokenizer + recursive descent parser |
| `rg` | Ripgrep backend (shells out to `rg --json`) |
| `lex` | BM25 backend via tantivy (feature-gated) |
| `sem` | Embedding backend via ONNX Runtime (feature-gated) |
| `fusion` | RRF, intersect, union, difference, top-k, threshold |
| `eval` | Async evaluator — thin dispatcher, parallel fan-out |
| `util` | Shared helpers: file filtering, lang matching, glob |
## Feature Flags

| Feature | Adds | Binary size |
|---|---|---|
| (default) | rg backend + DSL | ~2MB |
| `lex` | tantivy BM25 indexing | ~5MB |
| `sem` | ONNX embedding search | ~8MB |
| `full` | everything | ~10MB |
## Technical Details

### Reciprocal Rank Fusion (RRF)

Different search backends produce incomparable scores — ripgrep doesn't score at all, BM25 returns term frequencies, embeddings return cosine similarities. Comparing or averaging these raw numbers is meaningless.

RRF sidesteps this entirely by ignoring scores and using only rank position:

```text
RRF_score(hit) = Σ weight_i / (k + rank_i)
```

where `k = 60` is a smoothing constant. A hit ranked #1 by two backends scores higher than one ranked #1 by one backend and #50 by the other, regardless of what the raw scores were. This makes `(mix (rg "x") (sem "x"))` meaningful even though the backends measure completely different things.

After fusion, scores are normalized to [0, 1] by dividing by the maximum, so thresholds and display are intuitive.
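As a concrete illustration of the formula (a sketch with made-up hit keys, not the crate's actual `fusion` module), rank-based fusion plus max-normalization fits in a few lines of Rust:

```rust
use std::collections::HashMap;

/// Illustrative RRF over ranked lists of hit keys (e.g. "file:line").
/// `k = 60` is the smoothing constant from the formula above.
fn rrf_fuse(lists: &[Vec<&str>], weights: &[f64], k: f64) -> Vec<(String, f64)> {
    let mut scores: HashMap<String, f64> = HashMap::new();
    for (list, w) in lists.iter().zip(weights) {
        for (rank, hit) in list.iter().enumerate() {
            // rank is 1-based in the formula: the top hit adds w / (k + 1).
            *scores.entry(hit.to_string()).or_insert(0.0) += *w / (k + rank as f64 + 1.0);
        }
    }
    let mut fused: Vec<(String, f64)> = scores.into_iter().collect();
    fused.sort_by(|a, b| b.1.partial_cmp(&a.1).unwrap());
    // Normalize to [0, 1] by dividing by the maximum score.
    if let Some(max) = fused.first().map(|(_, s)| *s) {
        for (_, s) in fused.iter_mut() {
            *s /= max;
        }
    }
    fused
}

fn main() {
    // "b.rs:3" is ranked #2 by one backend and #1 by the other; it beats
    // hits that only a single backend saw, whatever their raw scores were.
    let rg_hits = vec!["a.rs:10", "b.rs:3", "c.rs:7"];
    let sem_hits = vec!["b.rs:3", "d.rs:1"];
    let fused = rrf_fuse(&[rg_hits, sem_hits], &[1.0, 1.0], 60.0);
    assert_eq!(fused[0].0, "b.rs:3");
    assert!((fused[0].1 - 1.0).abs() < 1e-9);
}
```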
### Parallel Fan-Out

Every combinator (`&`, `|`, `mix`, `-`) spawns its children as concurrent tokio tasks via `futures::join_all`. A query like:

```sh
ag '(mix (rg "auth") (rg "login") (rg "session"))'
```

runs three ripgrep processes simultaneously. Total latency = slowest child, not the sum. This extends to nested expressions — the evaluator recurses into children in parallel at every level of the AST.
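The fan-out shape can be sketched in plain Rust. Note that this illustration substitutes `std::thread` for the crate's actual tokio tasks and `futures::join_all`, and the closures are stand-ins for backend searches:

```rust
use std::thread;

// Combinator fan-out: each child search runs concurrently and the parent
// waits for all of them. The real evaluator uses tokio tasks joined with
// futures::join_all; OS threads show the same structure.
fn fan_out<T: Send + 'static>(children: Vec<Box<dyn FnOnce() -> T + Send>>) -> Vec<T> {
    let handles: Vec<_> = children
        .into_iter()
        .map(|child| thread::spawn(child))
        .collect();
    // Total latency tracks the slowest child, not the sum.
    handles.into_iter().map(|h| h.join().unwrap()).collect()
}

fn main() {
    let children: Vec<Box<dyn FnOnce() -> Vec<&'static str> + Send>> = vec![
        Box::new(|| vec!["a.rs:1"]), // stand-in for (rg "auth")
        Box::new(|| vec!["b.rs:2"]), // stand-in for (rg "login")
        Box::new(|| vec!["c.rs:3"]), // stand-in for (rg "session")
    ];
    let results = fan_out(children);
    assert_eq!(results.len(), 3);
}
```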
### Sequential Pipelines

The `pipe` combinator provides tiered search — narrow first, refine second:

```sh
ag '(pipe (rg "pub struct") (rg "impl"))'
```

This evaluates the source `(rg "pub struct")` first, extracts the set of matching files, then rewrites the target expression to scope its searches to only those files. This powers patterns like "find files about authentication, then search those for SQL queries."
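The scoping step can be sketched as follows (the `Hit` type here is hypothetical, and the real implementation rewrites the target expression to search only the source's files, which is cheaper than post-filtering as done below):

```rust
use std::collections::HashSet;

// Hypothetical hit type for illustration; the crate's actual Hit lives in core.
#[derive(Debug, PartialEq)]
struct Hit {
    file: String,
    line: u32,
}

// pipe semantics, approximated: keep only target hits whose file appeared
// in the source's result set.
fn scope_to_files(source_hits: &[Hit], target_hits: Vec<Hit>) -> Vec<Hit> {
    let files: HashSet<&str> = source_hits.iter().map(|h| h.file.as_str()).collect();
    target_hits
        .into_iter()
        .filter(|h| files.contains(h.file.as_str()))
        .collect()
}

fn main() {
    // Source: (rg "pub struct") matched in a.rs and b.rs.
    let source = vec![
        Hit { file: "a.rs".into(), line: 3 },
        Hit { file: "b.rs".into(), line: 9 },
    ];
    // Target: (rg "impl") matched in b.rs and c.rs; only b.rs survives.
    let target = vec![
        Hit { file: "b.rs".into(), line: 20 },
        Hit { file: "c.rs".into(), line: 5 },
    ];
    let scoped = scope_to_files(&source, target);
    assert_eq!(scoped, vec![Hit { file: "b.rs".into(), line: 20 }]);
}
```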
### AST as Sum Type, Evaluator as Thin Dispatcher

The entire DSL is a single enum (`Expr`) with one variant per form. The evaluator is a single `match` that delegates to one handler per variant — no if/else chains, no fallbacks, no type checks inside handlers. Adding a new form means adding one variant and one match arm.

```text
Expr = Rg | Lex | Sem | And | Or | Mix | Diff | Pipe | Top | Threshold | ...
eval = match expr { Rg => ..., And => ..., Mix => ..., Pipe => ..., ... }
```
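In concrete (if heavily trimmed) Rust, the shape looks like this. The variant set and handlers below are illustrative stand-ins, not the crate's actual `Expr` or async evaluator:

```rust
// Trimmed sum type: one variant per DSL form.
#[derive(Debug)]
enum Expr {
    Rg(String),
    And(Vec<Expr>),
    Top(usize, Box<Expr>),
}

// Thin dispatcher: one match arm per variant, no type checks inside handlers.
fn eval(expr: &Expr) -> Vec<String> {
    match expr {
        // Stand-in backend: a real handler would shell out to ripgrep.
        Expr::Rg(pattern) => vec![format!("hit for {pattern}")],
        // Stand-in fusion: a real handler would intersect child results
        // (and evaluate children in parallel).
        Expr::And(children) => children.iter().flat_map(eval).collect(),
        Expr::Top(n, child) => {
            let mut hits = eval(child);
            hits.truncate(*n);
            hits
        }
    }
}

fn main() {
    // (top 1 (& (rg "async") (rg "tokio")))
    let ast = Expr::Top(
        1,
        Box::new(Expr::And(vec![
            Expr::Rg("async".into()),
            Expr::Rg("tokio".into()),
        ])),
    );
    assert_eq!(eval(&ast), vec!["hit for async".to_string()]);
}
```

Adding a new form is exactly one new variant plus one new match arm; the compiler's exhaustiveness check flags any handler you forget.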
### Why Shell Out to ripgrep

The `rg` backend runs ripgrep as a subprocess rather than linking it as a library. This keeps the binary small (~1.6MB), avoids pulling in ripgrep's dependency tree, and means `rg` is always the same version the user already has installed. Sift parses `rg --json` output, which gives structured match data with file paths, line numbers, and match offsets.
### Positional Scoring for Unranked Backends

ripgrep returns matches in file order, not ranked by relevance. To make these results compatible with RRF (which needs ranks), hits are assigned positional scores — the first result gets score 1.0, linearly decreasing to 0.0 for the last. This preserves the "earlier matches are probably more relevant" heuristic from ripgrep's file-order traversal while giving RRF meaningful ranks to work with.
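The linear assignment described above can be sketched in a few lines (an assumed shape, not the crate's code; the single-hit edge case is handled explicitly to avoid dividing by zero):

```rust
// Positional scores for n unranked hits: first gets 1.0, last gets 0.0,
// linearly interpolated in between.
fn positional_scores(n: usize) -> Vec<f64> {
    if n <= 1 {
        // A lone hit (or none) gets no gradient; score it 1.0.
        return vec![1.0; n];
    }
    (0..n)
        .map(|i| 1.0 - i as f64 / (n as f64 - 1.0))
        .collect()
}

fn main() {
    let scores = positional_scores(5);
    assert_eq!(scores[0], 1.0); // first match in file order
    assert_eq!(scores[2], 0.5); // midpoint
    assert_eq!(scores[4], 0.0); // last match
}
```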
## Roadmap

- [x] `rg` backend — exact grep, always fresh
- [x] Combinators — `&`, `|`, `mix`, `-`, `top`, `>`
- [x] Let bindings
- [x] Output modes — files, scores, json
- [x] Sequential pipelines — `pipe`
- [x] Auto mode — `ag "query"` without parens
- [x] `lex` backend — tantivy BM25 indexing (feature-gated)
- [x] `sem` backend — embedding similarity via ONNX (feature-gated)
- [x] `ag --index` — build/manage indexes
- [x] 55 tests (24 unit + 31 integration)
- [ ] Streaming progressive output
- [ ] `ag index --watch` — background index daemon
- [ ] Tree-sitter aware chunking for sem backend
## License

MIT