#agent #ripgrep #search #embedding #code-search

app sift-search

Sift — a DSL for agents to search codebases in faster, deeper, and more sophisticated ways

1 unstable release

Uses new Rust 2024

0.1.0 Feb 23, 2026

#276 in Text processing

MIT license

82KB
1.5K SLoC

Sift

A search DSL for agents. Compose, parallelize, and fuse code searches.

Sift is a tiny Lisp that composes search backends (ripgrep, BM25, embeddings) into parallel pipelines. Instead of an agent making five sequential grep calls and merging results in application code, it writes one expression and gets back ranked, deduplicated, blended results — automatically parallelized across backends.

The CLI command is ag.

# Find callers of eval, excluding its definition
ag '(- (rg "eval\\(") (rg "pub fn eval"))'

# Blend three search methods, take top 10
ag '(top 10 (mix (sem "auth flow") (rg "authenticate") (lex "authentication")))'

# Sequential pipeline: find files with structs, then search those for impls
ag '(pipe (rg "pub struct") (rg "impl"))'

Why

Agents grep. Sometimes they grep well, sometimes they miss things. The gap between "run ripgrep" and "actually find what I need" is filled with orchestration code — multiple tool calls, deduplication, re-ranking, set operations. Sift collapses that into a single expression.

  • Parallel by default: (mix (rg "x") (sem "x")) runs both backends concurrently. Total time = slowest backend, not the sum.
  • Set algebra on results: intersection, union, difference. Find lines matching A AND B. Find callers minus definitions.
  • Ranked fusion: Reciprocal Rank Fusion blends results from different backends into a single ranked list.
  • Three backends: rg (exact grep), lex (BM25 via tantivy), sem (embeddings via ONNX). Feature-gated so you only compile what you need.
  • Sequential pipelines: (pipe source target) runs source first, then scopes target to matching files.
  • Auto mode: ag "query" without parens auto-wraps as ripgrep search.
  • Single binary: <2MB default, ~10MB with all features. No runtime, no config.
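
The set algebra in the list above can be pictured as ordinary set operations over hits keyed by file and line. This is a minimal sketch, assuming a hypothetical HitKey of (path, line number); Sift's real Hit and ResultSet types are not shown in this README and may differ.

```rust
use std::collections::BTreeSet;

/// Hypothetical hit identity: (file path, 1-based line number).
type HitKey = (String, u32);

/// Build a result set from (path, line) pairs.
fn hits(items: &[(&str, u32)]) -> BTreeSet<HitKey> {
    items.iter().map(|(p, l)| (p.to_string(), *l)).collect()
}

/// Models `(& a b)`: lines present in both result sets.
fn intersect(a: &BTreeSet<HitKey>, b: &BTreeSet<HitKey>) -> BTreeSet<HitKey> {
    a.intersection(b).cloned().collect()
}

/// Models `(| a b)`: lines present in either result set.
fn union_hits(a: &BTreeSet<HitKey>, b: &BTreeSet<HitKey>) -> BTreeSet<HitKey> {
    a.union(b).cloned().collect()
}

/// Models `(- a b)`: lines in `a` that are not in `b`.
fn difference(a: &BTreeSet<HitKey>, b: &BTreeSet<HitKey>) -> BTreeSet<HitKey> {
    a.difference(b).cloned().collect()
}
```

With callers in one set and the definition line in another, difference drops the definition and keeps every other call site, which is the "callers minus definitions" pattern.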

Install

# From crates.io (default: rg backend only, <2MB)
cargo install sift-search

# With BM25 indexing
cargo install sift-search --features lex

# With semantic embeddings
cargo install sift-search --features sem

# Everything
cargo install sift-search --features full

# From source
cargo install --path .

Requires ripgrep (the rg binary) on your PATH, because the rg backend shells out to it. With Homebrew: brew install ripgrep

Quick Start

# Simple grep
ag '(rg "TODO")'
ag -g "TODO"                          # shorthand
ag "TODO"                             # auto mode (wraps as rg)

# Intersection: lines with both patterns
ag '(& (rg "async") (rg "tokio"))'

# Difference: callers minus definition
ag '(- (rg "eval\\(") (rg "pub fn eval"))'

# Sequential pipeline: narrow then refine
ag '(pipe (rg "pub struct") (rg "impl"))'

# Top 5 results
ag '(top 5 (rg "unsafe" :lang "rust"))'

# Blend with RRF
ag '(mix (rg "error") (rg "panic"))'

# BM25 search (requires --features lex)
ag '(lex "connection pool")'

# Semantic search (requires --features sem)
ag '(sem "error handling and recovery")'

# Output modes
ag --files '(rg "TODO")'              # paths only
ag --json '(rg "TODO")'               # machine-readable
ag --scores '(rg "TODO")'             # with relevance scores

# Index management
ag --index                            # build lex/sem indexes
ag --index-status                     # show index info
ag --index-clean                      # remove indexes

Docs

  • Language Reference — full syntax, all forms, execution model
  • Cheatsheet — dense single-page reference (ideal for LLM context)
  • Examples — runnable .sq files covering every feature

Examples

Every example searches this repo and is tested in CI.

File                Feature       What it does
basics.sq           rg            Simple pattern search
set-operations.sq   -             Difference: callers minus definitions
intersection.sq     &             Lines matching ALL patterns
union.sq            |             Lines matching ANY pattern
ranking.sq          mix, top      RRF blend + top-k
weighted-mix.sq     mix [w ...]   Weighted RRF blend
threshold.sq        >             Score threshold filter
filters.sq          :lang, :x     File type + exclude filters
let-bindings.sq     let           Named intermediate results
pipe.sq             pipe          Sequential pipeline search
agent-patterns.sq   combined      Multi-step agent strategies
output-modes.sq     output        files, scores, json rendering

Run any example: ag -f examples/basics.sq

Architecture

ag '(& (rg "async") (rg "fn"))'

        &
       / \
     rg   rg      <- parallel tokio tasks
      \  /
    intersect      <- fuse results
        |
     output

Eight modules, one binary:

Module   Purpose
core     Hit, ResultSet, Score, Expr AST, errors
parse    S-expression tokenizer + recursive descent parser
rg       Ripgrep backend (shells out to rg --json)
lex      BM25 backend via tantivy (feature-gated)
sem      Embedding backend via ONNX Runtime (feature-gated)
fusion   RRF, intersect, union, difference, top-k, threshold
eval     Async evaluator: thin dispatcher, parallel fan-out
util     Shared helpers: file filtering, lang matching, glob
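
To make the parse row concrete, here is a rough sketch of an S-expression tokenizer and recursive descent parser. The Sexp and Token types, the function names, and the rule that a backslash escapes the next character inside a string are all assumptions for illustration, not Sift's actual parser.

```rust
/// Hypothetical parse tree: a bare atom (symbol, number, keyword), a string, or a list.
#[derive(Debug, PartialEq)]
enum Sexp {
    Atom(String),
    Str(String),
    List(Vec<Sexp>),
}

#[derive(Debug, PartialEq)]
enum Token {
    Open,
    Close,
    Atom(String),
    Str(String),
}

fn tokenize(src: &str) -> Result<Vec<Token>, String> {
    let mut tokens = Vec::new();
    let mut chars = src.chars().peekable();
    while let Some(&c) = chars.peek() {
        match c {
            '(' => { tokens.push(Token::Open); chars.next(); }
            ')' => { tokens.push(Token::Close); chars.next(); }
            c if c.is_whitespace() => { chars.next(); }
            '"' => {
                chars.next(); // consume the opening quote
                let mut s = String::new();
                loop {
                    match chars.next() {
                        Some('"') => break,
                        // Assumed rule: a backslash keeps the next character literally.
                        Some('\\') => match chars.next() {
                            Some(escaped) => s.push(escaped),
                            None => return Err("unterminated escape".to_string()),
                        },
                        Some(ch) => s.push(ch),
                        None => return Err("unterminated string".to_string()),
                    }
                }
                tokens.push(Token::Str(s));
            }
            _ => {
                // Read a bare atom up to the next delimiter.
                let mut atom = String::new();
                while let Some(&ch) = chars.peek() {
                    if ch.is_whitespace() || ch == '(' || ch == ')' || ch == '"' { break; }
                    atom.push(ch);
                    chars.next();
                }
                tokens.push(Token::Atom(atom));
            }
        }
    }
    Ok(tokens)
}

/// Recursive descent: one expression is an atom, a string, or a parenthesized list.
fn parse_expr(tokens: &[Token], pos: &mut usize) -> Result<Sexp, String> {
    let tok = tokens.get(*pos).ok_or("unexpected end of input")?;
    *pos += 1;
    match tok {
        Token::Atom(a) => Ok(Sexp::Atom(a.clone())),
        Token::Str(s) => Ok(Sexp::Str(s.clone())),
        Token::Close => Err("unexpected ')'".to_string()),
        Token::Open => {
            let mut items = Vec::new();
            loop {
                match tokens.get(*pos) {
                    Some(Token::Close) => { *pos += 1; return Ok(Sexp::List(items)); }
                    Some(_) => items.push(parse_expr(tokens, pos)?),
                    None => return Err("missing ')'".to_string()),
                }
            }
        }
    }
}

/// Parse exactly one expression and reject trailing input.
fn parse(src: &str) -> Result<Sexp, String> {
    let tokens = tokenize(src)?;
    let mut pos = 0;
    let expr = parse_expr(&tokens, &mut pos)?;
    if pos != tokens.len() {
        return Err("trailing input".to_string());
    }
    Ok(expr)
}
```

Under the assumed escape rule, the string "eval\\(" from the examples above becomes the regex eval\( after tokenizing.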

Feature Flags

Feature     Adds                    Binary size
(default)   rg backend + DSL        ~2MB
lex         tantivy BM25 indexing   ~5MB
sem         ONNX embedding search   ~8MB
full        everything              ~10MB

Technical Details

Reciprocal Rank Fusion (RRF)

Different search backends produce incomparable scores: ripgrep doesn't score at all, BM25 returns unbounded relevance scores built from term and document frequencies, and embeddings return cosine similarities. Comparing or averaging these raw numbers is meaningless.

RRF sidesteps this entirely by ignoring scores and using only rank position:

RRF_score(hit) = Σ  weight_i / (k + rank_i)

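Here the sum runs over the backends i that returned the hit, rank_i is the hit's rank in backend i's list, weight_i is that backend's weight, and k=60 is a smoothing constant. A hit ranked #1 by two backends scores higher than one ranked #1 by one and #50 by the other (with equal weights, 2/61 ≈ 0.033 versus 1/61 + 1/110 ≈ 0.025), regardless of what the raw scores were. This makes (mix (rg "x") (sem "x")) meaningful even though the backends measure completely different things.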

After fusion, scores are normalized to [0, 1] by dividing by the maximum, so thresholds and display are intuitive.
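
The formula and the normalization step can be sketched in a few lines of Rust. This is an illustrative version, not Sift's fusion module: the rrf function name, the string hit ids, and the 1-based ranks are assumptions.

```rust
use std::collections::HashMap;

/// Smoothing constant k from the formula above.
const K: f64 = 60.0;

/// Fuse ranked lists with weighted RRF.
/// Each input is (weight, hit ids in rank order, best first).
/// Returns (id, score normalized by the maximum), sorted best first.
fn rrf(lists: &[(f64, Vec<&str>)]) -> Vec<(String, f64)> {
    let mut scores: HashMap<String, f64> = HashMap::new();
    for (weight, ranked) in lists {
        for (i, id) in ranked.iter().enumerate() {
            let rank = (i + 1) as f64; // assumption: ranks are 1-based
            *scores.entry(id.to_string()).or_insert(0.0) += weight / (K + rank);
        }
    }
    // Normalize by dividing by the maximum fused score.
    let max = scores.values().cloned().fold(0.0, f64::max);
    let mut fused: Vec<(String, f64)> = scores
        .into_iter()
        .map(|(id, s)| (id, if max > 0.0 { s / max } else { 0.0 }))
        .collect();
    // Sort by score descending; break ties by id so the order is deterministic.
    fused.sort_by(|a, b| b.1.partial_cmp(&a.1).unwrap().then_with(|| a.0.cmp(&b.0)));
    fused
}
```

Only the position of each hit in its list matters here, so a backend with no native scores can take part as long as its results arrive in some order.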

Parallel Fan-Out

Every combinator (&, |, mix, -) spawns its children as concurrent tokio tasks via futures::join_all. A query like:

ag '(mix (rg "auth") (rg "login") (rg "session"))'

runs three ripgrep processes simultaneously. Total latency = slowest child, not the sum. This extends to nested expressions — the evaluator recurses into children in parallel at every level of the AST.
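
The fan-out idea can be illustrated with the standard library alone. The sketch below uses std::thread::scope as a stand-in for the tokio and futures::join_all machinery described above, and fake_backend is a hypothetical placeholder that sleeps instead of running ripgrep.

```rust
use std::thread;
use std::time::Duration;

/// Hypothetical stand-in for a backend call: sleeps to simulate latency,
/// then returns a single hit.
fn fake_backend(name: &str, latency_ms: u64) -> Vec<String> {
    thread::sleep(Duration::from_millis(latency_ms));
    vec![format!("{name}: hit")]
}

/// Start every child before waiting on any of them, then collect the
/// results in child order.
fn fan_out(children: &[(&str, u64)]) -> Vec<Vec<String>> {
    thread::scope(|s| {
        // Collecting the handles first ensures all children are running
        // before the first join blocks.
        let handles: Vec<_> = children
            .iter()
            .map(|&(name, ms)| s.spawn(move || fake_backend(name, ms)))
            .collect();
        handles
            .into_iter()
            .map(|h| h.join().expect("child panicked"))
            .collect()
    })
}
```

Because all children start before any join, wall-clock time tracks the slowest child rather than the sum of all latencies.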

Sequential Pipelines

The pipe combinator provides tiered search — narrow first, refine second:

ag '(pipe (rg "pub struct") (rg "impl"))'

This evaluates the source (rg "pub struct") first, extracts the set of matching files, then rewrites the target expression to scope its searches to only those files. This powers patterns like "find files about authentication, then search those for SQL queries."
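
The two-step flow can be sketched over an in-memory corpus. Everything here is a simplification for illustration: grep is a toy substring matcher rather than ripgrep, and the scope is passed as a file set instead of rewriting the target expression.

```rust
use std::collections::BTreeSet;

/// Toy substring "grep" over an in-memory corpus of (path, contents).
/// Returns (path, 1-based line number) per matching line, optionally
/// restricted to a set of files.
fn grep(
    corpus: &[(&str, &str)],
    pattern: &str,
    scope: Option<&BTreeSet<String>>,
) -> Vec<(String, usize)> {
    let mut hits = Vec::new();
    for (path, text) in corpus {
        if let Some(files) = scope {
            if !files.contains(*path) {
                continue; // file is outside the scope set
            }
        }
        for (i, line) in text.lines().enumerate() {
            if line.contains(pattern) {
                hits.push((path.to_string(), i + 1));
            }
        }
    }
    hits
}

/// Models `(pipe source target)`: run the source, keep its matching files,
/// then run the target only inside those files.
fn pipe(corpus: &[(&str, &str)], source: &str, target: &str) -> Vec<(String, usize)> {
    let files: BTreeSet<String> = grep(corpus, source, None)
        .into_iter()
        .map(|(path, _line)| path)
        .collect();
    grep(corpus, target, Some(&files))
}
```

A file that contains impl but no pub struct never reaches the second search, which is what distinguishes pipe from a line-level intersection.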

AST as Sum Type, Evaluator as Thin Dispatcher

The entire DSL is a single enum (Expr) with one variant per form. The evaluator is a single match that delegates to one handler per variant — no if/else chains, no fallbacks, no type checks inside handlers. Adding a new form means adding one variant and one match arm.

Expr = Rg | Lex | Sem | And | Or | Mix | Diff | Pipe | Top | Threshold | ...
eval = match expr { Rg => ..., And => ..., Mix => ..., Pipe => ..., ... }
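
Spelled out in Rust, the shape looks roughly like the sketch below. It is deliberately reduced: Leaf is a hypothetical stand-in for the rg, lex, and sem variants and carries canned hits, only a few forms are modeled, and the evaluator is synchronous where the real one is async.

```rust
/// Ordered list of hit ids, best first.
type Hits = Vec<String>;

/// One variant per form (a subset, for illustration).
enum Expr {
    /// Hypothetical stand-in for a backend call: carries canned hits.
    Leaf(Vec<&'static str>),
    And(Vec<Expr>),
    Or(Vec<Expr>),
    Diff(Box<Expr>, Box<Expr>),
    Top(usize, Box<Expr>),
}

/// Thin dispatcher: one match arm per variant, one handler per arm.
fn eval(expr: &Expr) -> Hits {
    match expr {
        Expr::Leaf(hits) => eval_leaf(hits),
        Expr::And(children) => eval_and(children),
        Expr::Or(children) => eval_or(children),
        Expr::Diff(keep, remove) => eval_diff(keep, remove),
        Expr::Top(n, inner) => eval_top(*n, inner),
    }
}

fn eval_leaf(hits: &[&str]) -> Hits {
    hits.iter().map(|h| h.to_string()).collect()
}

/// Keep hits from the first child that every other child also returned.
fn eval_and(children: &[Expr]) -> Hits {
    let mut results = children.iter().map(eval);
    let first = results.next().unwrap_or_default();
    results.fold(first, |acc, next| {
        acc.into_iter().filter(|h| next.contains(h)).collect()
    })
}

/// Concatenate children in order, dropping duplicates.
fn eval_or(children: &[Expr]) -> Hits {
    let mut out = Hits::new();
    for hit in children.iter().flat_map(eval) {
        if !out.contains(&hit) {
            out.push(hit);
        }
    }
    out
}

fn eval_diff(keep: &Expr, remove: &Expr) -> Hits {
    let removed = eval(remove);
    eval(keep).into_iter().filter(|h| !removed.contains(h)).collect()
}

fn eval_top(n: usize, inner: &Expr) -> Hits {
    eval(inner).into_iter().take(n).collect()
}
```

Adding a form in this layout means one new variant, one new match arm, and one new handler; the existing handlers stay untouched.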

Why Shell Out to ripgrep

The rg backend runs ripgrep as a subprocess rather than linking it as a library. This keeps the binary small (~1.6MB), avoids pulling in ripgrep's dependency tree, and means rg is always the same version the user already has installed. Sift parses rg --json output, which gives structured match data with file paths, line numbers, and match offsets.

Positional Scoring for Unranked Backends

ripgrep returns matches in file order, not ranked by relevance. To make these results compatible with RRF (which needs ranks), hits are assigned positional scores — the first result gets score 1.0, linearly decreasing to 0.0 for the last. This preserves the "earlier matches are probably more relevant" heuristic from ripgrep's file-order traversal while giving RRF meaningful ranks to work with.
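
As a small worked example of the linear scheme, the sketch below assigns 1.0 to the first hit and 0.0 to the last. The function names are illustrative, and giving a lone hit the score 1.0 is an assumption, since the text does not say how a single result is scored.

```rust
/// Score for the hit at 0-based `index` out of `total` hits:
/// 1.0 for the first, 0.0 for the last, linear in between.
fn positional_score(index: usize, total: usize) -> f64 {
    if total <= 1 {
        return 1.0; // assumption: a lone hit gets the top score
    }
    1.0 - index as f64 / (total - 1) as f64
}

/// Scores for a whole result list, in output order.
fn positional_scores(total: usize) -> Vec<f64> {
    (0..total).map(|i| positional_score(i, total)).collect()
}
```

For five hits this yields 1.0, 0.75, 0.5, 0.25, 0.0, which preserves the original order when the hits are later ranked for fusion.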

Roadmap

  • [x] rg backend: exact grep, always fresh
  • [x] Combinators: &, |, mix, -, top, >
  • [x] Let bindings
  • [x] Output modes: files, scores, json
  • [x] Sequential pipelines: pipe
  • [x] Auto mode: ag "query" without parens
  • [x] lex backend: tantivy BM25 indexing (feature-gated)
  • [x] sem backend: embedding similarity via ONNX (feature-gated)
  • [x] ag --index: build/manage indexes
  • [x] 55 tests (24 unit + 31 integration)
  • [ ] Streaming progressive output
  • [ ] ag index --watch: background index daemon
  • [ ] Tree-sitter aware chunking for sem backend

License

MIT

Dependencies

~7–21MB
~355K SLoC