20 unstable releases (3 breaking)
| new 0.7.9 | Apr 15, 2026 |
|---|---|
| 0.7.8 | Apr 10, 2026 |
| 0.7.0 | Mar 31, 2026 |
| 0.6.15 | Mar 24, 2026 |
| 0.3.10 | Jan 28, 2026 |
#886 in Database interfaces
1,469 downloads per month
Used in 5 crates
(3 directly)
3.5MB
53K
SLoC
hamelin_translation
This crate lowers Hamelin's type-checked AST into a backend-agnostic intermediate
representation (IR). It sits between hamelin_lib (parsing and type-checking)
and backend crates like hamelin_datafusion.
hamelin_lib (TypedAST)
↓
hamelin_translation (IR) ← this crate
↓
hamelin_datafusion, hamelin_trino, etc.
The TypedStatement AST goes through a series of normalization passes that
simplify and canonicalize the query structure. The resulting IRStatement
enforces invariants that make backend translation straightforward—backends
only need to handle a reduced set of commands with predictable shapes.
IR guarantees
Commands that get lowered away
These commands do not exist in the IR:
LET— fused intoSELECTbyfuse_projectionsDROP— fused intoSELECTbyfuse_projectionsWITHIN— converted toWHEREwith explicit timestamp boundsPARSE— converted toLET+WHERE(regex extraction + filter)UNNEST— converted toEXPLODE(if array) +LET+DROPNEST— converted toSELECTwith compound identifiers, then packedMATCH— lowered toFROM+LET+WHERE+WINDOW+WHERE+DROP
Identifier normalization
All assignments in the IR use simple identifiers. Compound paths like user.name
get packed into struct literals, so x.a = 1, x.b = 2 becomes x = {a: 1, b: 2}.
FROM aliases and JOIN/LOOKUP right sides get hoisted into named subqueries,
leaving only simple identifier references in the main pipeline.
Schema normalization
When FROM/UNION combines tables with differing schemas, the IR widens each table to a common schema via named subqueries. Array literals get similar treatment—if struct elements have different fields, they get widened to match the array's element type.
Window frames
WINDOW commands have pre-computed WindowFrame values rather than expressions.
Non-deterministic functions like now() and today() are disallowed in window
frame expressions since the frame bounds must be constant.
EXPLODE canonical form
After normalization, EXPLODE is always in canonical form: EXPLODE col = col.
The column explodes in place, keeping the same name but changing from an array
type to its element type.
JOIN normalization
Both JOIN and LOOKUP lower to IRJoinCommand—JOIN becomes an inner join,
LOOKUP becomes a left join. Missing ON conditions default to true for
cross join semantics.
Usage
use hamelin_translation::{lower, normalize_with};
// Simple case
let ir = lower(typed_statement)?;
// With options
let ir = normalize_with()
.with_timestamp_field("event_time")
.lower(typed_statement)?;
IR types
| Type | Description |
|---|---|
IRStatement |
Top-level: WITH clauses + main pipeline |
IRPipeline |
Sequence of IRCommands with output schema |
IRCommand |
Command with kind, span, and output schema |
IRExpression |
Wrapper around TypedExpression |
IRAssignment |
SimpleIdentifier + IRExpression pair |
Command kinds
| IR Command | Source Commands |
|---|---|
From |
FROM, UNION |
Where |
WHERE, WITHIN |
Select |
SELECT, LET, DROP, UNNEST |
Agg |
AGG |
Window |
WINDOW, MATCH |
Sort |
SORT |
Limit |
LIMIT |
Explode |
EXPLODE |
Join |
JOIN, LOOKUP |
Normalization passes
Lowering runs a series of normalization passes before converting to IR.
Statement-level passes
lower_match — Lowers MATCH into FROM + LET + WHERE + WINDOW + WHERE
DROP. It assigns a synthetic label per source row, aggregates those labels into a state string, and filters with a regex. This must run first so later passes can treat pattern matching as a normal pipeline.
Example (before):
MATCH a=events+ b=logs BY host WITHIN 5m
Example (after):
FROM a=events, b=logs
| LET __pattern_label = case(a IS NOT NULL: "a", b IS NOT NULL: "b")
| WHERE __pattern_label IS NOT NULL
| WINDOW __state = array_join(array_agg(__pattern_label), ",")
WITHIN 5m
BY host
| WHERE __pattern_label = "a"
AND regexp_like(__state, "^(a,)+b(,b)*")
| DROP __pattern_label, __state
JOIN/LOOKUP lowering (IR phase) — During IR lowering (IRStatement::from_typed),
JOIN/LOOKUP right sides are hoisted into synthetic CTEs with FROM <table> | NEST <alias>.
This preserves the original alias structure (e.g., users.id) while making the right
side a nested struct. Missing ON clauses default to true.
Example (before):
FROM events
| JOIN users ON user_id == users.id
Example (after IR lowering):
WITH __join_0 = FROM users | NEST users
FROM events
| JOIN __join_0 ON user_id == users.id
nest_from_aliases — Rewrites aliased FROM clauses into CTEs with NEST.
This turns FROM x=table into a named subquery that nests all fields under x.
The main pipeline then references the generated CTE name.
Example (before):
FROM x=events
| WHERE x.a > 10
Example (after):
WITH __alias_0 = FROM events | NEST x
FROM __alias_0
| WHERE x.a > 10
from_to_union — Converts multi-source FROM into UNION so later passes
only handle schema widening on UNION. Single-source FROM stays unchanged.
Aliased sources are handled earlier and are not converted here.
Example (before):
FROM events, logs
Example (after):
UNION events, logs
expand_union_schemas — For UNION sources with different schemas, builds
CTEs that project a widened schema with typed NULLs. Each source gets a SELECT
that matches the merged output schema before the UNION runs.
Example (before):
UNION events, logs
Example (after):
WITH __union_0 = FROM events
| SELECT timestamp, event_type, message = CAST(NULL AS String)
WITH __union_1 = FROM logs
| SELECT timestamp, event_type = CAST(NULL AS String), message
UNION __union_0, __union_1
Pipeline-level passes
normalize_within — Rewrites WITHIN into explicit timestamp bounds using
WHERE. It preserves now() so evaluation happens at query time, not during
normalization. Ranges and intervals both become timestamp >= ... AND timestamp <= ....
Example (before):
FROM events
| WITHIN -7d
Example (after):
FROM events
| WHERE timestamp >= now() + -7d AND timestamp <= now()
normalize_agg — Replaces compound identifiers in AGG with temporary simple
names. It then restores the original compound name with LET and removes the
temp with DROP. Group-by compound identifiers follow the same pattern.
Example (before):
FROM events
| AGG stats.total = sum(value) BY category
Example (after):
FROM events
| AGG __normalize_agg_0 = sum(value) BY category
| LET stats.total = __normalize_agg_0
| DROP __normalize_agg_0
normalize_window — Same compound-identifier lowering as normalize_agg, but
for WINDOW. It ensures WINDOW projections and group-bys use simple identifiers
in the command itself.
Example (before):
FROM events
| WINDOW stats.running = sum(value) BY category
Example (after):
FROM events
| WINDOW __normalize_window_0 = sum(value) BY category
| LET stats.running = __normalize_window_0
| DROP __normalize_window_0
extract_window_aggregates — Ensures WINDOW projections only contain
top-level aggregate calls. Nested aggregates are extracted into synthetic fields,
then recombined with LET and cleaned up with DROP.
Example (before):
FROM events
| WINDOW crazy = sum(left) + sum(right) SORT BY order
Example (after):
FROM events
| WINDOW __window_agg_0 = sum(left), __window_agg_1 = sum(right) SORT BY order
| LET crazy = __window_agg_0 + __window_agg_1
| DROP __window_agg_0, __window_agg_1
normalize_explode — Canonicalizes EXPLODE to EXPLODE col = col. If the
identifier is compound or the expression differs from the column name, it inserts
the necessary LET/DROP steps so the explode happens in place.
Example (before):
EXPLODE items.expanded = array_field
Example (after):
LET __explode_0 = array_field
| EXPLODE __explode_0 = __explode_0
| LET items.expanded = __explode_0
| DROP __explode_0
lower_unnest — Lowers UNNEST to LET/DROP, with an EXPLODE step for
arrays of structs. Complex expressions are assigned to a temp column first. The
original column is always dropped unless the user preserved it earlier.
Example (before):
UNNEST arr
Example (after):
EXPLODE arr = arr
| LET a = arr.a, b = arr.b
| DROP arr
lower_parse — Converts PARSE anchor patterns into regexp_extract in a
LET, then filters non-matching rows with WHERE (unless NODROP is used).
Throwaway identifiers (_) are skipped entirely.
Example (before):
PARSE message "user=* ip=* status=*" AS user_id, src_ip, status
Example (after):
LET user_id = regexp_extract(message, "(?s)user=(.*?) ip=(.*?) status=(.*?)", 1),
src_ip = regexp_extract(message, "(?s)user=(.*?) ip=(.*?) status=(.*?)", 2),
status = regexp_extract(message, "(?s)user=(.*?) ip=(.*?) status=(.*?)", 3)
| WHERE regexp_count(message, "(?s)user=(.*?) ip=(.*?) status=(.*?)") > 0
lower_nest — Replaces NEST with a SELECT that assigns each field to a
compound identifier under the target name. This avoids struct literals in the
typed AST, and later projection fusion packs them into structs.
Example (before):
FROM events
| NEST user
Example (after):
FROM events
| SELECT user.a = a, user.b = b
expand_array_literals — Widen struct elements inside array literals to match
the array's element type by inserting typed NULLs. If expansion would duplicate
complex expressions, the pass hoists them into LET/DROP around the command.
Example (before):
LET arr = [{a: 1}, {a: 2, b: 3}]
Example (after):
LET arr = [{a: 1, b: CAST(NULL AS Int)}, {a: 2, b: 3}]
fuse_projections — Fuses consecutive LET/DROP/SELECT commands into the
minimal number of SELECT steps. It respects dependency barriers when a LET
references a field assigned earlier in the pending projection.
Example (before):
FROM events
| SELECT a, b
| LET c = a + b
| DROP b
Example (after):
FROM events
| SELECT c = a + b, a
align_append_schema — Inserts a SELECT before APPEND so the pipeline
output matches the target table schema (order, missing fields, and extra fields).
It uses typed NULLs for missing columns and runs only on the main pipeline.
Example (before):
LET a = 1, b = 2
| APPEND my_table
Example (after):
LET a = 1, b = 2
| SELECT b, c = CAST(NULL AS Int), a
| APPEND my_table
Query-time evaluation
IRExpression::freeze() evaluates constant subexpressions at query execution
time. This resolves now(), today(), and other time-dependent functions to
concrete timestamp values:
// Before freeze: WHERE timestamp >= now() + -5h
// After freeze: WHERE timestamp >= ts("2024-01-15T10:00:00Z")
let frozen_expr = ir_expr.freeze();
Incremental analysis
The incremental module analyzes queries for incremental execution:
use hamelin_translation::incremental::{analyze, IncrementalStrategy};
let analysis = analyze(&ir_statement, "timestamp")?;
match analysis.strategy {
IncrementalStrategy::Stateless => { /* Simple filter/project */ }
IncrementalStrategy::Windowed { .. } => { /* Requires time window */ }
}
Dependencies
~32MB
~521K SLoC