azof 0.2.1 | released May 18, 2025
Azof
Query tables in object storage as of event time.
Azof is a lakehouse format with time-travel capabilities that allows you to query data as it existed at any point in time, based on when events actually occurred rather than when they were recorded.
What Problem Does Azof Solve?
Traditional data lakehouse formats allow time travel based on when data was written (processing time). Azof instead focuses on event time - the time when events actually occurred in the real world. This distinction is crucial for:
- Late-arriving data: Process data that arrives out of order without rewriting history
- Consistent historical views: Get consistent snapshots of data as it existed at specific points in time
- High cardinality datasets with frequent updates: Efficiently handle use cases involving business processes (sales, support, project management, financial data) or slowly changing dimensions
- Point-in-time analysis: Analyze the state of your data exactly as it was at any moment
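To make the event-time vs. processing-time distinction concrete, here is a minimal in-memory sketch (illustrative only, not Azof's API or storage format): each record carries both timestamps, and an "as of" lookup selects the latest record whose event time is at or before the query time, regardless of when it arrived.

```rust
// Hypothetical record type for illustration; azof's real format differs.
#[derive(Clone, Debug, PartialEq)]
struct Record {
    key: &'static str,
    value: i64,
    event_time: i64,      // when the event happened in the real world
    processing_time: i64, // when the record landed in storage
}

/// State of `key` as of `as_of` in *event time*: the record with the
/// greatest event_time <= as_of wins, regardless of arrival order.
fn as_of_event_time(records: &[Record], key: &str, as_of: i64) -> Option<i64> {
    records
        .iter()
        .filter(|r| r.key == key && r.event_time <= as_of)
        .max_by_key(|r| r.event_time)
        .map(|r| r.value)
}

fn main() {
    // A late-arriving correction: the event at t=10 was only written at t=30.
    let records = vec![
        Record { key: "acme", value: 100, event_time: 20, processing_time: 21 },
        Record { key: "acme", value: 50, event_time: 10, processing_time: 30 },
    ];
    // As of event time 15, the late-arriving record is visible: value 50.
    assert_eq!(as_of_event_time(&records, "acme", 15), Some(50));
    // As of event time 25, the t=20 record wins: value 100.
    assert_eq!(as_of_event_time(&records, "acme", 25), Some(100));
}
```

A processing-time format would have to rewrite history to expose the t=10 correction at event time 15; an event-time format simply filters on the event timestamp.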
Key Features
- Event time-based time travel: Query data based on when events occurred, not when they were recorded
- Efficient storage of updates: Preserves compacted snapshots of state to minimize storage and query overhead
- Hierarchical organization: Uses segments and delta files to efficiently organize temporal data
- Tunable compaction policy: Adjust based on your data distribution patterns
- SQL integration: Query using DataFusion with familiar SQL syntax
- Integration with object storage: Works with any object store (local, S3, etc.)
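The snapshot-plus-delta idea behind "efficient storage of updates" can be sketched as follows. This is a toy model with illustrative names, not Azof's on-disk layout: a compacted snapshot holds state up to some event time, delta files hold later updates, and reading "as of T" replays only the deltas with event time at or before T on top of the snapshot.

```rust
use std::collections::BTreeMap;

/// Reconstruct key/value state as of `as_of` by applying qualifying
/// deltas (event_time, key, value) in event-time order over a snapshot.
fn state_as_of(
    snapshot: &BTreeMap<String, i64>,
    deltas: &[(i64, String, i64)],
    as_of: i64,
) -> BTreeMap<String, i64> {
    let mut state = snapshot.clone();
    // Apply deltas in event-time order so the latest event wins per key.
    let mut applicable: Vec<&(i64, String, i64)> =
        deltas.iter().filter(|(t, _, _)| *t <= as_of).collect();
    applicable.sort_by_key(|(t, _, _)| *t);
    for (_, key, value) in applicable {
        state.insert(key.clone(), *value);
    }
    state
}

fn main() {
    let mut snapshot = BTreeMap::new();
    snapshot.insert("a".to_string(), 1);
    let deltas = vec![
        (10, "a".to_string(), 2),
        (20, "b".to_string(), 7),
    ];
    let s = state_as_of(&snapshot, &deltas, 15);
    assert_eq!(s.get("a"), Some(&2)); // delta at t=10 is applied
    assert_eq!(s.get("b"), None);     // delta at t=20 is not yet visible
}
```

Compaction in this model means folding old deltas into a new snapshot so that queries replay fewer files; a tunable policy decides how often that folding happens.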
Project Structure
The Azof project is organized as a Rust workspace with multiple crates:
- azof: The core library providing the lakehouse format functionality
- azof-cli: A CLI utility demonstrating how to use the library
- azof-datafusion: DataFusion integration for SQL queries
Getting Started
To build all projects in the workspace:
cargo build --workspace
Using the CLI
The azof-cli provides a command-line interface for interacting with azof:
# Scan a table (current version)
cargo run -p azof-cli -- scan --path ./test-data --table table0
# Scan a table as of a specific event time
cargo run -p azof-cli -- scan --path ./test-data --table table0 --as-of "2024-03-15T14:30:00"
# Generate test parquet file from CSV
cargo run -p azof-cli -- gen --path ./test-data --table table2 --file base
DataFusion Integration
The azof-datafusion crate provides integration with Apache DataFusion, allowing you to:
- Register Azof tables in a DataFusion context
- Run SQL queries against Azof tables
- Perform time-travel queries using the AsOf functionality
Example
use azof_datafusion::context::ExecutionContext;

async fn query_azof() -> Result<(), Box<dyn std::error::Error>> {
    // Point the context at the root of an Azof table store.
    let ctx = ExecutionContext::new("/path/to/azof");
    let df = ctx
        .sql(
            "
            SELECT key AS symbol, revenue, net_income
            FROM financials
            AT ('2019-01-17T00:00:00.000Z')
            WHERE industry IN ('Software')
            ORDER BY revenue DESC
            LIMIT 5;
            ",
        )
        .await?;
    df.show().await?;
    Ok(())
}
Run the example:
cargo run --example query_example -p azof-datafusion
If you install the CLI with cargo install --path crates/azof-cli, you can run it directly:
azof-cli scan --path ./test-data --table table0
Project Roadmap
Azof is under development. The goal is to implement a data lakehouse with the following capabilities:
- Atomic, non-concurrent writes (single writer)
- Consistent reads
- Schema evolution
- Event time travel queries
- Handling late-arriving data
- Integration with an execution engine
Milestone 0
- Script/tool for generating sample kv data set
- Key-value reader
- DataFusion table provider
Milestone 1
- Multiple columns support
- Support for additional column data types
- Projection pushdown
- Projection pushdown in DataFusion table provider
- DataFusion table provider with AS OF operator
- Single row, key-value writer
- Document spec
- Delta -> snapshot compaction
- Metadata validity checks
Milestone 2
- Streaming in scan
- Schema definition and evolution
- Late-arriving data support