Azof

Query tables in object storage as of event time.

Azof is a lakehouse format with time-travel capabilities that allows you to query data as it existed at any point in time, based on when events actually occurred rather than when they were recorded.

What Problem Does Azof Solve?

Traditional data lakehouse formats allow time travel based on when data was written (processing time). Azof instead focuses on event time - the time when events actually occurred in the real world. This distinction is crucial for:

  • Late-arriving data: Process data that arrives out of order without rewriting history; for example, a record whose event occurred on March 14 but that was only ingested on March 16 still appears in a query as of March 14
  • Consistent historical views: Get consistent snapshots of data as it existed at specific points in time
  • High-cardinality datasets with frequent updates: Efficiently handle business processes (sales, support, project management, financial data) and slowly changing dimensions
  • Point-in-time analysis: Analyze the state of your data exactly as it was at any moment

Key Features

  • Event time-based time travel: Query data based on when events occurred, not when they were recorded
  • Efficient storage of updates: Preserves compacted snapshots of state to minimize storage and query overhead
  • Hierarchical organization: Uses segments and delta files to efficiently organize temporal data
  • Tunable compaction policy: Adjust based on your data distribution patterns
  • SQL integration: Query using DataFusion with familiar SQL syntax
  • Integration with object storage: Works with any object store (local, S3, etc.); see the sketch after this list
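
The feature list above does not show how a store is wired up; in the Rust data ecosystem, object storage backends are usually constructed through the object_store crate. The sketch below shows how local and S3-backed stores are typically built with that crate. The bucket name is hypothetical, and whether azof accepts a store handle in exactly this form is an assumption to check against the crate documentation.

use std::sync::Arc;

use object_store::ObjectStore;
use object_store::aws::AmazonS3Builder; // requires the `aws` feature of object_store
use object_store::local::LocalFileSystem;

// Sketch only: build an object store handle for local development or S3.
fn build_store(use_s3: bool) -> Result<Arc<dyn ObjectStore>, object_store::Error> {
    if use_s3 {
        // Credentials and region are read from the standard AWS environment variables.
        let s3 = AmazonS3Builder::from_env()
            .with_bucket_name("my-lakehouse-bucket") // hypothetical bucket
            .build()?;
        Ok(Arc::new(s3))
    } else {
        // A local directory standing in for object storage during development.
        let local = LocalFileSystem::new_with_prefix("./test-data")?;
        Ok(Arc::new(local))
    }
}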

Project Structure

The Azof project is organized as a Rust workspace with multiple crates:

  • azof: The core library providing the lakehouse format functionality
  • azof-cli: A CLI utility demonstrating how to use the library
  • azof-datafusion: DataFusion integration for SQL queries

Getting Started

To build all projects in the workspace:

cargo build --workspace

Using the CLI

The azof-cli provides a command-line interface for interacting with azof:

# Scan a table (current version)
cargo run -p azof-cli -- scan --path ./test-data --table table0

# Scan a table as of a specific event time
cargo run -p azof-cli -- scan --path ./test-data --table table0 --as-of "2024-03-15T14:30:00"

# Generate a test Parquet file from CSV
cargo run -p azof-cli -- gen --path ./test-data --table table2 --file base

DataFusion Integration

The azof-datafusion crate provides integration with Apache DataFusion, allowing you to:

  1. Register Azof tables in a DataFusion context
  2. Run SQL queries against Azof tables
  3. Perform time-travel queries using the AsOf functionality

Example

use azof_datafusion::context::ExecutionContext;

async fn query_azof() -> Result<(), Box<dyn std::error::Error>> {
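    // Execution context pointed at the Azof storage location.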
    let ctx = ExecutionContext::new("/path/to/azof");

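    // The AT (...) clause after FROM runs the query as of the given event time.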
    let df = ctx
        .sql(
            "
    SELECT key as symbol, revenue, net_income
      FROM financials
        AT ('2019-01-17T00:00:00.000Z')
     WHERE industry IN ('Software')
     ORDER BY revenue DESC
     LIMIT 5;
     ",
        )
        .await?;

    df.show().await?;

    Ok(())
}

Run the example:

cargo run --example query_example -p azof-datafusion
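
The shipped example has its own entry point; if you copy query_azof into your own binary instead, you need an async runtime because DataFusion's APIs are async. A minimal sketch using Tokio (the dependency and feature flags are assumptions about your project setup):

// Minimal sketch: drive the async function with a Tokio runtime.
// Assumes tokio is a dependency with the "macros" and "rt-multi-thread" features.
#[tokio::main]
async fn main() -> Result<(), Box<dyn std::error::Error>> {
    query_azof().await
}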

If you install the CLI with cargo install --path crates/azof-cli, you can run it directly with:

azof-cli scan --path ./test-data --table table0

Project Roadmap

Azof is under development. The goal is to implement a data lakehouse with the following capabilities:

  • Atomic, non-concurrent writes (single writer)
  • Consistent reads
  • Schema evolution
  • Event time travel queries
  • Handling late-arriving data
  • Integration with an execution engine

Milestone 0

  • Script/tool for generating a sample key-value data set
  • Key-value reader
  • DataFusion table provider

Milestone 1

  • Support for multiple columns
  • Support for column data types
  • Projection pushdown
  • Projection pushdown in DataFusion table provider
  • DataFusion table provider with AS OF operator
  • Single row, key-value writer
  • Document spec
  • Delta -> snapshot compaction
  • Metadata validity checks

Milestone 2

  • Streaming in scan
  • Schema definition and evolution
  • Late-arriving data support
