Azof

Query tables in object storage as of event time.

Azof is a lakehouse format with time-travel capabilities that allows you to query data as it existed at any point in time, based on when events actually occurred rather than when they were recorded.

What Problem Does Azof Solve?

Traditional data lakehouse formats allow time travel based on when data was written (processing time). Azof instead focuses on event time - the time when events actually occurred in the real world. This distinction is crucial for:

  • Late-arriving data: Process data that arrives out of order without rewriting history; for example, a record whose event occurred on March 14 but that was only ingested on March 16 still appears in a query as of March 14
  • Consistent historical views: Get consistent snapshots of data as it existed at specific points in time
  • High-cardinality datasets with frequent updates: Efficiently handle business processes (sales, support, project management, financial data) and slowly changing dimensions
  • Point-in-time analysis: Analyze the state of your data exactly as it was at any moment

Key Features

  • Event time-based time travel: Query data based on when events occurred, not when they were recorded
  • Efficient storage of updates: Preserves compacted snapshots of state to minimize storage and query overhead
  • Hierarchical organization: Uses segments and delta files to efficiently organize temporal data
  • Tunable compaction policy: Adjust based on your data distribution patterns
  • SQL integration: Query using DataFusion with familiar SQL syntax
  • Integration with object storage: Works with any object store (local, S3, etc.); see the sketch after this list
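
The feature list above does not show how a store is wired up; in the Rust data ecosystem, object storage backends are usually constructed through the object_store crate. The sketch below shows how local and S3-backed stores are typically built with that crate. The bucket name is hypothetical, and whether azof accepts a store handle in exactly this form is an assumption to check against the crate documentation.

use std::sync::Arc;

use object_store::ObjectStore;
use object_store::aws::AmazonS3Builder; // requires the `aws` feature of object_store
use object_store::local::LocalFileSystem;

// Sketch only: build an object store handle for local development or S3.
fn build_store(use_s3: bool) -> Result<Arc<dyn ObjectStore>, object_store::Error> {
    if use_s3 {
        // Credentials and region are read from the standard AWS environment variables.
        let s3 = AmazonS3Builder::from_env()
            .with_bucket_name("my-lakehouse-bucket") // hypothetical bucket
            .build()?;
        Ok(Arc::new(s3))
    } else {
        // A local directory standing in for object storage during development.
        let local = LocalFileSystem::new_with_prefix("./test-data")?;
        Ok(Arc::new(local))
    }
}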

Project Structure

The Azof project is organized as a Rust workspace with multiple crates:

  • azof: The core library providing the lakehouse format functionality
  • azof-cli: A CLI utility demonstrating how to use the library
  • azof-datafusion: DataFusion integration for SQL queries

Getting Started

To build all projects in the workspace:

cargo build --workspace

Using the CLI

The azof-cli provides a command-line interface for interacting with azof:

# Scan a table (current version)
cargo run -p azof-cli -- scan --path ./test-data --table table0

# Scan a table as of a specific event time
cargo run -p azof-cli -- scan --path ./test-data --table table0 --as-of "2024-03-15T14:30:00"

# Generate a test Parquet file from CSV
cargo run -p azof-cli -- gen --path ./test-data --table table2 --file base

DataFusion Integration

The azof-datafusion crate provides integration with Apache DataFusion, allowing you to:

  1. Register Azof tables in a DataFusion context
  2. Run SQL queries against Azof tables
  3. Perform time-travel queries using the AsOf functionality

Example

use azof_datafusion::context::ExecutionContext;

async fn query_azof() -> Result<(), Box<dyn std::error::Error>> {
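    // Execution context pointed at the Azof storage location.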
    let ctx = ExecutionContext::new("/path/to/azof");

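    // The AT (...) clause after FROM runs the query as of the given event time.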
    let df = ctx
        .sql(
            "
    SELECT key as symbol, revenue, net_income
      FROM financials
        AT ('2019-01-17T00:00:00.000Z')
     WHERE industry IN ('Software')
     ORDER BY revenue DESC
     LIMIT 5;
     ",
        )
        .await?;

    df.show().await?;

    Ok(())
}

Run the example:

cargo run --example query_example -p azof-datafusion
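
The shipped example has its own entry point; if you copy query_azof into your own binary instead, you need an async runtime because DataFusion's APIs are async. A minimal sketch using Tokio (the dependency and feature flags are assumptions about your project setup):

// Minimal sketch: drive the async function with a Tokio runtime.
// Assumes tokio is a dependency with the "macros" and "rt-multi-thread" features.
#[tokio::main]
async fn main() -> Result<(), Box<dyn std::error::Error>> {
    query_azof().await
}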

If you install the CLI with cargo install --path crates/azof-cli, you can run it directly with:

azof-cli scan --path ./test-data --table table0

Project Roadmap

Azof is under development. The goal is to implement a data lakehouse with the following capabilities:

  • Atomic, non-concurrent writes (single writer)
  • Consistent reads
  • Schema evolution
  • Event time travel queries
  • Handling late-arriving data
  • Integration with an execution engine

Milestone 0

  • Script/tool for generating a sample key-value data set
  • Key-value reader
  • DataFusion table provider

Milestone 1

  • Support for multiple columns
  • Support for column data types
  • Projection pushdown
  • Projection pushdown in DataFusion table provider
  • DataFusion table provider with AS OF operator
  • Single row, key-value writer
  • Document spec
  • Delta -> snapshot compaction
  • Metadata validity checks

Milestone 2

  • Streaming in scan
  • Schema definition and evolution
  • Late-arriving data support
