bin+lib hyprstream

High-performance metrics storage and query service using Arrow Flight SQL

7 releases

new 0.1.0-alpha-6 Jan 7, 2025
0.1.0-alpha-1 Jan 6, 2025
0.0.1-pre-alpha2 Jan 6, 2025

Apache-2.0

Hyprstream: Real-time Aggregation Windows and High-Performance Cache for Apache Arrow Flight SQL 🚀

📄 Read our DRAFT Technical Paper: Hyprstream: A Unified Architecture for Multimodal Data Processing and Real-Time Foundational Model Inference

⚠️ PRE-RELEASE: Hyprstream is an alpha work in progress in the early stages of development and is not yet ready for production use ⚠️

🌟 Hyprstream is a next-generation application for real-time data ingestion, windowed aggregation, caching, and serving. Built on Apache Arrow Flight and DuckDB, and developed in Rust, Hyprstream dynamically calculates metrics like running sums, counts, and averages, enabling blazing-fast data workflows, intelligent caching, and seamless integration with ADBC-compliant datastores. Its real-time aggregation capabilities empower AI/ML pipelines and analytics with instant insights. 💾✨

🚧 This product is in preview during rapid early development. While we're laying the groundwork for advertised capabilities, there are known bugs 🐛, partially implemented features 🔨, and frequent updates ahead 🔄. Your feedback and collaboration will be invaluable in shaping the project's direction 🌱.

✨ Key Features

📥 Data Ingestion via Apache Arrow Flight

  • 🔄 Streamlined Ingestion: Ingests data efficiently using Arrow Flight, an advanced columnar data transport protocol
  • ⚡ Real-Time Streaming: Supports real-time metrics, datasets, and vectorized data for analytics and AI/ML workflows
  • 💾 Write-Through to ADBC: Ensures data consistency with immediate caching and write-through to backend datastores
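The write-through behaviour in the last bullet can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not Hyprstream's implementation: `WriteThroughStore` is a hypothetical name, and plain dictionaries stand in for the ADBC backend and the DuckDB cache.

```python
from typing import Dict, Optional


class WriteThroughStore:
    """Hypothetical write-through store: each write goes to the backend
    first and then to the cache, so the cache never holds unpersisted data."""

    def __init__(self) -> None:
        # Plain dicts stand in for the ADBC backend and the DuckDB cache.
        self.backend: Dict[str, float] = {}
        self.cache: Dict[str, float] = {}

    def write(self, key: str, value: float) -> None:
        self.backend[key] = value  # persist to the backend first
        self.cache[key] = value    # then cache immediately

    def read(self, key: str) -> Optional[float]:
        if key in self.cache:          # cache hit
            return self.cache[key]
        value = self.backend.get(key)  # cache miss: fall back to the backend
        if value is not None:
            self.cache[key] = value    # populate the cache for the next read
        return value


store = WriteThroughStore()
store.write("cpu.load", 0.75)
print(store.read("cpu.load"))  # prints 0.75
```

Writing to the backend before the cache means a failed backend write leaves no stale cache entry behind, which is one common way to get the consistency the bullet describes.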

🧠 Intelligent Read Caching with DuckDB

  • ⚡ In-Memory Performance: Uses DuckDB for lightning-fast caching of frequently accessed data
  • 🎯 Optimized Querying: Stores query results and intermediate computations for analytics workloads
  • 🔄 Automatic Management: Handles caching transparently with configurable expiry policies
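To make "configurable expiry policies" concrete, here is a minimal time-to-live cache sketch in Python. It is an assumption-laden illustration rather than Hyprstream's DuckDB-backed cache: `ExpiringCache`, `ttl_seconds`, and the injectable `clock` are hypothetical names introduced for this example.

```python
import time
from typing import Any, Callable, Dict, Optional, Tuple


class ExpiringCache:
    """Hypothetical TTL cache: entries are dropped once they are older
    than ttl_seconds. The clock is injectable so expiry can be tested."""

    def __init__(self, ttl_seconds: float,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.ttl_seconds = ttl_seconds
        self.clock = clock
        # Maps key -> (expiry time, cached value).
        self._entries: Dict[str, Tuple[float, Any]] = {}

    def put(self, key: str, value: Any) -> None:
        self._entries[key] = (self.clock() + self.ttl_seconds, value)

    def get(self, key: str) -> Optional[Any]:
        entry = self._entries.get(key)
        if entry is None:
            return None
        expires_at, value = entry
        if self.clock() >= expires_at:
            del self._entries[key]  # lazily evict the expired entry
            return None
        return value


# A fake clock makes the expiry behaviour easy to see.
now = [0.0]
cache = ExpiringCache(ttl_seconds=60, clock=lambda: now[0])
cache.put("q1", [1, 2, 3])
print(cache.get("q1"))  # prints [1, 2, 3]
now[0] = 61.0
print(cache.get("q1"))  # prints None
```

Eviction here is lazy (checked on read); a real cache might also sweep expired entries in the background.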

📊 Real-Time Aggregation

  • 📈 Dynamic Metrics: Maintains running sums, counts, and averages for real-time insights
  • ⏱️ Time Window Partitioning: Supports fixed time windows (e.g., 5m, 30m, hourly, daily) for granular analysis
  • 🎯 Lightweight State: Maintains only aggregate states for efficient memory usage
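The three bullets above fit together as follows: each sample is assigned to a fixed window by flooring its timestamp, and only a count and a running sum are kept per window. The Python sketch below illustrates the idea under stated assumptions; `WindowState` and `WindowedAggregator` are hypothetical names, timestamps are integer seconds, and this is not Hyprstream's actual code.

```python
from dataclasses import dataclass
from typing import Dict, Tuple


@dataclass
class WindowState:
    """Aggregate state for one window: no raw samples are retained."""
    count: int = 0
    total: float = 0.0

    @property
    def average(self) -> float:
        return self.total / self.count if self.count else 0.0


class WindowedAggregator:
    """Hypothetical fixed-window aggregator keyed by (metric_id, window start)."""

    def __init__(self, window_seconds: int) -> None:
        self.window_seconds = window_seconds
        self.windows: Dict[Tuple[str, int], WindowState] = {}

    def add(self, metric_id: str, timestamp: int, value: float) -> WindowState:
        # Floor the timestamp to the start of its fixed window.
        window_start = timestamp - (timestamp % self.window_seconds)
        state = self.windows.setdefault((metric_id, window_start), WindowState())
        state.count += 1      # running count
        state.total += value  # running sum
        return state


agg = WindowedAggregator(window_seconds=300)  # 5-minute windows
agg.add("cpu", 0, 1.0)
agg.add("cpu", 299, 3.0)
agg.add("cpu", 300, 10.0)  # falls into the next window
first = agg.windows[("cpu", 0)]
print(first.count, first.total, first.average)  # prints 2 4.0 2.0
```

Because the average is derived from the count and sum on demand, memory use grows with the number of active windows rather than the number of samples.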

🌐 Data Serving with Arrow Flight SQL

  • ⚡ High-Performance Queries: Serves cached data via Arrow Flight SQL for minimal latency
  • 🔢 Vectorized Data: Optimized for AI/ML pipelines and analytical queries
  • 🔌 Seamless Integration: Connects with analytics and visualization tools

🌟 Benefits

  • ⚡ Low Latency: Millisecond-level query responses for cached data
  • 📈 Scalable: Handles large-scale data workflows with ease
  • 🔗 Flexible: Integrates with Postgres, Redis, Snowflake, and other ADBC datastores
  • 🤖 AI/ML Ready: Optimized for vectorized data and inference pipelines
  • 📊 Real-Time Metrics: Dynamic calculation of statistical metrics
  • ⏱️ Time Windows: Granular control of metrics with configurable windows
  • 🦀 Rust-Powered: High-performance, memory-safe implementation

🔜 Coming Soon

Hyprstream is actively developing several exciting features:

🧠 Real-Time Model Integration

  • 📦 Direct storage of foundational models in Arrow format
  • 🚀 Zero-copy GPU access for model weights
  • 🔄 Layer-specific updates and fine-tuning

🔮 Advanced Processing

  • 🔄 Multimodal data fusion with real-time embedding generation
  • ⚡ CUDA-optimized operations with custom kernels
  • 📊 Advanced time-series window operations
  • 🎥 Neural Radiance Fields (NeRF) integration for video processing

🚀 Performance & Scale

  • 📦 Multi-tiered storage system with intelligent caching
  • 🌐 Distributed training and gradient accumulation
  • ⚡ GPU-accelerated query execution
  • 🔄 Predictive layer prefetching

🔒 Security & Privacy

  • 🔐 Encrypted model weight storage
  • 🛡️ Privacy-preserving training with differential privacy
  • 📝 Comprehensive audit logging
  • 🔑 Fine-grained access control

For detailed technical information about these upcoming features, please refer to our technical paper.

📊 Ecosystem Integration

Hyprstream is designed to work seamlessly with existing data infrastructure:

🔗 Storage & Analytics

  • Works with any ADBC-compliant database (PostgreSQL, Snowflake, etc.) as a backend store
  • Uses DuckDB for high-performance caching and analytics
  • Integrates with Arrow ecosystem tools for data processing and analysis

🔄 Real-time Processing

  • Complements stream processing systems by providing a fast caching layer
  • Can serve as a real-time metrics store for monitoring solutions
  • Enables quick access to recent data while maintaining historical records

🤖 AI/ML Pipeline Integration (Coming Soon)

  • Will provide zero-copy access to model weights and embeddings
  • Designed to work alongside vector databases and ML serving platforms
  • Future support for real-time model updates and fine-tuning

🛠️ Developer Tools

  • Native Arrow Flight SQL support for seamless client integration
  • Compatible with popular data science tools and frameworks
  • Language-agnostic API for broad ecosystem compatibility

Hyprstream focuses on being a great citizen in the modern data stack, enhancing rather than replacing existing tools.

🚀 Getting Started

  1. 📥 Install Hyprstream:

    cargo install hyprstream
    
  2. 🏃 Start the server with default configuration:

    hyprstream
    
  3. 🔌 Use with PostgreSQL backend (requires PostgreSQL ADBC driver):

    # Set backend-specific credentials securely via environment variables
    export HYPRSTREAM_ENGINE_USERNAME=postgres
    export HYPRSTREAM_ENGINE_PASSWORD=secret
    
    # Start Hyprstream with connection details (but without credentials)
    hyprstream \
      --engine adbc \
      --engine-connection "postgresql://localhost:5432/metrics?pool_max=10&pool_min=1&connect_timeout=30" \
      --engine-options driver_path=/usr/local/lib/libadbc_driver_postgresql.so \
      --enable-cache \
      --cache-engine duckdb \
      --cache-connection ":memory:"
    

For configuration options and detailed documentation, run:

hyprstream --help

Or visit our 📚 API Documentation for comprehensive guides and examples.

💡 Example Usage

🚀 Quick Start with ADBC

Hyprstream implements the Arrow Flight SQL protocol, so any Flight SQL client can connect to it, including the ADBC Flight SQL driver used here:

import adbc_driver_flightsql.dbapi

# Connect to Hyprstream using standard ADBC. The context managers close the
# cursor and the connection even if the query raises an error.
with adbc_driver_flightsql.dbapi.connect("grpc://localhost:50051") as conn:
    with conn.cursor() as cursor:
        # Query metrics with time windows
        cursor.execute("""
            SELECT
                metric_id,
                COUNT(*) as samples,
                AVG(value_running_window_avg) as avg_value
            FROM metrics
            WHERE timestamp >= NOW() - INTERVAL '1 hour'
            GROUP BY metric_id
            ORDER BY avg_value DESC
        """)

        # fetch_arrow_table() returns a pyarrow.Table; to_pandas() requires pandas
        results = cursor.fetch_arrow_table()
        print(results.to_pandas())

🤝 Contributing

We welcome contributions! Please feel free to submit a Pull Request.

📄 License

This project is licensed under the Apache 2.0 License - see the LICENSE file for details.


For inquiries or support, contact us at 📧 support@hyprstream.com or visit our GitHub repository to contribute! 🌐
