PandRS


A high-performance DataFrame library for data analysis, implemented in Rust. Its features and design are inspired by Python's pandas library, combining fast data processing with type safety and distributed computing capabilities.

🚀 What's New

Enhanced DataFrame Operations:

  • Column Management: New rename_columns() and set_column_names() methods for flexible column renaming
  • Enhanced I/O: Real data extraction in Parquet/SQL operations with improved type safety
  • Distributed Processing: Production-ready DataFusion integration with schema validation and fault tolerance
  • Python Bindings: Complete feature coverage with optimized string pool integration

Performance & Reliability:

  • Comprehensive Testing: 100+ integration tests covering all major features and edge cases
  • Schema Validation: Type-safe distributed operations with compile-time validation
  • Fault Tolerance: Checkpoint/recovery system for robust distributed processing
  • Memory Optimization: Advanced string pool with up to 89.8% memory reduction

🏁 Quick Start

use pandrs::DataFrame;
use std::collections::HashMap;

// Create DataFrame with column management
let mut df = DataFrame::new();
df.add_column("name".to_string(), 
    pandrs::series::Series::from_vec(vec!["Alice", "Bob", "Carol"], Some("name")))?;

// Rename columns with a mapping
let mut rename_map = HashMap::new();
rename_map.insert("name".to_string(), "employee_name".to_string());
df.rename_columns(&rename_map)?;

// Set all column names at once
df.set_column_names(vec!["person_name".to_string()])?;

// Distributed processing with DataFusion
use pandrs::distributed::DistributedContext;
let mut context = DistributedContext::new_local(4)?;
context.register_dataframe("people", &df)?;
let result = context.sql("SELECT * FROM people WHERE person_name LIKE 'A%'")?;

Key Features

  • 🔥 High-Performance Processing: Column-oriented storage with up to 5x faster aggregations
  • 🧠 Memory Efficient: String pool optimization reducing memory usage by up to 89.8%
  • ⚡ Multi-core & GPU: Parallel processing + CUDA acceleration (up to 20x speedup)
  • 🌐 Distributed Computing: DataFusion-powered distributed processing for large datasets
  • 🐍 Python Integration: Full PyO3 bindings with pandas interoperability
  • 🔒 Type Safety: Rust's ownership system ensuring memory safety and thread safety
  • 📊 Rich Analytics: Statistical functions, ML metrics, and categorical data analysis
  • 💾 Flexible I/O: Parquet, CSV, JSON, SQL, and Excel support with real data extraction

Features

  • Series (1-dimensional array) and DataFrame (2-dimensional table) data structures
  • Support for missing values (NA)
  • Grouping and aggregation operations
  • Row labels with indexes
  • Multi-level indexes (hierarchical indexes)
  • CSV/JSON reading and writing
  • Parquet data format support
  • Basic operations (filtering, sorting, joining, etc.)
  • Aggregation functions for numeric data
  • Special operations for string data
  • Basic time series data processing
  • Categorical data types (efficient memory use, ordered categories)
  • Pivot tables
  • Visualization with text-based and high-quality graphs
  • Parallel processing support
  • Statistical analysis functions (descriptive statistics, t-tests, regression analysis, etc.)
  • Specialized categorical data statistics (contingency tables, chi-square tests, etc.)
  • Machine learning evaluation metrics (MSE, R², accuracy, F1, etc.)
  • Optimized implementation (column-oriented storage, lazy evaluation, string pool)
  • Modular high-performance implementation (functionality split into focused sub-modules)
  • GPU acceleration for matrix operations, statistical computation, and ML algorithms
  • Disk-based processing for very large datasets (see the memory-mapping sketch after this list)
  • Streaming data support with real-time analytics
  • Distributed processing for datasets exceeding single-machine capacity (experimental)
    • SQL-like query interface with DataFusion
    • Window function support for analytics
    • Advanced expression system for complex transformations
    • User-defined function support
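As a point of reference for the disk-based processing bullet above: such processing is typically built on memory-mapped files, and memmap2 appears in the dependency list below. A minimal sketch of the underlying mechanism, not PandRS's own API; the OS pages data in on demand, so a file larger than RAM can be scanned without loading it up front.

use std::fs::File;
use memmap2::Mmap;

fn count_rows(path: &str) -> std::io::Result<usize> {
    let file = File::open(path)?;
    // Map the file into the address space; pages are faulted in lazily.
    let mmap = unsafe { Mmap::map(&file)? };
    // Scan the mapped bytes like a slice, e.g. counting newline-terminated rows.
    Ok(mmap.iter().filter(|&&b| b == b'\n').count())
}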

Usage Examples

Creating and Basic Operations with DataFrames

use pandrs::{DataFrame, Series};

// Create series
let ages = Series::new(vec![30, 25, 40], Some("age".to_string()))?;
let heights = Series::new(vec![180, 175, 182], Some("height".to_string()))?;

// Add series to DataFrame
let mut df = DataFrame::new();
df.add_column("age".to_string(), ages)?;
df.add_column("height".to_string(), heights)?;

// Save as CSV
df.to_csv("data.csv")?;

// Load DataFrame from CSV
let df_from_csv = DataFrame::from_csv("data.csv", true)?;

// DataFrame column management
use std::collections::HashMap;

// Rename specific columns using a mapping
let mut rename_map = HashMap::new();
rename_map.insert("age".to_string(), "years_old".to_string());
rename_map.insert("height".to_string(), "height_cm".to_string());
df.rename_columns(&rename_map)?;

// Set all column names at once
df.set_column_names(vec!["person_age".to_string(), "person_height".to_string()])?;

Numeric Operations and Series Management

// Create numeric series
let mut numbers = Series::new(vec![10, 20, 30, 40, 50], Some("values".to_string()))?;

// Statistical calculations
let sum = numbers.sum();         // 150
let mean = numbers.mean()?;      // 30
let min = numbers.min()?;        // 10
let max = numbers.max()?;        // 50

// Series name management
numbers.set_name("updated_values".to_string());
assert_eq!(numbers.name(), Some(&"updated_values".to_string()));

// Create series with fluent API
let named_series = Series::new(vec![1, 2, 3], None)?
    .with_name("my_series".to_string());

// Convert series types with name preservation
let string_series = numbers.to_string_series()?;

Installation

Add the following to your Cargo.toml:

[dependencies]
pandrs = "0.1.0-alpha.4"

For GPU acceleration, add the CUDA feature flag (requires CUDA toolkit installation):

[dependencies]
pandrs = { version = "0.1.0-alpha.4", features = ["cuda"] }

Note: The CUDA feature requires NVIDIA CUDA toolkit to be installed on your system.
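When the feature is disabled, GPU-dependent code paths can fall back to the CPU at compile time. A minimal sketch of that standard Cargo feature-gating pattern (the function below is illustrative, not part of the PandRS API):

// Compiled only when built with `--features cuda`; a GPU kernel call would live here.
#[cfg(feature = "cuda")]
fn dot(a: &[f64], b: &[f64]) -> f64 {
    a.iter().zip(b).map(|(x, y)| x * y).sum() // placeholder for the GPU path
}

// CPU fallback compiled when the `cuda` feature is off.
#[cfg(not(feature = "cuda"))]
fn dot(a: &[f64], b: &[f64]) -> f64 {
    a.iter().zip(b).map(|(x, y)| x * y).sum()
}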

For distributed processing capabilities, add the distributed feature:

[dependencies]
pandrs = { version = "0.1.0-alpha.4", features = ["distributed"] }

Multiple features can be combined:

[dependencies]
pandrs = { version = "0.1.0-alpha.4", features = ["cuda", "distributed", "wasm"] }

Working with Missing Values (NA)

use pandrs::{NA, NASeries}; // assuming crate-root re-exports, as with DataFrame and Series

// Create a series with NA values
let data = vec![
    NA::Value(10),
    NA::Value(20),
    NA::NA,  // missing value
    NA::Value(40),
];
let series = NASeries::new(data, Some("values".to_string()))?;

// Handle NA values
println!("Number of NAs: {}", series.na_count());
println!("Number of values: {}", series.value_count());

// Drop and fill NA values
let dropped = series.dropna()?;
let filled = series.fillna(0)?;

Group Operations

use pandrs::{GroupBy, Series}; // assuming crate-root re-exports

// Data and group keys
let values = Series::new(vec![10, 20, 15, 30, 25], Some("values".to_string()))?;
let keys = vec!["A", "B", "A", "C", "B"];

// Group and aggregate
let group_by = GroupBy::new(
    keys.iter().map(|s| s.to_string()).collect(),
    &values,
    Some("by_category".to_string())
)?;

// Aggregation results
let sums = group_by.sum()?;
let means = group_by.mean()?;

Time Series Operations

use pandrs::temporal::{TimeSeries, date_range, Frequency};
use chrono::NaiveDate;
use std::str::FromStr; // brings NaiveDate::from_str into scope

// Generate date range
let dates = date_range(
    NaiveDate::from_str("2023-01-01")?,
    NaiveDate::from_str("2023-01-31")?,
    Frequency::Daily,
    true
)?;

// Create time series data (`values` is a numeric Series, as in the earlier examples)
let time_series = TimeSeries::new(values, dates, Some("daily_data".to_string()))?;

// Time filtering
let filtered = time_series.filter_by_time(
    &NaiveDate::from_str("2023-01-10")?,
    &NaiveDate::from_str("2023-01-20")?
)?;

// Calculate moving average
let moving_avg = time_series.rolling_mean(3)?;

// Resampling (convert to weekly)
let weekly = time_series.resample(Frequency::Weekly).mean()?;
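For intuition, rolling_mean(3) averages each window of three consecutive observations; the same arithmetic with plain std:

let vals = [10.0, 20.0, 30.0, 40.0, 50.0];
// windows(3) yields [10, 20, 30], [20, 30, 40], [30, 40, 50]
let avgs: Vec<f64> = vals.windows(3).map(|w| w.iter().sum::<f64>() / 3.0).collect();
assert_eq!(avgs, vec![20.0, 30.0, 40.0]);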

Distributed Processing (Experimental)

Using the DataFrame-Style API

use pandrs::{DataFrame, distributed::{DistributedConfig, ToDistributed}};

// Create a local DataFrame
let mut df = DataFrame::new();
df.add_column("id".to_string(), (0..10000).collect::<Vec<i64>>())?;
df.add_column("value".to_string(), (0..10000).map(|x| x as f64 * 1.5).collect::<Vec<f64>>())?;
df.add_column("group".to_string(), (0..10000).map(|x| (x % 100).to_string()).collect::<Vec<String>>())?;

// Configure distributed processing
let config = DistributedConfig::new()
    .with_executor("datafusion")  // Use DataFusion engine
    .with_concurrency(4);         // Use 4 threads

// Convert to distributed DataFrame
let dist_df = df.to_distributed(config)?;

// Define distributed operations (lazy execution)
let result = dist_df
    .filter("value > 5000.0")?
    .groupby(&["group"])?
    .aggregate(&["value"], &["mean"])?;

// Execute operations and get performance metrics
let executed = result.execute()?;
if let Some(metrics) = executed.execution_metrics() {
    println!("Execution time: {}ms", metrics.execution_time_ms());
    println!("Rows processed: {}", metrics.rows_processed());
}

// Check execution summary
println!("{}", executed.summarize()?);

// Collect results back to local DataFrame
let final_df = executed.collect_to_local()?;
println!("Result: {}", final_df);

// Write results directly to Parquet
executed.write_parquet("filtered_results.parquet")?;

Using the SQL-Style API with DistributedContext

use pandrs::{DataFrame, distributed::{DistributedContext, DistributedConfig}};

// Create a distributed context
let mut context = DistributedContext::new(
    DistributedConfig::new()
        .with_executor("datafusion")
        .with_concurrency(4)
)?;

// Register multiple DataFrames with the context
let customers = DataFrame::new(); // Create customers DataFrame
let orders = DataFrame::new();    // Create orders DataFrame
context.register_dataframe("customers", &customers)?;
context.register_dataframe("orders", &orders)?;

// Also register CSV or Parquet files directly
context.register_csv("products", "products.csv")?;
context.register_parquet("sales", "sales.parquet")?;

// Execute SQL queries against registered datasets
let result = context.sql_to_dataframe("
    SELECT c.customer_id, c.name, COUNT(o.order_id) as order_count, SUM(o.amount) as total_amount
    FROM customers c
    LEFT JOIN orders o ON c.customer_id = o.customer_id
    GROUP BY c.customer_id, c.name
    ORDER BY total_amount DESC
")?;

println!("Query results: {}", result);

// Execute query and write directly to Parquet
let metrics = context.sql_to_parquet("
    SELECT p.product_id, p.name, SUM(s.quantity) as total_sold
    FROM products p
    JOIN sales s ON p.product_id = s.product_id
    GROUP BY p.product_id, p.name
", "product_sales_summary.parquet")?;

println!("Query execution metrics:\n{}", metrics.format());

Using Window Functions for Analytics

use pandrs::{DataFrame, distributed::{DistributedConfig, ToDistributed, WindowFunctionExt}};
use pandrs::distributed_window; // Access window functions

// Create a DataFrame with time series data
let mut df = DataFrame::new();
// ... add columns with time series data ...

// Convert to distributed DataFrame
let dist_df = df.to_distributed(DistributedConfig::new())?;

// Add ranking by sales within each region
let ranked = dist_df.window(&[
    distributed_window::rank(
        "sales_rank",       // Output column
        &["region"],        // Partition by region
        &[("sales", false)] // Order by sales descending
    )
])?;

// Calculate running total of sales by date
let running_total = dist_df.window(&[
    distributed_window::cumulative_sum(
        "sales",          // Input column
        "running_total",  // Output column
        &["region"],      // Partition by region
        &[("date", true)] // Order by date ascending
    )
])?;

// Calculate 7-day moving average
let moving_avg = dist_df.window(&[
    distributed_window::rolling_avg(
        "sales",             // Input column
        "sales_7day_avg",    // Output column
        &[],                 // No partitioning
        &[("date", true)],   // Order by date
        7                    // Window size of 7 days
    )
])?;

// Or use SQL directly
let mut context = DistributedContext::new(DistributedConfig::new())?;
context.register_dataframe("sales", &df)?;

let result = context.sql_to_dataframe("
    SELECT
        date, region, product, sales,
        RANK() OVER (PARTITION BY region ORDER BY sales DESC) as sales_rank,
        SUM(sales) OVER (PARTITION BY region ORDER BY date) as running_total,
        AVG(sales) OVER (ORDER BY date ROWS BETWEEN 6 PRECEDING AND CURRENT ROW) as moving_avg_7day
    FROM sales
")?;

Using Expressions for Complex Data Transformations

use pandrs::{DataFrame, distributed::{DistributedConfig, ToDistributed, Expr, ColumnProjection}};

// Create a DataFrame with sales data
let mut df = DataFrame::new();
// ... add columns with sales data ...

// Convert to distributed DataFrame
let dist_df = df.to_distributed(DistributedConfig::new())?;

// Select columns with complex expressions
let selected = dist_df.select_expr(&[
    // Simple column selection
    ColumnProjection::column("region"),
    ColumnProjection::column("sales"),
    // Calculate a new column with expression
    ColumnProjection::with_alias(
        Expr::col("sales").mul(Expr::lit(1.1)),
        "sales_with_bonus"
    ),
    // Calculate profit margin percentage
    ColumnProjection::with_alias(
        Expr::col("profit").div(Expr::col("sales")).mul(Expr::lit(100.0)),
        "profit_margin"
    ),
])?;

// Filter using complex expressions
let high_margin = dist_df
    .filter_expr(
        Expr::col("profit")
            .div(Expr::col("sales"))
            .mul(Expr::lit(100.0))
            .gt(Expr::lit(15.0))
    )?;

// Create a user-defined function
// (UdfDefinition and ExprDataType are assumed to come from pandrs::distributed,
// like the other items imported above; param0 and param1 refer to the first and
// second UDF arguments, here sales and profit respectively.)
let commission_udf = UdfDefinition::new(
    "calculate_commission",
    ExprDataType::Float,
    vec![ExprDataType::Float, ExprDataType::Float],
    "CASE
        WHEN param1 / param0 > 0.2 THEN param0 * 0.05
        WHEN param1 / param0 > 0.1 THEN param0 * 0.03
        ELSE param0 * 0.01
     END"
);

// Register and use the UDF
let with_commission = dist_df
    .create_udf(&[commission_udf])?
    .select_expr(&[
        ColumnProjection::column("region"),
        ColumnProjection::column("sales"),
        ColumnProjection::column("profit"),
        ColumnProjection::with_alias(
            Expr::call("calculate_commission", vec![
                Expr::col("sales"),
                Expr::col("profit"),
            ]),
            "commission"
        ),
    ])?;

// Chain operations together
let final_analysis = dist_df
    // Add calculated columns
    .with_column("profit_pct",
        Expr::col("profit").div(Expr::col("sales")).mul(Expr::lit(100.0))
    )?
    // Filter for high-profit regions
    .filter_expr(
        Expr::col("profit_pct").gt(Expr::lit(12.0))
    )?
    // Project final columns with calculations
    .select_expr(&[
        ColumnProjection::column("region"),
        ColumnProjection::column("product"),
        ColumnProjection::column("profit_pct"),
        // Calculate a bonus amount
        ColumnProjection::with_alias(
            Expr::col("profit")
                .mul(Expr::lit(0.1))
                .add(Expr::col("profit_pct").mul(Expr::lit(5.0))),
            "bonus"
        ),
    ])?;

Statistical Analysis and Machine Learning Evaluation Functions

use pandrs::{DataFrame, Series, stats};
use pandrs::ml::metrics::regression::{mean_squared_error, r2_score};
use pandrs::ml::metrics::classification::{accuracy_score, f1_score};

// Descriptive statistics
let data = vec![1.0, 2.0, 3.0, 4.0, 5.0];
let stats_summary = stats::describe(&data)?;
println!("Mean: {}, Standard deviation: {}", stats_summary.mean, stats_summary.std);
println!("Median: {}, Quartiles: {} - {}", stats_summary.median, stats_summary.q1, stats_summary.q3);

// Calculate correlation coefficient
let x = vec![1.0, 2.0, 3.0, 4.0, 5.0];
let y = vec![2.0, 3.0, 4.0, 5.0, 6.0];
let correlation = stats::correlation(&x, &y)?;
println!("Correlation coefficient: {}", correlation);

// Run t-test
let sample1 = vec![1.0, 2.0, 3.0, 4.0, 5.0];
let sample2 = vec![2.0, 3.0, 4.0, 5.0, 6.0];
let alpha = 0.05; // significance level
let result = stats::ttest(&sample1, &sample2, alpha, true)?;
println!("t-statistic: {}, p-value: {}", result.statistic, result.pvalue);
println!("Significant difference: {}", result.significant);

// Regression analysis
let mut df = DataFrame::new();
df.add_column("x1".to_string(), Series::new(vec![1.0, 2.0, 3.0, 4.0, 5.0], Some("x1".to_string()))?)?;
df.add_column("x2".to_string(), Series::new(vec![2.0, 3.0, 4.0, 5.0, 6.0], Some("x2".to_string()))?)?;
df.add_column("y".to_string(), Series::new(vec![3.0, 5.0, 7.0, 9.0, 11.0], Some("y".to_string()))?)?;

let model = stats::linear_regression(&df, "y", &["x1", "x2"])?;
println!("Coefficients: {:?}", model.coefficients());
println!("Coefficient of determination: {}", model.r_squared());

// Machine learning model evaluation - regression metrics
let y_true = vec![3.0, 5.0, 2.5, 7.0, 10.0];
let y_pred = vec![2.8, 4.8, 2.7, 7.2, 9.8];

let mse = mean_squared_error(&y_true, &y_pred)?;
let r2 = r2_score(&y_true, &y_pred)?;
println!("MSE: {:.4}, RΒ²: {:.4}", mse, r2);

// Machine learning model evaluation - classification metrics
let true_labels = vec![true, false, true, true, false, false];
let pred_labels = vec![true, false, false, true, true, false];

let accuracy = accuracy_score(&true_labels, &pred_labels)?;
let f1 = f1_score(&true_labels, &pred_labels)?;
println!("Accuracy: {:.2}, F1 Score: {:.2}", accuracy, f1);

Pivot Tables and Grouping

use pandrs::pivot::AggFunction;

// Grouping and aggregation
let grouped = df.groupby("category")?;
let category_sum = grouped.sum(&["sales"])?;

// Pivot table
let pivot_result = df.pivot_table(
    "category",   // index column
    "region",     // column column
    "sales",      // value column
    AggFunction::Sum
)?;
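Conceptually, a Sum pivot buckets each row by its (index value, column value) pair and sums the value column. A std-only sketch of that aggregation (the data here is made up for illustration):

use std::collections::HashMap;

let rows = [("fruit", "east", 10.0), ("fruit", "west", 5.0), ("veg", "east", 7.0)];
let mut pivot: HashMap<(&str, &str), f64> = HashMap::new();
for (category, region, sales) in rows {
    // Each (category, region) cell accumulates the sum of `sales`.
    *pivot.entry((category, region)).or_insert(0.0) += sales;
}
assert_eq!(pivot[&("fruit", "east")], 10.0);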

Categorical Data Analysis

use pandrs::{DataFrame, stats};
use pandrs::series::{Series, Categorical};

// Create categorical data
let mut df = DataFrame::new();
df.add_column("gender".to_string(), Series::new(vec!["M", "F", "M", "F", "M"], Some("gender".to_string()))?)?;
df.add_column("response".to_string(), Series::new(vec!["Yes", "No", "Yes", "Yes", "No"], Some("response".to_string()))?)?;

// Convert columns to categorical
df.convert_to_categorical("gender")?;
df.convert_to_categorical("response")?;

// Create contingency table
let contingency = stats::contingency_table_from_df(&df, "gender", "response")?;
println!("Contingency table:\n{}", contingency);

// Chi-square test for independence
let chi2_result = stats::chi_square_independence(&df, "gender", "response", 0.05)?;
println!("Chi-square statistic: {}", chi2_result.chi2_statistic);
println!("p-value: {}", chi2_result.p_value);
println!("Significant association: {}", chi2_result.significant);

// Measure of association
let cramers_v = stats::cramers_v_from_df(&df, "gender", "response")?;
println!("Cramer's V: {}", cramers_v);

// Test association between categorical and numeric variables
df.add_column("score".to_string(), Series::new(vec![85, 92, 78, 95, 88], Some("score".to_string()))?)?;
let anova_result = stats::categorical_anova_from_df(&df, "gender", "score", 0.05)?;
println!("F-statistic: {}", anova_result.f_statistic);
println!("p-value: {}", anova_result.p_value);

Development Plan and Implementation Status

  • Basic DataFrame structure
  • Series implementation
  • Index functionality
  • CSV input/output
  • JSON input/output
  • Parquet format support
  • Missing value handling
  • Grouping operations
  • Time series data support
    • Date range generation
    • Time filtering
    • Moving average calculation
    • Frequency conversion (resampling)
  • Pivot tables
  • Complete implementation of join operations
    • Inner join (internal match)
    • Left join (left side priority)
    • Right join (right side priority)
    • Outer join (all rows)
  • Visualization functionality integration
    • Line graphs
    • Scatter plots
    • Text plot output
  • Parallel processing support
    • Parallel conversion of Series/NASeries
    • Parallel processing of DataFrames
    • Parallel filtering (1.15x speedup)
    • Parallel aggregation (3.91x speedup)
    • Parallel computation processing (1.37x speedup)
    • Adaptive parallel processing (automatic selection based on data size)
  • Enhanced visualization
    • Text-based plots with textplots (line, scatter)
    • High-quality graph output with plotters (PNG, SVG format)
    • Various graph types (line, scatter, bar, histogram, area)
    • Graph customization options (size, color, grid, legend)
    • Intuitive plot API for Series, DataFrame
  • Multi-level indexes
    • Hierarchical index structure
    • Data grouping by multiple levels
    • Level operations (swap, select)
  • Categorical data types
    • Memory-efficient encoding
    • Support for ordered and unordered categories
    • Complete integration with NA values (missing values)
  • Advanced DataFrame operations
    • Long-form and wide-form conversion (melt, stack, unstack)
    • Conditional aggregation
    • DataFrame concatenation
  • Memory usage optimization
    • String pool optimization (up to 89.8% memory reduction)
    • Categorical encoding (2.59x performance improvement)
    • Global string pool implementation
    • Improved memory locality with column-oriented storage
  • Python bindings
    • Python module with PyO3
    • Interoperability with numpy and pandas
    • Jupyter Notebook support
    • Speedup with string pool optimization (up to 3.33x)
  • Distributed processing enhancements
    • SQL-like API with DistributedContext
    • Window function support
    • Expression system for complex transformations
    • User-defined function registration and use
  • Lazy evaluation system
    • Operation optimization with computation graph
    • Operation fusion
    • Avoiding unnecessary intermediate results
  • Statistical analysis features
    • Descriptive statistics (mean, standard deviation, quantiles, etc.)
    • Correlation coefficient and covariance
    • Hypothesis testing (t-test)
    • Regression analysis (simple and multiple regression)
    • Sampling methods (bootstrap, etc.)
  • Machine learning evaluation metrics
    • Regression evaluation (MSE, MAE, RMSE, R² score)
    • Classification evaluation (accuracy, precision, recall, F1 score)
  • Codebase maintainability improvements
    • File separation of OptimizedDataFrame by functionality
    • API compatibility maintained through re-exports
    • Independent implementation of ML metrics module
  • GPU acceleration
    • CUDA integration for numerical operations
    • GPU-accelerated matrix operations
    • GPU-accelerated statistical functions
    • GPU-accelerated machine learning algorithms
    • Comprehensive benchmarking utility
    • Python bindings for GPU acceleration

Multi-level Index Operations

use pandrs::{DataFrame, MultiIndex};

// Create MultiIndex from tuples
let tuples = vec![
    vec!["A".to_string(), "a".to_string()],
    vec!["A".to_string(), "b".to_string()],
    vec!["B".to_string(), "a".to_string()],
    vec!["B".to_string(), "b".to_string()],
];

// Set level names
let names = Some(vec![Some("first".to_string()), Some("second".to_string())]);
let multi_idx = MultiIndex::from_tuples(tuples, names)?;

// Create DataFrame with MultiIndex
let mut df = DataFrame::with_multi_index(multi_idx);

// Add data
let data = vec!["data1".to_string(), "data2".to_string(), "data3".to_string(), "data4".to_string()];
df.add_column("data".to_string(), pandrs::Series::new(data, Some("data".to_string()))?)?;

// Level operations
let level0_values = multi_idx.get_level_values(0)?;
let level1_values = multi_idx.get_level_values(1)?;

// Swap levels
let swapped_idx = multi_idx.swaplevel(0, 1)?;

GPU Acceleration

use pandrs::gpu::{self, operations::{GpuMatrix, GpuVector}};
use pandrs::dataframe::gpu::DataFrameGpuExt;
use ndarray::{Array1, Array2};

// Initialize GPU
let device_status = gpu::init_gpu()?;
println!("GPU available: {}", device_status.available);

// Create matrices for GPU operations
let a_data = Array2::from_shape_vec((1000, 500),
    (0..500000).map(|i| i as f64).collect())?;
let b_data = Array2::from_shape_vec((500, 1000),
    (0..500000).map(|i| i as f64).collect())?;

// Create GPU matrices
let a = GpuMatrix::new(a_data);
let b = GpuMatrix::new(b_data);

// Perform GPU-accelerated matrix multiplication
let result = a.dot(&b)?;

// GPU-accelerated DataFrame operations
let mut df = DataFrame::new();
// Add data to DataFrame...

// Using GPU-accelerated correlation matrix
let corr_matrix = df.gpu_corr(&["col1", "col2", "col3"])?;

// GPU-accelerated linear regression
let model = df.gpu_linear_regression("y", &["x1", "x2", "x3"])?;

// GPU-accelerated PCA
let (components, explained_variance, transformed) = df.gpu_pca(&["x1", "x2", "x3"], 2)?;

// Benchmarking CPU vs GPU performance
let mut benchmark = gpu::benchmark::GpuBenchmark::new()?;
let matrix_multiply_result = benchmark.benchmark_matrix_multiply(1000, 1000, 1000)?;
println!("Matrix multiplication speedup: {}x", matrix_multiply_result.speedup.unwrap_or(0.0));

Python Binding Usage Examples

import pandrs as pr
import numpy as np
import pandas as pd

# Create optimized DataFrame
df = pr.OptimizedDataFrame()
df.add_int_column('A', [1, 2, 3, 4, 5])
df.add_string_column('B', ['a', 'b', 'c', 'd', 'e'])
df.add_float_column('C', [1.1, 2.2, 3.3, 4.4, 5.5])

# Traditional API compatible interface
df2 = pr.DataFrame({
    'A': [1, 2, 3, 4, 5],
    'B': ['a', 'b', 'c', 'd', 'e'],
    'C': [1.1, 2.2, 3.3, 4.4, 5.5]
})

# Interoperability with pandas
pd_df = df.to_pandas()  # Convert from PandRS to pandas DataFrame
pr_df = pr.OptimizedDataFrame.from_pandas(pd_df)  # Convert from pandas DataFrame to PandRS

# Using lazy evaluation
lazy_df = pr.LazyFrame(df)
result = lazy_df.filter('A').select(['B', 'C']).execute()

# Direct use of string pool
string_pool = pr.StringPool()
idx1 = string_pool.add("repeated_value")
idx2 = string_pool.add("repeated_value")  # Returns the same index
print(string_pool.get(idx1))  # Returns "repeated_value"

# GPU acceleration in Python
pr.gpu.init_gpu()  # Initialize GPU
gpu_df = df.gpu_accelerate()
corr_matrix = gpu_df.gpu_corr(['A', 'C'])
pca_result = gpu_df.gpu_pca(['A', 'C'], n_components=1)

# CSV input/output
df.to_csv('data.csv')
df_loaded = pr.OptimizedDataFrame.read_csv('data.csv')

# NumPy integration
series = df['A']
np_array = series.to_numpy()

# Jupyter Notebook support
from pandrs.jupyter import display_dataframe
display_dataframe(df, max_rows=10, max_cols=5)

Performance Optimization Results

The implementation of optimized column-oriented storage, lazy evaluation system, and GPU acceleration has achieved significant performance improvements:

Performance Comparison of Key Operations

| Operation | Traditional Implementation | Optimized Implementation | Speedup |
| Series/Column Creation | 198.446ms | 149.528ms | 1.33x |
| DataFrame Creation (1 million rows) | NA | NA | NA |
| Filtering | 596.146ms | 161.816ms | 3.68x |
| Group Aggregation | 544.384ms | 107.837ms | 5.05x |

GPU Acceleration Performance (vs CPU)

| Operation | Data Size | CPU Time | GPU Time | Speedup |
| Matrix Multiplication | 1000x1000 | 232.8 ms | 11.5 ms | 20.2x |
| Element-wise Addition | 2000x2000 | 18.6 ms | 2.3 ms | 8.1x |
| Correlation Matrix | 10000x10 | 89.4 ms | 12.1 ms | 7.4x |
| Linear Regression | 10000x10 | 124.3 ms | 18.7 ms | 6.6x |
| Rolling Window | 100000, window=100 | 58.2 ms | 9.8 ms | 5.9x |

String Processing Optimization

| Mode | Processing Time | vs Traditional | Notes |
| Legacy Mode | 596.50ms | 1.00x | Traditional implementation |
| Categorical Mode | 230.11ms | 2.59x | Categorical optimization |
| Optimized Implementation | 232.38ms | 2.57x | Optimizer selection |
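The gains above come from interning: each distinct string is stored once in a pool, and columns hold small integer indices into it (this also drives the up-to-89.8% memory reduction cited earlier). A minimal single-threaded illustration of the technique using std types; PandRS's actual pool is global and thread-safe, which this sketch is not:

use std::collections::HashMap;

struct StringPool {
    strings: Vec<String>,
    index: HashMap<String, u32>,
}

impl StringPool {
    fn new() -> Self {
        Self { strings: Vec::new(), index: HashMap::new() }
    }

    // Returns the existing index for a duplicate, so each distinct
    // string is stored exactly once.
    fn add(&mut self, s: &str) -> u32 {
        if let Some(&i) = self.index.get(s) {
            return i;
        }
        let i = self.strings.len() as u32;
        self.strings.push(s.to_string());
        self.index.insert(s.to_string(), i);
        i
    }

    fn get(&self, i: u32) -> &str {
        &self.strings[i as usize]
    }
}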

Parallel Processing Performance Improvements

| Operation | Serial Processing | Parallel Processing | Speedup |
| Group Aggregation | 696.85ms | 178.09ms | 3.91x |
| Filtering | 201.35ms | 175.48ms | 1.15x |
| Computation | 15.41ms | 11.23ms | 1.37x |
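The parallel speedups above follow the data-parallel iterator pattern provided by Rayon (listed in the dependencies below). A minimal sketch of a parallel aggregation; Rayon splits the slice across a work-stealing thread pool and combines the partial sums:

use rayon::prelude::*;

let data: Vec<f64> = (0..1_000_000).map(|i| i as f64).collect();
// par_iter() divides the work across threads; sum() merges the results.
let total: f64 = data.par_iter().sum();
println!("total = {total}");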

Python Bindings String Optimization

| Data Size | Unique Rate | Without Pool | With Pool | Processing Speedup | Memory Reduction |
| 100,000 rows | 1% (high duplication) | 82ms | 35ms | 2.34x | 88.6% |
| 1,000,000 rows | 1% (high duplication) | 845ms | 254ms | 3.33x | 89.8% |

🆕 Major Features

🔧 Enhanced DataFrame API

  • Column Management: New rename_columns() and set_column_names() methods for flexible DataFrame schema management
  • Series Operations: Enhanced name management with set_name() and with_name() methods
  • Type Conversions: Improved type conversion utilities like to_string_series() for Series
  • API Consistency: Fluent interface design across all DataFrame operations

🌐 Production-Ready Distributed Processing

  • DataFusion Integration: Complete integration with Apache DataFusion for scalable data processing
  • Schema Validation: Compile-time schema validation preventing runtime errors
  • Fault Tolerance: Checkpoint and recovery system for robust distributed operations
  • SQL Interface: Full SQL query support with complex JOIN, window functions, and UDFs

💾 Enhanced Data I/O

  • Real Data Extraction: Parquet and SQL I/O operations now use actual data instead of placeholders
  • Type Safety: Improved Arrow integration with proper null value handling
  • Performance: Optimized data conversion processes for better throughput

🐍 Complete Python Bindings

  • Feature Parity: All features available through Python bindings
  • Pandas Integration: Seamless interoperability with pandas DataFrames
  • Memory Optimization: String pool integration reducing Python memory overhead

  • Comprehensive Test Coverage

    • 26 new edge case tests covering boundary conditions, error handling, and invalid inputs
    • Stress testing for large datasets (100K+ rows) and concurrent operations
    • String pool concurrency testing with thread safety validation
    • Memory management and resource cleanup testing
    • Unicode and special character support validation
    • Null value handling with proper bitmask validation
  • Enhanced Parquet and SQL Support

    • Real data extraction in Parquet write operations (replacing dummy implementations)
    • Improved column data access patterns for all data types
    • Better Arrow integration with proper null value handling
    • Enhanced type safety in data conversion processes
  • GPU Acceleration Integration

    • CUDA-based acceleration for performance-critical operations
    • Up to 20x speedup for matrix operations
    • GPU-accelerated statistical functions and ML algorithms
    • Transparent CPU fallback when GPU is unavailable
    • Comprehensive benchmarking utility for CPU vs GPU performance comparison
    • Python bindings for GPU functionality
  • Specialized Categorical Data Statistics

    • Comprehensive contingency table implementation
    • Chi-square test for independence
    • Cramer's V measure of association
    • Categorical ANOVA for comparing means across categories
    • Entropy and mutual information calculations
    • Statistical summaries specific to categorical variables
  • Large-Scale Data Processing

    • Disk-based processing for datasets larger than available memory
    • Memory-mapped file support for efficient large data access
    • Chunked processing capabilities for scalable data operations
    • Spill-to-disk functionality when memory limits are reached
  • Streaming Data Support

    • Real-time data processing interfaces
    • Stream connectors for various data sources
    • Windowed operations on streaming data
    • Real-time analytics capabilities
  • Column-Oriented Storage Engine

    • Type-specialized column implementation (Int64Column, Float64Column, StringColumn, BooleanColumn); see the sketch after this list
    • Improved cache efficiency through memory locality
    • Operation acceleration and parallel processing efficiency
  • String Processing Optimization

    • Elimination of duplicate strings with global string pool
    • String to index conversion with categorical encoding
    • Consistent API design and multiple optimization modes
  • Lazy Evaluation System Implementation

    • Operation pipelining with computation graph
    • Avoiding unnecessary intermediate results
    • Improved efficiency through operation fusion
  • Significant Parallel Processing Improvements

    • Efficient multi-threading with Rayon
    • Adaptive parallel processing (automatic selection based on data size)
    • Chunk processing optimization
  • Enhanced Python Integration

    • Efficient data conversion between Python and Rust with string pool optimization
    • Utilization of NumPy buffer protocol
    • Near zero-copy data access
    • Type-specialized Python API
    • GPU acceleration support through Python bindings
  • Advanced DataFrame Operations

    • Complete implementation of long-form and wide-form conversion (melt, stack, unstack)
    • Enhanced conditional aggregation processing
    • Optimization of complex join operations
  • Enhanced Time Series Data Processing

    • Support for RFC3339 format date parsing
    • Complete implementation of advanced window operations
    • Support for complete format frequency specification (DAILY, WEEKLY, etc.)
    • GPU-accelerated time series operations
  • WebAssembly Support

    • Browser-based visualization capabilities
    • Interactive dashboard functionality
    • Theme customization options
    • Multiple visualization types support
  • Stability and Quality Improvements

    • Implementation of comprehensive test suite
    • Improved error handling and warning elimination
    • Enhanced documentation
    • Updated dependencies (Rust 2023 compatible)
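As referenced in the column-oriented storage item above, type-specialized columns keep each column in a contiguous, homogeneously typed buffer, which is what enables cache-friendly scans and efficient parallel kernels. A simplified sketch of the idea (the real Int64Column, Float64Column, StringColumn, and BooleanColumn types carry more metadata, such as null bitmasks):

enum Column {
    Int64(Vec<i64>),
    Float64(Vec<f64>),
    // A string column stores pool indices rather than owned strings.
    String(Vec<u32>),
    Boolean(Vec<bool>),
}

impl Column {
    fn len(&self) -> usize {
        match self {
            Column::Int64(v) => v.len(),
            Column::Float64(v) => v.len(),
            Column::String(v) => v.len(),
            Column::Boolean(v) => v.len(),
        }
    }
}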

High-Quality Visualization (Plotters Integration)

use pandrs::{DataFrame, Series};
use pandrs::vis::plotters_ext::{PlotSettings, PlotKind, OutputType};

// Create plot from a single Series
let values = vec![15.0, 23.5, 18.2, 29.8, 32.1, 28.5, 19.2];
let series = Series::new(values, Some("temperature_change".to_string()))?;

// Create line graph
let line_settings = PlotSettings {
    title: "Temperature Change Over Time".to_string(),
    x_label: "Time".to_string(),
    y_label: "Temperature (Β°C)".to_string(),
    plot_kind: PlotKind::Line,
    ..PlotSettings::default()
};
series.plotters_plot("temp_line.png", line_settings)?;

// Create histogram
let hist_settings = PlotSettings {
    title: "Histogram of Temperature Distribution".to_string(),
    plot_kind: PlotKind::Histogram,
    ..PlotSettings::default()
};
series.plotters_histogram("histogram.png", 5, hist_settings)?;

// Visualization using DataFrame
let mut df = DataFrame::new();
df.add_column("temperature".to_string(), series)?;
df.add_column("humidity".to_string(), 
    Series::new(vec![67.0, 72.3, 69.5, 58.2, 62.1, 71.5, 55.8], Some("humidity".to_string()))?)?;

// Scatter plot (relationship between temperature and humidity)
let xy_settings = PlotSettings {
    title: "Relationship Between Temperature and Humidity".to_string(),
    plot_kind: PlotKind::Scatter,
    output_type: OutputType::SVG,  // Output in SVG format
    ..PlotSettings::default()
};
df.plotters_xy("temperature", "humidity", "temp_humidity.svg", xy_settings)?;

// Multiple series line graph
let multi_settings = PlotSettings {
    title: "Weather Data Trends".to_string(),
    plot_kind: PlotKind::Line,
    ..PlotSettings::default()
};
df.plotters_multi(&["temperature", "humidity"], "multi_series.png", multi_settings)?;

Testing

Run the core library tests (no external dependencies):

cargo test --lib

Test core optimized features:

cargo test --features "test-core"

Test most features (excluding CUDA/WASM which require external tools):

cargo test --features "test-safe"

Test all features (requires CUDA toolkit and may have compilation issues):

cargo test --all-features

Notes:

  • The cuda feature requires NVIDIA CUDA toolkit installation
  • The visualization feature currently has compilation issues and is excluded from safe testing
  • Use test-core for reliable testing without external dependencies

Edge Case and Error Condition Testing

Run comprehensive edge case tests:

cargo test --test edge_cases_test

Run concurrency and thread safety tests:

cargo test --test concurrency_test

Run I/O error condition tests:

cargo test --test io_error_conditions_test

These tests cover:

  • Boundary Conditions: Empty data, single elements, extreme numeric values, NaN/infinity handling
  • Input Validation: Invalid column names, duplicate columns, mismatched lengths, out-of-bounds access
  • Resource Management: Large datasets (100K+ rows), string pool stress testing, memory cleanup
  • Concurrency: Multi-threaded access, string pool thread safety, race condition prevention
  • Type Safety: Type mismatches, column access patterns, error propagation
  • I/O Error Handling: File permissions, malformed data, disk space issues, encoding problems
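A boundary-condition test in this style might look like the following hypothetical sketch (the round-trip invariant and the crate-root re-exports are assumptions, not a verbatim test from the suite):

use pandrs::DataFrame;

#[test]
fn empty_dataframe_round_trips_csv() {
    let df = DataFrame::new();
    // Writing and re-reading an empty frame should not panic or corrupt state.
    df.to_csv("empty.csv").expect("write should succeed");
    let loaded = DataFrame::from_csv("empty.csv", true).expect("read should succeed");
    assert_eq!(format!("{loaded}"), format!("{df}"));
}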

Dependency Versions

Latest dependency versions:

[dependencies]
num-traits = "0.2.19"        # Numeric trait support
thiserror = "2.0.12"         # Error handling
serde = { version = "1.0.219", features = ["derive"] }  # Serialization
serde_json = "1.0.114"       # JSON processing
chrono = "0.4.38"            # Date and time processing (compatible with arrow ecosystem)
regex = "1.11.1"             # Regular expressions  
csv = "1.3.1"                # CSV processing
rayon = "1.10.0"             # Parallel processing
lazy_static = "1.5.0"        # Lazy initialization
rand = "0.9.0"               # Random number generation
tempfile = "3.8.1"           # Temporary files
textplots = "0.8.7"          # Text-based visualization
plotters = "0.3.7"           # High-quality visualization
chrono-tz = "0.9.0"          # Timezone processing (compatible with chrono 0.4.38)
parquet = "53.3.1"           # Parquet file support (compatible with arrow ecosystem)
arrow = "53.3.1"             # Arrow format support (stable version)
crossbeam-channel = "0.5.15" # For concurrent message passing
memmap2 = "0.9.5"            # For memory-mapped files
calamine = "0.23.1"          # For Excel reading
simple_excel_writer = "0.2.0" # For Excel writing

# Optional dependencies (feature-gated)
# CUDA support (requires CUDA toolkit installation)
cudarc = "0.10.0"            # CUDA bindings (requires NVIDIA CUDA toolkit)
half = "2.3.1"               # Half-precision floating point support
ndarray = "0.15.6"           # N-dimensional arrays

# Distributed processing
datafusion = "30.0.0"        # DataFrame processing engine (compatible with arrow 53.x)

# WebAssembly support
wasm-bindgen = "0.2.91"      # WebAssembly bindings
js-sys = "0.3.68"            # JavaScript interop
web-sys = "0.3.68"           # Web API bindings
plotters-canvas = "0.4.0"    # Canvas backend for plotters

License

PandRS is dual-licensed under either

  • Apache License, Version 2.0
  • MIT License

at your option.
