
Robin Sparkless

PySpark-style DataFrames in Rust, no JVM. A DataFrame library that mirrors PySpark's API and semantics while using Polars as the execution engine. The same engine powers the Python package sparkless v4 (pip install ./python), a drop-in PySpark replacement.



Why Robin Sparkless?

  • Familiar API — SparkSession, DataFrame, Column, and PySpark-like functions so you can reuse patterns without the JVM.
  • Polars under the hood — Fast, native Rust execution with Polars for IO, expressions, and aggregations.
  • Persistence options — Global temp views (cross-session in-memory) and disk-backed saveAsTable via spark.sql.warehouse.dir.
  • Sparkless backend target — Designed to power Sparkless (the Python PySpark replacement) as a Rust execution engine.

Features

Area — What's included
Core — SparkSession, DataFrame; lazy by default. Two expression APIs: (1) ExprIr (engine-agnostic): col, lit_i64, gt, when, … from the crate root → filter_expr_ir, select_expr_ir, collect_rows, agg_expr_ir; (2) Column/Expr (Polars): prelude or functions → filter, with_column, select_exprs. Plus order_by, group_by, joins.
IO — CSV, Parquet, JSON via SparkSession::read_*
Expressions — col(), lit(), when/then/otherwise, coalesce, cast, type/conditional helpers
Aggregates — count, sum, avg, min, max, and more; multi-column groupBy
Window — row_number, rank, dense_rank, lag, lead, first_value, last_value, and others with .over()
Arrays & maps — array_*, explode, create_map, map_keys, map_values, and related functions
Strings & JSON — string functions (upper, lower, substring, regexp_*, etc.), get_json_object, from_json, to_json
Datetime & math — date/time extractors and arithmetic, year/month/day, math (sin, cos, sqrt, pow, …)
Optional SQL — spark.sql("SELECT ...") with temp views, global temp views (cross-session), and tables: createOrReplaceTempView, createOrReplaceGlobalTempView, table(name), table("global_temp.name"), df.write().saveAsTable(name, mode=...), spark.catalog().listTables(). Enable with --features sql; see the sketch after this table.
Optional Delta — read_delta(path) or read_delta(table_name), read_delta_with_version, write_delta, write_delta_table(name). Enable with --features delta (path I/O); reading a table by name also requires the sql feature.
UDFs — pure-Rust UDFs registered in a session-scoped registry; see docs/UDF_GUIDE.md
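
As a rough illustration of the optional SQL path, here is a minimal sketch. It assumes the Rust counterpart of createOrReplaceTempView is the snake_case create_or_replace_temp_view (the camelCase names above are the PySpark-side spellings) and that spark.sql returns a DataFrame result; check docs.rs for the exact signatures, and build with --features sql:

use robin_sparkless::prelude::*;

fn main() -> Result<(), Box<dyn std::error::Error>> {
    let spark = SparkSession::builder().app_name("sql-demo").get_or_create();
    let df = spark.create_dataframe(
        vec![(1, 25, "Alice".to_string()), (2, 30, "Bob".to_string())],
        vec!["id", "age", "name"],
    )?;
    // Assumed snake_case counterpart of createOrReplaceTempView.
    df.create_or_replace_temp_view("people")?;
    // spark.sql is named in the feature table; the query runs against the view.
    let adults = spark.sql("SELECT name, age FROM people WHERE age > 26")?;
    adults.show(Some(10))?;
    Ok(())
}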

Parity: 200+ fixtures validated against PySpark. Known differences from PySpark are documented in docs/PYSPARK_DIFFERENCES.md. Out-of-scope items (XML, UDTF, streaming, RDD) are in docs/DEFERRED_SCOPE.md. Full parity status: docs/PARITY_STATUS.md.


Installation

Python (sparkless v4)

A Python package sparkless provides a PySpark-like API with no JVM, backed by robin-sparkless.

Install from PyPI:

pip install "sparkless>=4,<5"

Or from this repo (for developing the Rust backend alongside Python):

pip install ./python

See python/README.md for usage; the entry point is from sparkless.sql import SparkSession.

Rust

Add to your Cargo.toml:

[dependencies]
robin-sparkless = "4"

Optional features:

robin-sparkless = { version = "4", features = ["sql"] }   # spark.sql(), temp views
robin-sparkless = { version = "4", features = ["delta"] }  # Delta Lake read/write

Quick start

Rust

Engine-agnostic (ExprIr) API — recommended for new code and embeddings; uses only EngineError and robin-sparkless types:

use robin_sparkless::{col, lit_i64, gt, SparkSession};

fn main() -> Result<(), robin_sparkless::EngineError> {
    let spark = SparkSession::builder().app_name("demo").get_or_create();
    let df = spark.create_dataframe_engine(
        vec![
            (1, 25, "Alice".to_string()),
            (2, 30, "Bob".to_string()),
            (3, 35, "Charlie".to_string()),
        ],
        vec!["id", "age", "name"],
    )?;
    let adults = df.filter_expr_ir(&gt(col("age"), lit_i64(26)))?;
    adults.show(Some(10)).map_err(robin_sparkless::to_engine_error)?;
    Ok(())
}
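
Continuing inside main from the snippet above: projection and row materialization go through select_expr_ir and collect_rows, which the features table lists alongside filter_expr_ir. A hypothetical sketch, where select_expr_ir taking a slice of ExprIr and collect_rows returning an in-memory Vec of rows are assumptions (see docs.rs for the real signatures):

    // Hypothetical continuation: project two columns, then materialize rows.
    // The slice argument and Vec return shape are assumptions, not confirmed API.
    let names = adults.select_expr_ir(&[col("name"), col("age")])?;
    let rows = names.collect_rows()?;
    println!("{} adults", rows.len());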

Column (Polars) API — full PySpark-like API with Column and Expr:

use robin_sparkless::prelude::*;

fn main() -> Result<(), Box<dyn std::error::Error>> {
    let spark = SparkSession::builder().app_name("demo").get_or_create();
    let df = spark.create_dataframe(
        vec![
            (1, 25, "Alice".to_string()),
            (2, 30, "Bob".to_string()),
            (3, 35, "Charlie".to_string()),
        ],
        vec!["id", "age", "name"],
    )?;
    let adults = df.filter(col("age").gt(lit_i64(26).into_expr()).into_expr())?;
    adults.show(Some(10))?;
    Ok(())
}

Output (from show; run with cargo run --example demo):

shape: (2, 3)
┌─────┬─────┬─────────┐
│ id  ┆ age ┆ name    │
│ --- ┆ --- ┆ ---     │
│ i64 ┆ i64 ┆ str     │
╞═════╪═════╪═════════╡
│ 2   ┆ 30  ┆ Bob     │
│ 3   ┆ 35  ┆ Charlie │
└─────┴─────┴─────────┘
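
Building on the Column example, with_column and order_by also appear in the features table. A hypothetical continuation inside main, with both signatures assumed rather than confirmed (check docs.rs before relying on them):

    // Hypothetical: derive a boolean column, then sort by age.
    // with_column("name", expr) and order_by(columns, ascending) are guesses
    // based on the feature list, not confirmed signatures.
    let flagged = adults
        .with_column("over_30", col("age").gt(lit_i64(30).into_expr()).into_expr())?
        .order_by(vec!["age"], vec![true])?;
    flagged.show(Some(10))?;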

You can also wrap an existing Polars DataFrame with DataFrame::from_polars(polars_df). See docs/QUICKSTART.md for joins, window functions, and more.
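
For instance, a minimal sketch of wrapping an existing Polars frame, assuming your crate depends on polars directly and that from_polars is a plain constructor (df! is the standard Polars row-construction macro):

use polars::prelude::*;
use robin_sparkless::DataFrame as RsDataFrame;

fn wrap_example() -> Result<(), Box<dyn std::error::Error>> {
    // df! builds a native Polars DataFrame; from_polars (named above) wraps it.
    // Treating from_polars as an infallible constructor is an assumption.
    let pdf = df!["id" => [1i64, 2, 3], "name" => ["a", "b", "c"]]?;
    let sdf = RsDataFrame::from_polars(pdf);
    sdf.show(Some(3))?;
    Ok(())
}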

Embedding robin-sparkless in your app

Use the engine-agnostic ExprIr API and the *_engine() methods so that your public surface does not depend on Polars types. Import from the crate root for ExprIr (use the prelude if you want the full Column API), and read optional session configuration from the environment. Results can be returned as JSON, which is convenient for bindings and CLI tools.

[dependencies]
robin-sparkless = "4"

use robin_sparkless::{col, lit_i64, gt, SparkSession, SparklessConfig, to_engine_error};

fn main() -> Result<(), robin_sparkless::EngineError> {
    let config = SparklessConfig::from_env();
    let spark = SparkSession::from_config(&config);

    let df = spark.create_dataframe_engine(
        vec![
            (1i64, 10i64, "a".to_string()),
            (2i64, 20i64, "b".to_string()),
            (3i64, 30i64, "c".to_string()),
        ],
        vec!["id", "value", "label"],
    )?;
    let filtered = df.filter_expr_ir(&gt(col("id"), lit_i64(1)))?;
    let json = filtered.to_json_rows()?;
    println!("{}", json);
    Ok(())
}

Example output (from the snippet above or cargo run --example embed_readme; JSON key order may vary):

[{"id":2,"value":20,"label":"b"},{"id":3,"value":30,"label":"c"}]

Run the embed_basic example: cargo run --example embed_basic. For a minimal FFI surface, use robin_sparkless::prelude::embed and the ExprIr API: create_dataframe_engine, filter_expr_ir, select_expr_ir, collect_rows, agg_expr_ir, plus schema helpers (StructType::to_json, schema_from_json). Convert Polars errors with to_engine_error. See docs/EMBEDDING.md.

Development

Prerequisites: Rust (see rust-toolchain.toml).

This repository is a Cargo workspace. The main library is robin-sparkless (the facade); most users depend only on it. The workspace also includes robin-sparkless-core (engine-agnostic types, expression IR, config, error; no Polars) and robin-sparkless-polars (Polars backend: Column, functions, UDFs). These are publishable for advanced or minimal-use cases. make check and CI build the whole workspace.

Command — Description
cargo build — Build (Rust only)
cargo build --workspace --all-features — Build all workspace crates with optional features
cargo test — Run Rust tests
make test — Run Rust tests (wrapper for cargo test --workspace)
make check — Rust-only checks: runs fmt-check, audit, deny, and clippy in parallel (-j4), then the Rust tests.
make check-full — Full Rust check suite (what CI runs): clean, then fmt-check/clippy/audit/deny in parallel, then tests.
make clean — Remove target/ (e.g. to free disk without running check; check-full already cleans before each run so binaries don't accumulate).
make fmt — Format Rust code (run before check if you want to fix formatting).
make test-parity-phase-a … make test-parity-phase-g — Run parity fixtures for a specific phase (see PARITY_STATUS).
make test-parity-phases — Run all parity phases (A–G) via the parity harness.
make sparkless-parity — When SPARKLESS_EXPECTED_OUTPUTS is set and PySpark/Java are available, convert Sparkless fixtures, regenerate expected outputs from PySpark, and run the Rust parity tests.
cargo bench — Benchmarks (robin-sparkless vs Polars)
cargo doc --open — Build and open API docs

CI runs format, clippy, audit, deny, Rust tests, and parity tests on push/PR (see .github/workflows/ci.yml).


Documentation

Resource — Description
Python package — Sparkless v4: install from PyPI (pip install sparkless) or pip install ./python, quick start, Sparkless 3 vs 4.x, API overview
Read the Docs — Full docs: quickstart, Rust usage, Python getting started, Sparkless integration (MkDocs)
docs.rs — Rust API reference
QUICKSTART — Build, usage, optional features, benchmarks
User Guide — Everyday usage (Rust)
Persistence Guide — Global temp views, disk-backed saveAsTable
UDF Guide — Scalar, vectorized, and grouped UDFs
PySpark Differences — Known divergences
Roadmap — Development phases, Sparkless integration
RELEASING — Publishing to crates.io and PyPI

See CHANGELOG.md for version history.


License

MIT
