# Robin Sparkless
PySpark-style DataFrames in Rust—no JVM. A DataFrame library that mirrors PySpark’s API and semantics while using Polars as the execution engine. The same engine powers the Python package sparkless v4 (pip install ./python) — a drop-in PySpark replacement with no JVM.
## Why Robin Sparkless?
- **Familiar API** — `SparkSession`, `DataFrame`, `Column`, and PySpark-like functions so you can reuse patterns without the JVM.
- **Polars under the hood** — fast, native Rust execution with Polars for IO, expressions, and aggregations.
- **Persistence options** — global temp views (cross-session in-memory) and disk-backed `saveAsTable` via `spark.sql.warehouse.dir`.
- **Sparkless backend target** — designed to power Sparkless (the Python PySpark replacement) as a Rust execution engine.
## Features
| Area | What’s included |
|---|---|
| Core | SparkSession, DataFrame; lazy by default. Two expression APIs: (1) ExprIr (engine-agnostic): col, lit_i64, gt, when, … from crate root → filter_expr_ir, select_expr_ir, collect_rows, agg_expr_ir; (2) Column/Expr (Polars): prelude or functions → filter, with_column, select_exprs. Plus order_by, group_by, joins |
| IO | CSV, Parquet, JSON via SparkSession::read_* |
| Expressions | col(), lit(), when/then/otherwise, coalesce, cast, type/conditional helpers |
| Aggregates | count, sum, avg, min, max, and more; multi-column groupBy |
| Window | row_number, rank, dense_rank, lag, lead, first_value, last_value, and others with .over() |
| Arrays & maps | array_*, explode, create_map, map_keys, map_values, and related functions |
| Strings & JSON | String functions (upper, lower, substring, regexp_*, etc.), get_json_object, from_json, to_json |
| Datetime & math | Date/time extractors and arithmetic, year/month/day, math (sin, cos, sqrt, pow, …) |
| Optional SQL | spark.sql("SELECT ...") with temp views, global temp views (cross-session), and tables: createOrReplaceTempView, createOrReplaceGlobalTempView, table(name), table("global_temp.name"), df.write().saveAsTable(name, mode=...), spark.catalog().listTables() — enable with --features sql |
| Optional Delta | read_delta(path) or read_delta(table_name), read_delta_with_version, write_delta, write_delta_table(name) — enable with --features delta (path I/O); table-by-name works with sql only |
| UDFs | Pure-Rust UDFs registered in a session-scoped registry; see docs/UDF_GUIDE.md |
Parity: 200+ fixtures validated against PySpark. Known differences from PySpark are documented in docs/PYSPARK_DIFFERENCES.md. Out-of-scope items (XML, UDTF, streaming, RDD) are in docs/DEFERRED_SCOPE.md. Full parity status: docs/PARITY_STATUS.md.
## Installation

### Python (sparkless v4)
A Python package sparkless provides a PySpark-like API with no JVM, backed by robin-sparkless.
Install from PyPI:
```bash
pip install "sparkless>=4,<5"
```
Or from this repo (for developing the Rust backend alongside Python):
```bash
pip install ./python
```
See python/README.md for usage; the entry point is `from sparkless.sql import SparkSession`.
### Rust

Add to your `Cargo.toml`:

```toml
[dependencies]
robin-sparkless = "4"
```
Optional features:

```toml
robin-sparkless = { version = "4", features = ["sql"] }    # spark.sql(), temp views
robin-sparkless = { version = "4", features = ["delta"] }  # Delta Lake read/write
```
## Quick start

### Rust
Engine-agnostic (`ExprIr`) API — recommended for new code and embeddings; it uses only `EngineError` and robin-sparkless types:

```rust
use robin_sparkless::{col, lit_i64, gt, SparkSession};

fn main() -> Result<(), robin_sparkless::EngineError> {
    let spark = SparkSession::builder().app_name("demo").get_or_create();
    let df = spark.create_dataframe_engine(
        vec![
            (1, 25, "Alice".to_string()),
            (2, 30, "Bob".to_string()),
            (3, 35, "Charlie".to_string()),
        ],
        vec!["id", "age", "name"],
    )?;
    let adults = df.filter_expr_ir(gt(col("age"), lit_i64(26)))?;
    adults.show(Some(10)).map_err(robin_sparkless::to_engine_error)?;
    Ok(())
}
```
Column (Polars) API — the full PySpark-like API with `Column` and `Expr`:

```rust
use robin_sparkless::prelude::*;

fn main() -> Result<(), Box<dyn std::error::Error>> {
    let spark = SparkSession::builder().app_name("demo").get_or_create();
    let df = spark.create_dataframe(
        vec![
            (1, 25, "Alice".to_string()),
            (2, 30, "Bob".to_string()),
            (3, 35, "Charlie".to_string()),
        ],
        vec!["id", "age", "name"],
    )?;
    let adults = df.filter(col("age").gt(lit_i64(26).into_expr()).into_expr())?;
    adults.show(Some(10))?;
    Ok(())
}
```
Output (from `show`; run with `cargo run --example demo`):

```text
shape: (2, 3)
┌─────┬─────┬─────────┐
│ id  ┆ age ┆ name    │
│ --- ┆ --- ┆ ---     │
│ i64 ┆ i64 ┆ str     │
╞═════╪═════╪═════════╡
│ 2   ┆ 30  ┆ Bob     │
│ 3   ┆ 35  ┆ Charlie │
└─────┴─────┴─────────┘
```
You can also wrap an existing Polars DataFrame with `DataFrame::from_polars(polars_df)`. See docs/QUICKSTART.md for joins, window functions, and more.
## Embedding robin-sparkless in your app
Use the engine-agnostic `ExprIr` API and `*_engine()` methods so your public surface does not depend on Polars types. Import from the prelude for the full `Column` API, or from the crate root for `ExprIr`; session setup can optionally read configuration from the environment. Results can be returned as JSON for bindings or CLI tools.
```toml
[dependencies]
robin-sparkless = "4"
```

```rust
use robin_sparkless::{col, lit_i64, gt, SparkSession, SparklessConfig, to_engine_error};

fn main() -> Result<(), robin_sparkless::EngineError> {
    let config = SparklessConfig::from_env();
    let spark = SparkSession::from_config(&config);
    let df = spark.create_dataframe_engine(
        vec![
            (1i64, 10i64, "a".to_string()),
            (2i64, 20i64, "b".to_string()),
            (3i64, 30i64, "c".to_string()),
        ],
        vec!["id", "value", "label"],
    )?;
    let filtered = df.filter_expr_ir(gt(col("id"), lit_i64(1)))?;
    let json = filtered.to_json_rows()?;
    println!("{}", json);
    Ok(())
}
```
Example output (from the snippet above or `cargo run --example embed_readme`; JSON key order may vary):

```json
[{"id":2,"value":20,"label":"b"},{"id":3,"value":30,"label":"c"}]
```
Run the embed_basic example: `cargo run --example embed_basic`. For a minimal FFI surface, use `robin_sparkless::prelude::embed` and the ExprIr API: `create_dataframe_engine`, `filter_expr_ir`, `select_expr_ir`, `collect_rows`, `agg_expr_ir`, plus schema helpers (`StructType::to_json`, `schema_from_json`). Convert Polars errors with `to_engine_error`. See docs/EMBEDDING.md.
## Development
Prerequisites: Rust (see rust-toolchain.toml).
This repository is a Cargo workspace. The main library is robin-sparkless (the facade); most users depend only on it. The workspace also includes robin-sparkless-core (engine-agnostic types, expression IR, config, error; no Polars) and robin-sparkless-polars (Polars backend: Column, functions, UDFs). These are publishable for advanced or minimal-use cases. make check and CI build the whole workspace.
| Command | Description |
|---|---|
| `cargo build` | Build (Rust only) |
| `cargo build --workspace --all-features` | Build all workspace crates with optional features |
| `cargo test` | Run Rust tests |
| `make test` | Run Rust tests (wrapper for `cargo test --workspace`) |
| `make check` | Rust-only checks: runs fmt-check, audit, deny, and clippy in parallel (`-j4`), then tests |
| `make check-full` | Full Rust check suite (what CI runs): clean, then fmt-check/clippy/audit/deny in parallel, then tests |
| `make clean` | Remove `target/` (e.g. to free disk without running check; `check-full` already cleans before each run so binaries don't accumulate) |
| `make fmt` | Format Rust code (run before `check` if you want to fix formatting) |
| `make test-parity-phase-a` … `make test-parity-phase-g` | Run parity fixtures for a specific phase (see PARITY_STATUS) |
| `make test-parity-phases` | Run all parity phases (A–G) via the parity harness |
| `make sparkless-parity` | When `SPARKLESS_EXPECTED_OUTPUTS` is set and PySpark/Java are available, convert Sparkless fixtures, regenerate expected outputs from PySpark, and run Rust parity tests |
| `cargo bench` | Benchmarks (robin-sparkless vs Polars) |
| `cargo doc --open` | Build and open API docs |
CI runs format, clippy, audit, deny, Rust tests, and parity tests on push/PR (see .github/workflows/ci.yml).
## Documentation
| Resource | Description |
|---|---|
| Python package | Sparkless v4 — install from PyPI (pip install sparkless) or pip install ./python, quick start, Sparkless 3 vs 4.x, API overview |
| Read the Docs | Full docs: quickstart, Rust usage, Python getting started, Sparkless integration (MkDocs) |
| docs.rs | Rust API reference |
| QUICKSTART | Build, usage, optional features, benchmarks |
| User Guide | Everyday usage (Rust) |
| Persistence Guide | Global temp views, disk-backed saveAsTable |
| UDF Guide | Scalar, vectorized, and grouped UDFs |
| PySpark Differences | Known divergences |
| Roadmap | Development phases, Sparkless integration |
| RELEASING | Publishing to crates.io and PyPI |
See CHANGELOG.md for version history.
## License
MIT