rds2rust

A pure Rust library for reading and writing R's RDS (R Data Serialization) files without requiring an R runtime. Inspired by rds2cpp, which provides similar functionality with a C++ implementation.


Features

  • Pure Rust implementation - No R runtime required
  • Broad RDS format support - Reads and writes core R object types
  • Memory efficient - Optimized with string interning, compact attributes, and object deduplication
  • Automatic compression - Transparent gzip compression/decompression
  • Type safe - Strong Rust types for all R objects
  • Zero-copy where possible - Efficient parsing and serialization
  • Thread-aware - Use into_concrete_deep() before sharing parsed objects across threads

Supported R Types

  • Primitive types: NULL, integers, doubles, logicals, characters, raw bytes, complex numbers
  • Collections: vectors, lists, pairlists, expression vectors
  • Data structures: data frames, matrices, factors (ordered and unordered)
  • Object-oriented: S3 objects, S4 objects with slots
  • Language objects: formulas, unevaluated expressions, function calls
  • Functions: closures, environments, promises, special/builtin functions
  • Advanced: reference tracking (REFSXP), ALTREP compact sequences

Installation

Add this to your Cargo.toml:

[dependencies]
rds2rust = "0.1"

Quick Start

Reading an RDS file

use rds2rust::{read_rds, RObject};
use std::fs;

fn main() -> Result<(), Box<dyn std::error::Error>> {
    // Read RDS file (automatically decompresses if gzipped)
    let data = fs::read("data.rds")?;
    let result = read_rds(&data)?;
    let obj = result.object;

    // Pattern match on R object type
    match obj {
        RObject::DataFrame(df) => {
            println!("Data frame with {} columns", df.columns.len());

            // Access a specific column
            if let Some(RObject::Real(values)) = df.columns.get("temperature") {
                println!("Temperature values: {:?}", values);
            }
        }
        RObject::Integer(vec) => {
            println!("Integer vector: {:?}", vec);
        }
        _ => println!("Other R object type"),
    }

    for warning in result.warnings {
        eprintln!("Warning: {}", warning);
    }

    Ok(())
}

Writing an RDS file

use rds2rust::{write_rds, RObject, VectorData};
use std::fs;
use std::sync::Arc;

fn main() -> Result<(), Box<dyn std::error::Error>> {
    // Create an R object (e.g., a character vector)
    let obj = RObject::Character(VectorData::Owned(vec![
        Arc::from("hello"),
        Arc::from("world"),
    ]));

    // Serialize to RDS format (automatically gzip compressed)
    let rds_data = write_rds(&obj)?;

    // Write to file
    fs::write("output.rds", rds_data)?;

    Ok(())
}

Streaming RDS writes (native)

For large outputs, stream directly to a Write sink to avoid buffering the whole file in memory.

use rds2rust::{write_rds_streaming, write_rds_atomic, RObject, VectorData};
use std::fs::File;
use std::io::BufWriter;
use std::sync::Arc;

fn main() -> Result<(), Box<dyn std::error::Error>> {
    let obj = RObject::Character(VectorData::Owned(vec![
        Arc::from("hello"),
        Arc::from("streaming"),
    ]));

    // Stream to a file (gzip compressed)
    let file = File::create("output.rds")?;
    write_rds_streaming(&obj, BufWriter::new(file))?;

    // Or write atomically (safe replace on success)
    write_rds_atomic(&obj, "output.rds")?;

    Ok(())
}

Working with Data Frames

use rds2rust::{read_rds, RObject};

// Read a data frame
let data = std::fs::read("iris.rds")?;
let result = read_rds(&data)?;
let obj = result.object;

if let RObject::DataFrame(df) = obj {
    // Access columns by name
    let sepal_length = df.columns.get("Sepal.Length");
    let species = df.columns.get("Species");

    // Access row names
    println!("First row name: {}", df.row_names[0]);

    // Iterate over columns
    for (name, values) in &df.columns {
        println!("Column: {}", name);
    }
}

Large-File Extraction (Streaming-Oriented)

For very large files, you can extract vectors without materializing the whole object in memory. The rds-extract CLI writes one file per vector plus an optional JSON manifest.

WASM Support (Large Files)

The WASM path uses async input, a Blob-backed chunk source, and worker-friendly helpers. Decompression uses a size-based strategy:

  • <500MB: in-memory buffer
  • 500MB–10GB: Blob-backed chunked reads
  • >10GB: streaming mode (sequential)
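
As a sketch, the size thresholds above can be captured in a small selector. The enum and function names here are illustrative only, not part of the rds2rust API:

```rust
// Illustrative only: these names are not part of the rds2rust API.
#[derive(Debug, PartialEq)]
pub enum DecompressionStrategy {
    InMemory,    // < 500 MB: buffer the whole file in memory
    BlobChunked, // 500 MB - 10 GB: Blob-backed chunked reads
    Streaming,   // > 10 GB: sequential streaming mode
}

pub fn pick_strategy(file_size_bytes: u64) -> DecompressionStrategy {
    const MB: u64 = 1024 * 1024;
    const GB: u64 = 1024 * MB;
    if file_size_bytes < 500 * MB {
        DecompressionStrategy::InMemory
    } else if file_size_bytes <= 10 * GB {
        DecompressionStrategy::BlobChunked
    } else {
        DecompressionStrategy::Streaming
    }
}

fn main() {
    // A 100 MB file fits comfortably in memory.
    assert_eq!(pick_strategy(100 * 1024 * 1024), DecompressionStrategy::InMemory);
    // A 2 GB file uses Blob-backed chunked reads.
    assert_eq!(pick_strategy(2 * 1024 * 1024 * 1024), DecompressionStrategy::BlobChunked);
    println!("strategy selection ok");
}
```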

See docs/wasm_decompression.md for the JS helper, worker wrapper, and validation targets.

Gzip-compressed .rds.gz files are auto-detected in the WASM helper (browser support required: Chrome/Edge 89+, Firefox 102+, Safari 16.4+). Unsupported formats (bzip2/xz) return helpful errors.

WASM Streaming Decompression (Rust API)

For memory-efficient parsing of compressed RDS files in wasm32, use the Rust streaming API that automatically detects compression format and chooses the optimal parsing strategy:

use rds2rust::{
    check_streaming_decompression_support, traverse_rds_blob_streaming, ParseConfig, RdsVisitor,
};
use wasm_bindgen::JsValue;
use web_sys::Blob;

async fn parse_blob<V: RdsVisitor>(blob: Blob, visitor: &mut V) -> Result<(), JsValue> {
    check_streaming_decompression_support()
        .map_err(|msg| JsValue::from_str(&msg))?;
    traverse_rds_blob_streaming(blob, ParseConfig::default(), visitor)
        .await
        .map_err(|err| JsValue::from_str(&format!("{:?}", err)))
}

Memory Efficiency:

  • Gzip files: Uses DecompressionStream API with bounded buffer (64-128MB)
  • Uncompressed files: Uses cached random-access reads
  • Unsupported formats (xz/bzip2): Clear error with fallback instructions

Browser Requirements:

  • DecompressionStream API (Chrome 89+, Firefox 102+, Safari 16.4+)
  • For older browsers, use decompressBlobIfNeeded() to pre-decompress

Progress Reporting:

use rds2rust::{
    traverse_rds_blob_streaming_with_progress, ParseConfig, RdsVisitor, StreamingProgress,
};
use wasm_bindgen::JsValue;
use web_sys::Blob;

async fn parse_with_progress<V: RdsVisitor>(
    blob: Blob,
    visitor: &mut V,
) -> Result<(), JsValue> {
    let mut on_progress = |progress: StreamingProgress| {
        if let Some(total) = progress.total_bytes {
            let pct = 100.0 * progress.bytes_read as f64 / total as f64;
            web_sys::console::log_1(
                &format!("Progress: {} bytes ({:.1}%)", progress.bytes_read, pct).into(),
            );
        } else {
            web_sys::console::log_1(&format!("Progress: {} bytes", progress.bytes_read).into());
        }
    };

    traverse_rds_blob_streaming_with_progress(
        blob,
        ParseConfig::default(),
        visitor,
        &mut on_progress,
    )
    .await
    .map_err(|err| JsValue::from_str(&format!("{:?}", err)))
}

WASM Streaming Writer (Rust API)

WASM exposes chunked writer helpers that avoid large allocations in Rust. These emit Uint8Array chunks to a JS callback.

use js_sys::{Function, Uint8Array};
use rds2rust::{recommended_chunk_size_mb, write_rds_with_callback, RObject};
use wasm_bindgen::prelude::*;
use wasm_bindgen::JsCast;

fn write_with_callback(obj: &RObject) -> Result<(), JsValue> {
    let chunk_size_mb = Some(recommended_chunk_size_mb());
    let callback = Closure::wrap(Box::new(move |chunk: Uint8Array| {
        // Handle each chunk (e.g. push into a JS array)
        let _ = chunk;
    }) as Box<dyn FnMut(Uint8Array)>);

    let callback_fn: Function = callback.as_ref().unchecked_ref::<Function>().clone();
    write_rds_with_callback(obj, callback_fn, chunk_size_mb)
        .map_err(|err| JsValue::from_str(&format!("{:?}", err)))?;
    callback.forget();
    Ok(())
}

Progress callback reports bytes written (not percent):

// Use write_rds_with_progress(...) for byte count updates.

WASM Gzip Support

Format        Extension             Status
gzip          .rds.gz, .rds.gzip    Supported
uncompressed  .rds                  Supported
bzip2         .rds.bz2              Unsupported
xz            .rds.xz               Unsupported

CLI

rds-extract data.rds out/ data.matrix meta.data --budget-mb 512 --manifest manifest.json
rds-extract data.rds out/ --object-path data --manifest manifest.json
rds-extract data.rds out/ --object-kind dataframe --object-path data
rds-extract convert data.rds out/ --object-kind dataframe --object-path data
rds-extract convert data.rds out/ --object-kind dataframe --chunked
rds-extract convert data.rds out/ --object-kind sparse-matrix --object-path data.matrix --chunked --chunk-size-mb 4

If no paths are provided, the root object is extracted.

  • --object-path expands higher-level objects (data frames, dense matrices, sparse matrices, lists) into their component vectors. When field names contain dots (e.g., slot.value), use quoted segments: data["slot.value"].
  • --object-kind enforces the expected object type and emits a clearer error on mismatch.
  • --chunked avoids mapping the full decompressed stream into memory, trading some performance for a lower steady-state memory footprint on huge files.
  • Streaming is the default: it avoids materializing large lazy vectors by streaming spans directly from the backing store. Use --no-streaming to force materialization if needed. Streaming works best with --chunked, which avoids mmap'ing large decompressed streams.
  • --chunk-size-mb caps the per-read buffer size when streaming.

Streaming Traversal API

Use traverse_rds_streaming (sync) or traverse_rds_streaming_with_progress to walk an RDS stream without materializing large vectors. Implement RdsVisitor to receive events:

  • on_object_start / on_object_end for object boundaries
  • on_vector_metadata for vector length/kind
  • on_vector_chunk_available for lazy vector spans
  • on_shared_reference for REFSXP references (target path may be None)
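
To make the event flow concrete, here is a minimal stand-in for the visitor pattern. The trait below is a local simplification with hypothetical signatures; the crate's actual RdsVisitor methods carry richer metadata than shown:

```rust
// Simplified local stand-in for the visitor trait: the method names mirror
// the events above, but the real signatures carry more metadata.
trait Visitor {
    fn on_object_start(&mut self) {}
    fn on_object_end(&mut self) {}
    fn on_vector_metadata(&mut self, length: usize) {
        let _ = length;
    }
}

/// Counts objects and total vector elements seen during a traversal.
#[derive(Default)]
struct Counter {
    objects: usize,
    elements: usize,
}

impl Visitor for Counter {
    fn on_object_start(&mut self) {
        self.objects += 1;
    }
    fn on_vector_metadata(&mut self, length: usize) {
        self.elements += length;
    }
}

/// Simulates the event stream a traversal might emit for a two-column list.
fn count_demo() -> (usize, usize) {
    let mut counter = Counter::default();
    counter.on_object_start(); // outer list
    counter.on_vector_metadata(10); // first column
    counter.on_vector_metadata(10); // second column
    counter.on_object_end();
    (counter.objects, counter.elements)
}

fn main() {
    assert_eq!(count_demo(), (1, 20));
    println!("visitor demo ok");
}
```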

Notes:

  • ALTREP metadata is best-effort: compact sequences and wrapped vectors emit estimated length; other forms only report attributes.
  • Singleton environment markers (global/base/empty/unbound) are treated as leaf nodes in streaming.

WASM Extraction APIs (Rust API)

WASM builds expose Rust helpers that return JsValue (typed arrays) or call a callback per chunk:

  • extract_vector_to_js(obj, source, path) -> JsValue
  • extract_vector_chunked(obj, source, path, chunk_size, callback)

Raw Dump Format

Each output file contains:

  • Header: RDS2VEC1 + version + kind + endian + reserved + length(u64) + elem_size(u32)
  • Payload:
    • Numeric/logical/complex/raw: element bytes (big-endian for numeric types)
    • Character: repeated records of i32 length + UTF-8 bytes
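
For illustration, the header layout can be encoded and decoded as below. Note that the byte widths of the version/kind/endian/reserved fields and the endianness of the header integers are assumptions here (one byte each, little-endian), not a normative spec:

```rust
// Assumed layout (illustration only): 8-byte magic, then one byte each for
// version/kind/endian/reserved, then length (u64) and elem_size (u32), both
// little-endian. The real .rdsvec header may differ in field widths.
const MAGIC: &[u8; 8] = b"RDS2VEC1";

fn encode_header(version: u8, kind: u8, endian: u8, length: u64, elem_size: u32) -> Vec<u8> {
    let mut out = Vec::with_capacity(24);
    out.extend_from_slice(MAGIC);
    out.push(version);
    out.push(kind);
    out.push(endian);
    out.push(0); // reserved
    out.extend_from_slice(&length.to_le_bytes());
    out.extend_from_slice(&elem_size.to_le_bytes());
    out
}

/// Returns (length, elem_size) if the magic matches, None otherwise.
fn decode_header(header: &[u8]) -> Option<(u64, u32)> {
    if header.len() < 24 || &header[..8] != MAGIC {
        return None;
    }
    let length = u64::from_le_bytes(header[12..20].try_into().ok()?);
    let elem_size = u32::from_le_bytes(header[20..24].try_into().ok()?);
    Some((length, elem_size))
}

fn main() {
    let header = encode_header(1, 2, 0, 1000, 8);
    assert_eq!(decode_header(&header), Some((1000, 8)));
    println!("header roundtrip ok");
}
```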

The manifest JSON lists each extracted vector with its path, file, kind, length, elem_size, and endian, plus a top-level object_kind. This allows a reader to map files back to R object paths.

Manifest Versioning

The manifest includes a top-level version field. Version 1 is the initial schema: { "version": 1, "object_kind": "...", "vectors": [...], "missing": [...] }. Future schema changes will increment this number and preserve backward compatibility where possible.
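
A hypothetical version-1 manifest might look like this (the values are illustrative; the field names follow the schema described above):

```json
{
  "version": 1,
  "object_kind": "dataframe",
  "vectors": [
    {
      "path": "data.temperature",
      "file": "data.temperature.rdsvec",
      "kind": "real",
      "length": 1000,
      "elem_size": 8,
      "endian": "big"
    }
  ],
  "missing": []
}
```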

Reader Guidance

Recommended reader flow:

  1. Load the manifest JSON.
  2. For each entry, open the referenced .rdsvec file.
  3. Validate the header (RDS2VEC1, version, kind, endian, length, elem_size).
  4. Read payload:
    • Numeric/logical/complex/raw: fixed-size element bytes.
    • Character: repeated i32 length + UTF-8 bytes records.

Example validation helper:

use rds2rust::{read_extraction_manifest, validate_vector_file_header};

let manifest = read_extraction_manifest("out/manifest.json")?;
for entry in &manifest.vectors {
    let path = format!("out/{}", entry.file);
    validate_vector_file_header(&path, entry)?;
}

High-Level Conversion Helpers

Library callers can use the higher-level conversion helpers to expand objects and emit raw dumps plus manifests without manually enumerating paths:

use rds2rust::{
    extract_object_to_raw_files_with_input_streaming,
    extract_object_to_raw_files_with_kind_and_input_streaming,
    ChunkedRdsSource, ObjectKind, ParseConfig,
};

let source = ChunkedRdsSource::from_path("data.rds")?;
let obj = rds2rust::read_rds_with_input(&source, ParseConfig::for_trusted_large_file())?.object;
let output = extract_object_to_raw_files_with_input_streaming(
    &obj,
    &source,
    "data",
    Some(4 * 1024 * 1024),
    std::path::Path::new("out"),
    Some("manifest.json"),
)?;
let output = extract_object_to_raw_files_with_kind_and_input_streaming(
    &obj,
    &source,
    "data",
    ObjectKind::DataFrame,
    Some(4 * 1024 * 1024),
    std::path::Path::new("out"),
    Some("manifest.json"),
)?;

Chunked Read APIs

If you want chunked reads in library code, use the chunked path helpers:

use rds2rust::{read_rds_from_path_chunked, ParseConfig};

let obj = read_rds_from_path_chunked("data.rds")?.object;
let obj = rds2rust::read_rds_from_path_chunked_with_config(
    "data.rds",
    ParseConfig::for_trusted_large_file(),
)?.object;

Lazy metadata parsing with chunked reads:

use rds2rust::read_rds_lazy_from_path_chunked;

let obj = read_rds_lazy_from_path_chunked("data.rds")?.object;
assert!(!obj.is_fully_loaded());

Working with Factors

use rds2rust::{read_rds, RObject};

let data = std::fs::read("factor.rds")?;
let obj = read_rds(&data)?.object;

if let RObject::Factor(factor) = obj {
    // Check if it's an ordered factor
    if factor.ordered {
        println!("Ordered factor with {} levels", factor.levels.len());
    }

    // Get level labels
    for level in &factor.levels {
        println!("Level: {}", level);
    }

    // Get values (1-based indices into levels)
    for &index in &factor.values {
        if index > 0 && index <= factor.levels.len() as i32 {
            let level = &factor.levels[(index - 1) as usize];
            println!("Value: {}", level);
        }
    }
}

Working with S3/S4 Objects

use rds2rust::{read_rds, RObject};

let data = std::fs::read("model.rds")?;
let obj = read_rds(&data)?.object;

// S3 objects
if let RObject::S3Object(s3) = obj {
    println!("S3 class: {:?}", s3.class);

    // Access base object
    if let RObject::List(elements) = s3.base.as_ref() {
        println!("S3 object is a list with {} elements", elements.len());
    }

    // Access additional attributes
    if let Some(desc) = s3.attributes.get("description") {
        println!("Description: {:?}", desc);
    }
}

// S4 objects
if let RObject::S4Object(s4) = obj {
    println!("S4 class: {:?}", s4.class);

    // Access slots
    if let Some(slot_value) = s4.slots.get("data") {
        println!("Data slot: {:?}", slot_value);
    }
}

Roundtrip: Read and Write

use rds2rust::{read_rds, write_rds};
use std::fs;

// Read an RDS file
let input_data = fs::read("input.rds")?;
let obj = read_rds(&input_data)?.object;

// Process the data...
// (modify the object as needed)

// Write back to RDS format
let output_data = write_rds(&obj)?;
fs::write("output.rds", output_data)?;

// Verify roundtrip
let obj2 = read_rds(&output_data)?.object;
assert_eq!(obj, obj2);

Type System

The RObject enum represents all possible R object types:

pub enum RObject {
    Null,
    Integer(VectorData<i32>),
    Real(VectorData<f64>),
    Logical(VectorData<Logical>),
    Character(VectorData<Arc<str>>),
    Symbol(Arc<str>),
    Raw(VectorData<u8>),
    Complex(VectorData<Complex>),
    List(Vec<RObject>),
    Pairlist(Vec<PairlistElement>),
    Language { function: Box<RObject>, args: Vec<PairlistElement> },
    Expression(Vec<RObject>),
    Closure { formals: Box<RObject>, body: Box<RObject>, environment: Box<RObject> },
    Environment { enclosing: Box<RObject>, frame: Box<RObject>, hashtab: Box<RObject> },
    Promise { value: Box<RObject>, expression: Box<RObject>, environment: Box<RObject> },
    Special { name: Arc<str> },
    Builtin { name: Arc<str> },
    Bytecode { code: Box<RObject>, constants: Box<RObject>, expr: Box<RObject> },
    DataFrame(Box<DataFrameData>),
    Factor(Box<FactorData>),
    S3Object(Box<S3ObjectData>),
    S4Object(Box<S4ObjectData>),
    Namespace(Vec<Arc<str>>),
    GlobalEnv,
    BaseEnv,
    EmptyEnv,
    MissingArg,
    UnboundValue,
    Shared(Arc<RwLock<RObject>>),
    WithAttributes { object: Box<RObject>, attributes: Attributes },
}

Special Values

R's special values are represented as:

  • NA (integers): RObject::NA_INTEGER constant (i32::MIN)
  • NA (logicals): Logical::Na enum variant
  • NA (real): check with f64::is_nan() (R encodes NA_real_ as a NaN, so this also matches ordinary NaN)
  • Inf/-Inf: f64::INFINITY and f64::NEG_INFINITY
  • NaN: f64::NAN
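
For example, since integer NA uses the i32::MIN sentinel, filtering NAs out of an integer vector is a plain comparison. The helper below is illustrative, not a crate API:

```rust
/// Drops NA entries from an R integer vector's values.
/// R (and rds2rust) encode integer NA as i32::MIN.
fn non_na_values(values: &[i32]) -> Vec<i32> {
    values.iter().copied().filter(|&x| x != i32::MIN).collect()
}

fn main() {
    let values = [1, i32::MIN, 3]; // 1, NA, 3
    assert_eq!(non_na_values(&values), vec![1, 3]);
    // Real NA is a NaN, so is_nan() matches it (and ordinary NaN).
    assert!(f64::NAN.is_nan());
}
```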

Memory Optimizations

rds2rust includes several memory optimizations for efficient data processing:

  1. String Interning - All strings use Arc<str> for automatic deduplication
  2. Boxed Large Variants - Large enum variants are boxed to reduce memory overhead
  3. Compact Attributes - SmallVec stores 0-2 attributes inline without heap allocation
  4. Object Deduplication - Identical objects are automatically shared during parsing

These optimizations typically reduce memory use by 20-50% on typical RDS files without changing the public API.
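
As a conceptual sketch of the interning idea (not the crate's actual internals), identical strings can share a single Arc<str> allocation:

```rust
use std::collections::HashMap;
use std::sync::Arc;

// Conceptual sketch of string interning: identical strings map to one
// shared Arc<str> allocation instead of duplicating the bytes.
struct Interner {
    pool: HashMap<String, Arc<str>>,
}

impl Interner {
    fn new() -> Self {
        Interner { pool: HashMap::new() }
    }

    fn intern(&mut self, s: &str) -> Arc<str> {
        if let Some(existing) = self.pool.get(s) {
            return Arc::clone(existing);
        }
        let arc: Arc<str> = Arc::from(s);
        self.pool.insert(s.to_string(), Arc::clone(&arc));
        arc
    }
}

/// Returns true if interning the same string twice yields one allocation.
fn demo_shared(s: &str) -> bool {
    let mut interner = Interner::new();
    let a = interner.intern(s);
    let b = interner.intern(s);
    Arc::ptr_eq(&a, &b)
}

fn main() {
    assert!(demo_shared("species"));
    println!("interning demo ok");
}
```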

Performance Tips

Reading Large Files

use rds2rust::read_rds;
use std::fs::File;
use std::io::Read;

// Read the entire file into a buffer up front; for chunked reads,
// see read_rds_from_path_chunked below.
let mut file = File::open("large.rds")?;
let mut buffer = Vec::new();
file.read_to_end(&mut buffer)?;

let obj = read_rds(&buffer)?.object;

Reusing Parsed Objects

use std::sync::Arc;
use rds2rust::read_rds;

// Wrap in Arc for cheap cloning
let data = std::fs::read("data.rds")?;
let obj = Arc::new(read_rds(&data)?.object);

// Clone is cheap (just increments reference count)
let obj2 = Arc::clone(&obj);

Limitations

  • Write support: All R types can be written except for some complex environment configurations
  • Compression formats: Currently supports gzip; bzip2/xz support planned
  • ALTREP: Reads ALTREP objects but writes them as regular vectors
  • External pointers: Not supported (rarely used in serialized data)

Development Status

Current version: 0.1.40

Test coverage: extensive test suite covering core R object types and roundtrips

Completed phases:

  • ✅ All basic R types (NULL, vectors, matrices, data frames)
  • ✅ All object-oriented types (S3, S4, factors)
  • ✅ All language types (expressions, formulas, closures, environments)
  • ✅ All special types (promises, special functions, builtin functions)
  • ✅ Reference tracking and ALTREP optimization
  • ✅ Complete read/write roundtrip support
  • ✅ Memory optimizations (string interning, compact attributes, deduplication)

License

Licensed under the MIT license.
