#genomics #dataframe #bioinformatics #ngs #arrow

oxbow

Read conventional genomic file formats as data frames and more via Apache Arrow

12 releases (6 breaking)

0.7.0 Mar 18, 2026
0.5.2 Mar 3, 2026
0.5.1 Dec 10, 2025
0.5.0 Nov 18, 2025
0.2.0 Jul 30, 2023

#91 in Biology

MIT/Apache

615KB
15K SLoC

oxbow

The core Rust library for oxbow.

Warning: oxbow is under active development. APIs are not yet stable and are subject to change.

Installation

To use oxbow in your Rust project, add oxbow to your Cargo.toml or run:

cargo add oxbow
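
For reference, the manual Cargo.toml route mentioned above might look like the fragment below. This is a minimal sketch: the version requirement "0.7" is an assumption taken from the latest release listed on this page (0.7.0), and because the API is not yet stable you may prefer to pin an exact version.

```toml
[dependencies]
# Assumed version requirement, based on the 0.7.0 release listed above.
oxbow = "0.7"
```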

Development

Ensure you have Rust installed on your system. You can install Rust using rustup.

Building the project

The oxbow Rust crate alone can be built using cargo.

cd oxbow
cargo build  # add --release for an optimized (non-debug) build

Linting and formatting

We use the standard Rust toolchain for linting and formatting Rust code.

Clippy is a Rust linter:

cargo clippy

The following command formats all source files of the current crate using rustfmt:

cargo fmt

Running Tests

To run tests on Rust code, we use cargo:

cargo test

lib.rs:

oxbow

oxbow reads genomic data formats 🧬 as Apache Arrow 🏹.

With the oxbow Rust library, you can serialize native formats into Arrow IPC, stream larger-than-memory files as Arrow RecordBatches with zero-copy over FFI, and more!

⚠️ The Rust API is under active development and is not yet stable. The API may change in future releases.

Source on GitHub.

Features

  • 🚀 Supports commonly used file formats from the htslib/GA4GH and UCSC ecosystems.
  • 🔍 Support for compression, indexing, column projection, and genomic range querying.
  • 🔧 Support for nested fields and complex, typed schemas (e.g., SAM tags, VCF INFO and FORMAT fields, and AutoSql).

Scanners

Scanners are the main interface for reading files. Each scanner is a parser for a specific format and provides scanning methods that return an iterator implementing the arrow::record_batch::RecordBatchReader trait.

Sequence formats

  • fasta: Scan FASTA files as Arrow RecordBatches.
  • fastq: Scan FASTQ files as Arrow RecordBatches.

Alignment formats

  • sam: Scan SAM files as Arrow RecordBatches.
  • bam: Scan BAM files as Arrow RecordBatches.
  • cram: Scan CRAM files as Arrow RecordBatches.

Variant formats

  • vcf: Scan VCF files as Arrow RecordBatches.
  • bcf: Scan BCF files as Arrow RecordBatches.

Interval feature formats

  • bed: Scan BED files as Arrow RecordBatches.
  • gtf: Scan GTF files as Arrow RecordBatches.
  • gff: Scan GFF files as Arrow RecordBatches.

UCSC Big Binary Indexed (BBI) formats

  • bigbed: Scan BigBed files as Arrow RecordBatches.
  • bigwig: Scan BigWig files as Arrow RecordBatches.
  • BBI zoom: Scan zoom-level summary statistics from BigWig/BigBed files as Arrow RecordBatches.

License

Licensed under MIT or Apache-2.0.

Dependencies

~38MB
~556K SLoC