#biology #science #arrow

exon

A platform for scientific data processing and analysis

54 releases (4 breaking)

0.5.3 Nov 29, 2023
0.4.3 Nov 27, 2023
0.2.6 Jul 31, 2023

#200 in Data structures

1,309 downloads per month
Used in 2 crates

Apache-2.0

680KB
14K SLoC

Exon

Exon is an analysis toolkit for life-science applications. It features:

  • Support for many file formats from bioinformatics, proteomics, and other domains
  • Local filesystem and object storage support
  • Arrow FFI primitives for multi-language support
  • SQL-based access to bioinformatics data, with general DML support and some DDL support

Installation

Exon is available via crates.io. To install, run:

cargo add exon

Documentation

  • Rust documentation is available here.
  • General documentation is available here.

Benchmarks

Please see the benchmarks README for more information.


lib.rs:

Exon is a library to facilitate open-ended analysis of scientific data, ease the application of ML models, and provide a common data interface for science and engineering teams.

Overview

The main user-facing interface is DataFusion's SessionContext combined with the ExonSessionExt extension trait. The trait adds convenience methods to SessionContext for loading data from various sources.

See the read_* methods on ExonSessionExt, such as read_fasta and read_gff, for more information. For ease of use, there is also a read_inferred_exon_table method that attempts to infer the data type and compression from the file extension.
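To make the idea of extension-based inference concrete, here is a minimal standard-library sketch. It is not Exon's implementation: the Compression enum, the infer_format function, and the set of recognised suffixes are all hypothetical, chosen only to show how a name like test.fasta.zstd can be split into a format and a codec.

```rust
use std::path::Path;

/// Compression codecs recognised by this sketch (hypothetical, not Exon's own type).
#[derive(Debug, PartialEq, Eq, Clone, Copy)]
pub enum Compression {
    Gzip,
    Zstd,
    Uncompressed,
}

/// Splits a file name such as `reads.fastq.gz` into a lowercase format
/// extension (`fastq`) and a compression codec (`Gzip`).
/// Returns `None` when no format extension can be found.
pub fn infer_format(path: &str) -> Option<(String, Compression)> {
    // Only the final path component matters; directory names may contain dots.
    let file_name = Path::new(path).file_name()?.to_str()?;
    let mut parts: Vec<&str> = file_name.split('.').collect();

    // parts[0] is the stem, so at least one more piece is required.
    if parts.len() < 2 {
        return None;
    }

    let last = parts.pop()?.to_ascii_lowercase();
    if last.is_empty() {
        return None;
    }

    let compression = match last.as_str() {
        "gz" | "gzip" => Compression::Gzip,
        "zst" | "zstd" => Compression::Zstd,
        // Not a compression suffix: treat it as the format itself.
        _ => return Some((last, Compression::Uncompressed)),
    };

    // A compression suffix must be preceded by a format extension.
    if parts.len() < 2 {
        return None;
    }
    let format = parts.pop()?.to_ascii_lowercase();
    if format.is_empty() {
        return None;
    }
    Some((format, compression))
}
```

In this sketch a bare compressed name such as archive.gz yields None, because there is no format extension left to dispatch on once the codec suffix is removed.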

To facilitate those methods, Exon implements a number of traits for DataFusion that serve as a good base for scientific data work. See the datasources module for more information.

Examples

Loading a FASTQ file

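use exon::ExonSessionExt;

use datafusion::prelude::*;
use datafusion::error::Result;

// `.await` needs an async context; a Tokio runtime is assumed here.
#[tokio::main]
async fn main() -> Result<()> {
    let ctx = SessionContext::new();

    // The second argument is the optional compression type; none is specified here.
    let df = ctx.read_fastq("test-data/datasources/fastq/test.fastq", None).await?;

    // A FASTQ table exposes four columns.
    assert_eq!(df.schema().fields().len(), 4);
    assert_eq!(df.schema().field(0).name(), "name");
    assert_eq!(df.schema().field(1).name(), "description");
    assert_eq!(df.schema().field(2).name(), "sequence");
    assert_eq!(df.schema().field(3).name(), "quality_scores");

    Ok(())
}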
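The four columns asserted above correspond to the parts of a FASTQ record. The following standard-library sketch shows one way a record could map onto those column names. It is an illustration only, not Exon's parser: FastqRecord and parse_fastq are hypothetical names, and the sketch assumes strict four-line records with no blank lines inside a record.

```rust
/// One FASTQ record; field names mirror the column names in the example above.
/// This struct is a hypothetical illustration, not an Exon type.
#[derive(Debug, PartialEq, Eq)]
pub struct FastqRecord {
    pub name: String,
    pub description: Option<String>,
    pub sequence: String,
    pub quality_scores: String,
}

/// Parses strict four-line FASTQ records (header, sequence, `+` line, qualities).
pub fn parse_fastq(text: &str) -> Result<Vec<FastqRecord>, String> {
    // Simplification: blank lines are dropped before grouping into records.
    let lines: Vec<&str> = text
        .lines()
        .map(str::trim_end)
        .filter(|l| !l.is_empty())
        .collect();

    if lines.len() % 4 != 0 {
        return Err(format!("expected a multiple of 4 lines, found {}", lines.len()));
    }

    let mut records = Vec::with_capacity(lines.len() / 4);
    for chunk in lines.chunks(4) {
        let header = chunk[0]
            .strip_prefix('@')
            .ok_or_else(|| format!("header must start with '@': {}", chunk[0]))?;
        if !chunk[2].starts_with('+') {
            return Err(format!("separator must start with '+': {}", chunk[2]));
        }
        // Each base needs exactly one quality character.
        if chunk[1].len() != chunk[3].len() {
            return Err(format!("sequence/quality length mismatch in record '{}'", header));
        }

        // Text before the first space is the name; the rest is the description.
        let (name, description) = match header.split_once(' ') {
            Some((n, d)) => (n.to_string(), Some(d.to_string())),
            None => (header.to_string(), None),
        };

        records.push(FastqRecord {
            name,
            description,
            sequence: chunk[1].to_string(),
            quality_scores: chunk[3].to_string(),
        });
    }
    Ok(records)
}
```

Calling parse_fastq on the text "@r1 first read\nACGT\n+\nIIII\n" yields a single record whose name is r1 and whose description is "first read".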

Loading a ZSTD-compressed FASTA file

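use exon::ExonSessionExt;

use datafusion::prelude::*;
use datafusion::error::Result;
use datafusion::common::FileCompressionType;

// `.await` needs an async context; a Tokio runtime is assumed here.
#[tokio::main]
async fn main() -> Result<()> {
    let ctx = SessionContext::new();

    // Pass the compression type explicitly so the file is decompressed while reading.
    let file_compression = FileCompressionType::ZSTD;
    let df = ctx.read_fasta("test-data/datasources/fasta/test.fasta.zstd", Some(file_compression)).await?;

    // A FASTA table exposes three columns.
    assert_eq!(df.schema().fields().len(), 3);
    assert_eq!(df.schema().field(0).name(), "id");
    assert_eq!(df.schema().field(1).name(), "description");
    assert_eq!(df.schema().field(2).name(), "sequence");

    // Execute the plan and gather the resulting record batches.
    let results = df.collect().await?;
    assert_eq!(results.len(), 1);  // 1 batch, small dataset

    Ok(())
}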
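Unlike FASTQ, a FASTA sequence may wrap across several lines, so producing one row per record means folding the lines that follow each header. The sketch below shows that folding with the standard library only. It is not Exon's reader: FastaRecord and parse_fasta are hypothetical names, and the field names simply mirror the columns asserted above.

```rust
/// One FASTA record; field names mirror the column names in the example above.
/// This struct is a hypothetical illustration, not an Exon type.
#[derive(Debug, PartialEq, Eq)]
pub struct FastaRecord {
    pub id: String,
    pub description: Option<String>,
    pub sequence: String,
}

/// Parses FASTA text, joining wrapped sequence lines into a single string.
pub fn parse_fasta(text: &str) -> Vec<FastaRecord> {
    let mut records: Vec<FastaRecord> = Vec::new();

    for line in text.lines().map(str::trim).filter(|l| !l.is_empty()) {
        if let Some(header) = line.strip_prefix('>') {
            // Text before the first whitespace is the id; the rest is the description.
            let (id, description) = match header.split_once(char::is_whitespace) {
                Some((id, rest)) => (id.to_string(), Some(rest.trim().to_string())),
                None => (header.to_string(), None),
            };
            records.push(FastaRecord {
                id,
                description,
                sequence: String::new(),
            });
        } else if let Some(current) = records.last_mut() {
            // Continuation line: append to the most recent record's sequence.
            current.sequence.push_str(line);
        }
        // Sequence lines that appear before any header are ignored in this sketch.
    }

    records
}
```

For the input ">seq1 example protein\nMKV\nLAA\n", the two sequence lines are folded into one record with the sequence MKVLAA.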

Dependencies

~69MB
~1.5M SLoC