54 releases (4 breaking)
| Version | Released |
|---|---|
| 0.5.3 | Nov 29, 2023 |
| 0.4.3 | Nov 27, 2023 |
| 0.2.6 | Jul 31, 2023 |
#200 in Data structures
1,309 downloads per month
Used in 2 crates
680KB, 14K SLoC
Exon is an analysis toolkit for life-science applications. It features:
- Support for many file formats from bioinformatics, proteomics, and others
- Local filesystem and object storage support
- Arrow FFI primitives for multi-language support
- SQL-based access to bioinformatics data, with general DML and some DDL support
Installation
Exon is available via crates.io. To install, run:
cargo add exon
Benchmarks
Please see the benchmarks README for more information.
lib.rs:
Exon is a library to facilitate open-ended analysis of scientific data, ease the application of ML models, and provide a common data interface for science and engineering teams.
Overview
The main interface for users is DataFusion's SessionContext plus the ExonSessionExt extension trait, which adds a number of convenience methods for loading data from various sources. See the read_* methods on ExonSessionExt for more information, for example read_fasta or read_gff. There's also a read_inferred_exon_table method that attempts to infer the data type and compression from the file extension.
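To make the idea of extension-based inference concrete, here is a minimal standard-library sketch. It is not Exon's implementation: the `Compression` enum, the `infer_from_extension` function, and the set of recognised suffixes are all hypothetical and exist only for illustration.

```rust
// Hypothetical sketch of extension-based inference; none of these names
// come from Exon.

#[derive(Debug, PartialEq)]
enum Compression {
    None,
    Gzip,
    Zstd,
}

/// Returns the lowercase format extension and the compression implied by
/// a trailing suffix, or `None` if the file name has no format extension.
fn infer_from_extension(path: &str) -> Option<(String, Compression)> {
    // Only the final path component matters.
    let file_name = path.rsplit('/').next()?;
    let mut parts: Vec<&str> = file_name.split('.').collect();

    // Check whether the last dot-separated part is a known compression suffix.
    let last = parts.last()?.to_ascii_lowercase();
    let compression = match last.as_str() {
        "gz" => Compression::Gzip,
        "zst" | "zstd" => Compression::Zstd,
        _ => Compression::None,
    };
    if compression != Compression::None {
        parts.pop();
    }

    // A stem plus a format extension must remain, e.g. ["test", "fasta"].
    if parts.len() < 2 {
        return None;
    }
    let format = parts.last()?.to_ascii_lowercase();
    Some((format, compression))
}

fn main() {
    println!("{:?}", infer_from_extension("test-data/test.fasta.zstd"));
    println!("{:?}", infer_from_extension("reads.fastq"));
}
```

Stripping the compression suffix before reading the format extension lets a name like test.fasta.zstd resolve to a FASTA file with ZSTD compression, while a name with no format extension yields no guess at all.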
To facilitate those methods, Exon implements a number of traits for DataFusion that serve as a good base for scientific data work. See the datasources module for more information.
Examples
Loading a FASTQ file
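use exon::ExonSessionExt;
use datafusion::prelude::*;
use datafusion::error::Result;

// `.await` and `?` need an async function that returns a Result;
// a Tokio runtime is assumed here.
#[tokio::main]
async fn main() -> Result<()> {
    let ctx = SessionContext::new();

    // No compression is specified for this plain-text FASTQ file.
    let df = ctx.read_fastq("test-data/datasources/fastq/test.fastq", None).await?;

    // The FASTQ table exposes four columns.
    assert_eq!(df.schema().fields().len(), 4);
    assert_eq!(df.schema().field(0).name(), "name");
    assert_eq!(df.schema().field(1).name(), "description");
    assert_eq!(df.schema().field(2).name(), "sequence");
    assert_eq!(df.schema().field(3).name(), "quality_scores");
    Ok(())
}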
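As a rough illustration of where those four columns come from, the sketch below parses a single four-line FASTQ record into a struct with matching field names. It is not Exon's parser: `FastqRecord` and `parse_fastq_record` are hypothetical names. Splitting the header at the first space into name and description, and treating the description as optional, are assumptions based on the common FASTQ convention.

```rust
// Hypothetical sketch; not Exon's parser. Assumes unwrapped, four-line
// FASTQ records.

#[derive(Debug, PartialEq)]
struct FastqRecord {
    name: String,
    description: Option<String>,
    sequence: String,
    quality_scores: String,
}

fn parse_fastq_record(text: &str) -> Option<FastqRecord> {
    let mut lines = text.lines();

    // Line 1: '@' followed by the name and an optional description.
    let header = lines.next()?.strip_prefix('@')?;
    // Line 2: the sequence itself.
    let sequence = lines.next()?.to_string();
    // Line 3: a separator that starts with '+'.
    if !lines.next()?.starts_with('+') {
        return None;
    }
    // Line 4: one quality character per base.
    let quality_scores = lines.next()?.to_string();
    if quality_scores.len() != sequence.len() {
        return None;
    }

    // Assumed convention: the name ends at the first space.
    let (name, description) = match header.split_once(' ') {
        Some((n, d)) => (n.to_string(), Some(d.to_string())),
        None => (header.to_string(), None),
    };

    Some(FastqRecord { name, description, sequence, quality_scores })
}

fn main() {
    let record = parse_fastq_record("@SEQ_ID sample read\nGATTA\n+\n!''*(");
    println!("{:?}", record);
}
```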
Loading a ZSTD-compressed FASTA file
use exon::ExonSessionExt;
use datafusion::prelude::*;
use datafusion::error::Result;
use datafusion::common::FileCompressionType;

// `.await` and `?` need an async function that returns a Result;
// a Tokio runtime is assumed here.
#[tokio::main]
async fn main() -> Result<()> {
    let ctx = SessionContext::new();

    // Pass the compression type explicitly so the reader decompresses the file.
    let file_compression = FileCompressionType::ZSTD;
    let df = ctx.read_fasta("test-data/datasources/fasta/test.fasta.zstd", Some(file_compression)).await?;

    // The FASTA table exposes three columns.
    assert_eq!(df.schema().fields().len(), 3);
    assert_eq!(df.schema().field(0).name(), "id");
    assert_eq!(df.schema().field(1).name(), "description");
    assert_eq!(df.schema().field(2).name(), "sequence");

    let results = df.collect().await?;
    assert_eq!(results.len(), 1); // 1 batch, small dataset
    Ok(())
}
Dependencies
~69MB, ~1.5M SLoC