3 releases

0.3.4-beta.9 Oct 27, 2023

Apache-2.0

7KB

Exon

Exon is an analysis toolkit for life-science applications. It features:

  • Support for many file formats from bioinformatics, proteomics, and others
  • Local filesystem and object storage support
  • Arrow FFI primitives for multi-language support
  • SQL-based access to bioinformatics data, with general DML and some DDL support

Note that Exon was recently split out of a larger library, so please be patient while we finish cleaning up after that move. If you have a comment or question in the meantime, please file an issue.

Installation

Exon is available via crates.io. To install, run:

cargo add exon

Usage

Exon is designed to be used as a library. For example, to read a FASTA file:

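use exon::context::ExonSessionExt;

use datafusion::prelude::*;
use datafusion::error::Result;

// Assumes a Tokio runtime; any async runtime that can drive the future works.
#[tokio::main]
async fn main() -> Result<()> {
    // Create a DataFusion session context with Exon's extensions.
    let ctx = SessionContext::new_exon();

    // Read the FASTA file into a DataFrame.
    let df = ctx.read_fasta("test-data/datasources/fasta/test.fasta", None).await?;

    Ok(())
}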

Please see the Rust docs for more information.

File Formats

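| Format    | Compression(s) | Inferred Extension(s) |
|-----------|----------------|-----------------------|
| BAM       | -              | .bam                  |
| BCF       | -              | .bcf                  |
| BED       | gz, zstd       | .bed                  |
| FASTA     | gz, zstd       | .fasta, .fa, .fna     |
| FASTQ     | gz, zstd       | .fastq, .fq           |
| GENBANK   | gz, zstd       | .gbk, .genbank, .gb   |
| GFF       | gz, zstd       | .gff                  |
| GTF       | gz, zstd       | .gtf                  |
| HMMDOMTAB | gz, zstd       | .hmmdomtab            |
| MZML      | gz, zstd       | .mzml[^2]             |
| SAM       | -              | .sam                  |
| VCF       | gz[^1]         | .vcf                  |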

[^1]: Uses bgzip, not gzip.

[^2]: The mixed-case extension .mzML also works.
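The table above can be read as a lookup from file name to format and compression. The following is a minimal sketch of that idea using only the standard library. It is not Exon's implementation: the function name `infer_format`, the `Compression` enum, the `.gz` and `.zst` file suffixes, and the case-insensitive matching are all assumptions made for illustration.

```rust
/// Compression codecs named in the table (hypothetical type, not Exon's).
#[derive(Debug, Clone, Copy, PartialEq)]
enum Compression {
    Uncompressed,
    Gzip,
    Zstd,
}

/// Hypothetical helper: infers the format name and compression from a file
/// name, following the table. Returns `None` for unknown extensions or for
/// compression the table does not list for that format.
fn infer_format(path: &str) -> Option<(&'static str, Compression)> {
    // Assumption: matching is case-insensitive, so ".mzML" is accepted too.
    let lower = path.to_ascii_lowercase();

    // Peel off a trailing compression suffix (assumed to be ".gz" or ".zst").
    let (stem, compression) = if let Some(stem) = lower.strip_suffix(".gz") {
        (stem, Compression::Gzip)
    } else if let Some(stem) = lower.strip_suffix(".zst") {
        (stem, Compression::Zstd)
    } else {
        (lower.as_str(), Compression::Uncompressed)
    };

    // The remaining extension selects the format.
    let (_, ext) = stem.rsplit_once('.')?;

    // Each arm yields (format, gz supported, zstd supported).
    let (format, gz, zstd) = match ext {
        "bam" => ("BAM", false, false),
        "bcf" => ("BCF", false, false),
        "bed" => ("BED", true, true),
        "fasta" | "fa" | "fna" => ("FASTA", true, true),
        "fastq" | "fq" => ("FASTQ", true, true),
        "gbk" | "genbank" | "gb" => ("GENBANK", true, true),
        "gff" => ("GFF", true, true),
        "gtf" => ("GTF", true, true),
        "hmmdomtab" => ("HMMDOMTAB", true, true),
        "mzml" => ("MZML", true, true),
        "sam" => ("SAM", false, false),
        // The table lists only gz (bgzip) for VCF.
        "vcf" => ("VCF", true, false),
        _ => return None,
    };

    let supported = match compression {
        Compression::Uncompressed => true,
        Compression::Gzip => gz,
        Compression::Zstd => zstd,
    };
    if supported {
        Some((format, compression))
    } else {
        None
    }
}
```

Under these assumptions, `infer_format("reads.fq.gz")` yields the FASTQ format with gzip compression, while a name such as `alignments.bam.gz` is rejected because the table lists no external compression for BAM.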

Settings

Exon uses the following settings:

| Setting                | Default | Description |
|------------------------|---------|-------------|
| exon.vcf_parse_info    | true    | Parse VCF INFO fields. If false, INFO fields will be returned as a single string. |
| exon.vcf_parse_formats | true    | Parse VCF FORMAT fields. If false, FORMAT fields will be returned as a single string. |

You can update the settings by running:

SET <setting> = <value>;

For example, to disable parsing of VCF INFO fields:

SET exon.vcf_parse_info = false;
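To make the difference between a parsed INFO field and a single string concrete, here is a small conceptual sketch using only the standard library. It is not how Exon parses or represents INFO data, which this README does not describe. The function name `parse_info` and the map-based result are assumptions for illustration. It relies only on the VCF convention that INFO is a semicolon-separated list of `key=value` entries, with value-less flags and `.` for an empty column.

```rust
use std::collections::BTreeMap;

/// Hypothetical helper: splits a raw VCF INFO string such as
/// "DP=14;AF=0.5;DB" into key/value pairs. Flags that carry no value
/// (like "DB") map to `None`.
fn parse_info(raw: &str) -> BTreeMap<String, Option<String>> {
    let mut fields = BTreeMap::new();

    // A lone "." marks an empty INFO column.
    if raw == "." {
        return fields;
    }

    for entry in raw.split(';').filter(|e| !e.is_empty()) {
        match entry.split_once('=') {
            // key=value entry
            Some((key, value)) => fields.insert(key.to_string(), Some(value.to_string())),
            // flag entry with no value
            None => fields.insert(entry.to_string(), None),
        };
    }
    fields
}
```

With parsing disabled, a consumer would see only the raw string (for example `DP=14;AF=0.5;DB`) and would need to do this kind of splitting itself.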

Benchmarks

Please see the benchmarks README for more information.
