2 releases

0.1.1-alpha.4 Jun 7, 2023
0.1.1-alpha.3 Jun 6, 2023

Apache-2.0

325KB
6.5K SLoC

TCA is a library to facilitate open-ended analysis of scientific data, ease the application of ML models, and provide a common data interface for science and engineering teams.

Overview

Users interact with TCA mainly through DataFusion's SessionContext together with the TCASessionExt extension trait, which adds a number of convenience methods for loading data from various sources.

See the read_* methods on TCASessionExt, such as read_fasta and read_gff, for more information. For ease of use, there is also a read_inferred_tca_table method that attempts to infer the data type and compression from the file extension.
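To make the idea of extension-based inference concrete, here is a minimal, standalone sketch. It is not TCA's implementation: the `FileKind` and `Compression` enums, the `infer_from_path` function, and the set of recognized extensions are all hypothetical names chosen for illustration.

```rust
use std::path::Path;

/// Hypothetical file kinds, for illustration only (not TCA's types).
#[derive(Debug, PartialEq, Eq, Clone, Copy)]
enum FileKind {
    Fasta,
    Gff,
}

/// Hypothetical compression codecs, for illustration only.
#[derive(Debug, PartialEq, Eq, Clone, Copy)]
enum Compression {
    Uncompressed,
    Gzip,
    Zstd,
}

/// Infer (kind, compression) from the extensions of a file name.
/// Returns `None` when the data extension is not recognized.
fn infer_from_path(path: &str) -> Option<(FileKind, Compression)> {
    let name = Path::new(path).file_name()?.to_str()?;

    // Walk the dot-separated parts from the right: the last one may be
    // a compression suffix, in which case the data extension precedes it.
    let mut parts = name.rsplit('.');
    let last = parts.next()?;
    let (compression, data_ext) = match last {
        "gz" => (Compression::Gzip, parts.next()?),
        "zst" | "zstd" => (Compression::Zstd, parts.next()?),
        other => (Compression::Uncompressed, other),
    };

    // Require a file stem so a bare name like "fasta" is not matched.
    let _stem = parts.next()?;

    let kind = match data_ext {
        "fasta" | "fa" => FileKind::Fasta,
        "gff" => FileKind::Gff,
        _ => return None,
    };
    Some((kind, compression))
}
```

Under these assumptions, a path such as test.fasta.zstd resolves to a FASTA file with ZSTD compression, while an unrecognized extension yields `None` so the caller can fall back to an explicit read_* method.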

To support those methods, TCA implements a number of DataFusion traits that serve as a good base for scientific data work. See the datasources module for more information.

Examples

Loading a FASTA file

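use tca::context::TCASessionExt;

use datafusion::prelude::*;
use datafusion::error::Result;

// The example awaits futures, so it runs inside an async entry point
// (tokio is assumed here as the runtime).
#[tokio::main]
async fn main() -> Result<()> {
    let ctx = SessionContext::new();

    // No compression type is given for the plain-text file.
    let df = ctx.read_fasta("test-data/datasources/fasta/test.fasta", None).await?;

    // Each FASTA record is exposed as three columns.
    assert_eq!(df.schema().fields().len(), 3);
    assert_eq!(df.schema().field(0).name(), "id");
    assert_eq!(df.schema().field(1).name(), "description");
    assert_eq!(df.schema().field(2).name(), "sequence");

    let results = df.collect().await?;
    assert_eq!(results.len(), 1);  // 1 batch, small dataset

    Ok(())
}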
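
The three columns checked above (id, description, sequence) correspond to the parts of a FASTA record. As a rough illustration of that mapping, the following standalone sketch parses FASTA text using only the standard library. It is not TCA's parser: `FastaRecord` and `parse_fasta` are hypothetical names, and treating a missing description as `None` is an assumption.

```rust
/// One FASTA record, mirroring the three column names in the schema.
/// Hypothetical type, for illustration only.
#[derive(Debug, PartialEq, Eq)]
struct FastaRecord {
    id: String,
    description: Option<String>,
    sequence: String,
}

/// Parse FASTA text into records. Lines before the first header are ignored.
fn parse_fasta(text: &str) -> Vec<FastaRecord> {
    let mut records: Vec<FastaRecord> = Vec::new();
    for line in text.lines() {
        let line = line.trim_end();
        if let Some(header) = line.strip_prefix('>') {
            // Header line: the id runs up to the first whitespace,
            // and anything after it is the description.
            let mut parts = header.splitn(2, char::is_whitespace);
            let id = parts.next().unwrap_or("").to_string();
            let description = parts
                .next()
                .map(str::trim)
                .filter(|d| !d.is_empty())
                .map(str::to_string);
            records.push(FastaRecord {
                id,
                description,
                sequence: String::new(),
            });
        } else if let Some(current) = records.last_mut() {
            // Sequence lines are concatenated onto the most recent record.
            current.sequence.push_str(line.trim());
        }
    }
    records
}
```

A sequence that spans several lines ends up as a single string in the sequence field, which is what makes a one-row-per-record table layout possible.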

Loading a ZSTD-compressed FASTA file

use tca::context::TCASessionExt;

use datafusion::prelude::*;
use datafusion::error::Result;
use datafusion::datasource::file_format::file_type::FileCompressionType;

// As above, an async entry point is needed for the awaits
// (tokio is assumed here as the runtime).
#[tokio::main]
async fn main() -> Result<()> {
    let ctx = SessionContext::new();

    // Pass the compression type explicitly for the compressed file.
    let file_compression = FileCompressionType::ZSTD;
    let df = ctx.read_fasta("test-data/datasources/fasta/test.fasta.zstd", Some(file_compression)).await?;

    // The schema is the same as for the uncompressed file.
    assert_eq!(df.schema().fields().len(), 3);
    assert_eq!(df.schema().field(0).name(), "id");
    assert_eq!(df.schema().field(1).name(), "description");
    assert_eq!(df.schema().field(2).name(), "sequence");

    let results = df.collect().await?;
    assert_eq!(results.len(), 1);  // 1 batch, small dataset

    Ok(())
}
