
bin+lib nu_plugin_bio

Parse and manipulate common bioinformatic formats in nushell

5 releases (3 breaking)

0.85.0 Oct 18, 2023
0.76.0 Feb 23, 2023
0.74.2 Jan 24, 2023
0.74.1 Jan 23, 2023
0.70.0 Oct 25, 2022

#1234 in Parser implementations

MIT and maybe CC-PDDC

370KB
1.5K SLoC

Nushell bio

A bioinformatics plugin for nushell. This plugin parses most common bioinformatics formats into structured data so you can use them with nushell more effectively.

Quick setup

Go and get nushell, it's great, then come back! I'm assuming you have the Rust toolchain installed.

# clone this repo
git clone https://github.com/Euphrasiologist/nu_plugin_bio
# change into the repo directory
cd nu_plugin_bio
# build
# it's quite a long compile time...
cargo build --release
# register the plugin (run this from inside nushell)
# the path is relative to the repo directory we just changed into
register ./target/release/nu_plugin_bio

# see the list of supported file formats below
# now you can just use open, and the format will be detected from the file extension.

# there are some test files in the tests/ dir.
open ./tests/test.fasta
    | get id

# if you want to add flags you have to explicitly use from <x>
# e.g. if you want descriptions in fasta files to be parsed.

open --raw ./tests/test.fasta 
    | from fasta -d
    | first

The backend wraps noodles, an excellent, all-Rust bioinformatics I/O library.
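To make "structured data" a bit more concrete, here is a minimal, std-only Rust sketch of turning FASTA text into records with an id, an optional description, and a sequence. It is not the plugin's code and does not use the noodles API; `FastaRecord` and `parse_fasta` are hypothetical names, and the `with_description` argument only mimics the idea behind `from fasta -d`.

```rust
/// One FASTA record as a row of structured data.
/// Hypothetical type for illustration; not the plugin's or noodles' type.
#[derive(Debug, PartialEq)]
struct FastaRecord {
    id: String,
    description: Option<String>,
    sequence: String,
}

/// Parse FASTA text into records.
/// `with_description` is an assumption that mimics the idea of `from fasta -d`.
fn parse_fasta(text: &str, with_description: bool) -> Vec<FastaRecord> {
    let mut records: Vec<FastaRecord> = Vec::new();
    for line in text.lines() {
        let line = line.trim_end();
        if let Some(header) = line.strip_prefix('>') {
            // Header line: the id is the first whitespace-delimited token,
            // anything after it is the free-text description.
            let (id, rest) = match header.split_once(char::is_whitespace) {
                Some((id, rest)) => (id, rest.trim()),
                None => (header, ""),
            };
            let description = if with_description && !rest.is_empty() {
                Some(rest.to_string())
            } else {
                None
            };
            records.push(FastaRecord {
                id: id.to_string(),
                description,
                sequence: String::new(),
            });
        } else if let Some(current) = records.last_mut() {
            // Sequence lines may be wrapped; join them into one string.
            // Lines before the first header are ignored.
            current.sequence.push_str(line.trim());
        }
    }
    records
}
```

Calling `parse_fasta(text, true)` on a file's contents gives one record per `>` header, which is roughly the shape of table a nushell pipeline such as `get id` operates on.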

Aims

Aim to support the following:

  • BAM 1.6
  • BCF 2.2
    • bcf.gz
  • VCF 4.3
    • vcf.gz
  • BED (BED3 only right now)
  • CRAM 3.0
  • FASTA
    • fa.gz
  • FASTQ
    • fq.gz
  • GFF3
  • GTF 2.2
  • SAM 1.6
  • GFA 1.0
    • gfa.gz

Note that performance will not be optimal with the current state of nu_plugin: a plugin cannot access the engine state of nushell, so entire data structures have to be loaded into memory. Testing still needs to be done on large files.
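As a rough illustration of why loading everything into memory matters, the std-only Rust sketch below summarises FASTA input in two ways: one line at a time, and after reading the whole input first. Both functions are hypothetical and have nothing to do with the nu_plugin API; they only contrast the two memory patterns.

```rust
use std::io::{self, BufRead, Read};

/// Streaming: count records and bases one line at a time, so only the
/// current line is held in memory.
fn count_streaming<R: BufRead>(reader: R) -> io::Result<(usize, usize)> {
    let mut records = 0;
    let mut bases = 0;
    for line in reader.lines() {
        let line = line?;
        if line.starts_with('>') {
            records += 1;
        } else {
            bases += line.trim().len();
        }
    }
    Ok((records, bases))
}

/// Eager: read the whole input into one String before summarising it,
/// so memory use grows with the size of the file.
fn count_eager<R: BufRead>(mut reader: R) -> io::Result<(usize, usize)> {
    let mut text = String::new();
    reader.read_to_string(&mut text)?;
    let records = text.lines().filter(|l| l.starts_with('>')).count();
    let bases: usize = text
        .lines()
        .filter(|l| !l.starts_with('>'))
        .map(|l| l.trim().len())
        .sum();
    Ok((records, bases))
}
```

Both functions return the same counts; the difference is only in peak memory, which is the cost the note above describes for large files.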

More?

If there's a bioinformatics format you want added, let me know, or open a PR.
