app miniphy

Create an ordered FASTA TAR file

new 2.0.0-alpha.8 Feb 11, 2025

MIT license


MiniPhy2

Introduction

MiniPhy2 is the second version of the MiniPhy workflow for phylogenetic compression of large bacterial genome collections. It has been rewritten entirely in Rust and minimizes on-disk operations, which makes it much better suited to very large collections. Compression performance should be nearly identical to that of the original MiniPhy.

Installation

Prerequisites:

  • Rust (latest stable release)
  • CMake (required by some of the Rust packages)

Installation from git:

git clone https://github.com/karel-brinda/miniphy2
cd miniphy2
make
./miniphy2 -h
# or: ./target/release/miniphy2 -h

Downloading automatically built binaries: Go to https://github.com/karel-brinda/miniphy2/actions?query=CI and find the corresponding artifact.

Command-Line Usage

General Syntax

miniphy2 [command] [options] [arguments]

The compress command

Purpose: Compresses a single batch in a provided order (e.g., from AttoTree)

$ ./miniphy2 compress --help
Compress

Usage: miniphy2 compress [OPTIONS] <INPUT>...

Arguments:
  <INPUT>...  Files to include in the tar archive

Options:
  -l, --list           The provided files are lists of files
  -f, --force          Rewrite the output file if it already exists
  -u, --uncompressed   No TAR compression (otherwise compressed by xz -9 -T1 in memory)
  -o, --output <FILE>  Output file, - for stdout [default: -]
  -p, --prefix <STR>   Path prefix for files in the TAR file (e.g, batch1_ or batch1/) [default: ./]
  -h, --help           Print help

Workflow - Compression of a single batch

The input for compression consists of genome batches (of at most approximately 10k genomes each), obtained, for instance, through MiniPhy 1. The following steps compress a single batch.

1. Prepare input files

Generate a file containing genome paths:

find /batch/directory -name '*.fa' > input.txt

The resulting input.txt is the list of genome file locations.
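Step 3 below rebuilds each path from a tree leaf name plus the .fa suffix, which suggests that genomes are identified by their file name without the extension. Under that assumption, two files with the same name in different subdirectories would collide. A minimal sketch of a check for this (the function name duplicate_names is hypothetical and not part of MiniPhy2):

```shell
# Print genome names (file name without the .fa suffix) that occur more than
# once in a path list. Prints nothing when all names are unique.
# Assumption: genomes are identified by file name without ".fa".
duplicate_names() {
    # $1: path list with one FASTA path per line (e.g., input.txt)
    sed -e 's|.*/||' -e 's|\.fa$||' "$1" | sort | uniq -d
}
```

Running duplicate_names input.txt before building the tree should print nothing; any printed name points to genomes that the later steps could not tell apart.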

2. Compute a compressive phylogeny

Use AttoTree with the default parameters:

attotree -L input.txt -o tree.nw

3. Generate ordered genome paths

cat tree.nw | grep -o '[^,:()]*:' | sed 's/:$//' | grep -Ev ^$ \
   | awk -v d="/batch/directory" '{print d "/" $0 ".fa"}' \
   > phylogenetic_order.txt

Output: phylogenetic_order.txt, an ordered sequence of genome file paths for MiniPhy2 processing.
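To make the extraction step easier to inspect, the sketch below wraps the same grep/sed pipeline in a function and adds a check that the reordered list contains exactly the paths from input.txt. Both function names are hypothetical helpers, not part of MiniPhy2 or AttoTree, and the check assumes that input.txt already holds paths of the form /batch/directory/NAME.fa.

```shell
# Read a Newick tree on stdin and print its leaf names in left-to-right order.
# Same pipeline as above: keep every "label:" token, strip the colon, and drop
# the empty labels produced by unnamed internal nodes.
newick_leaves() {
    grep -o '[^,:()]*:' | sed 's/:$//' | grep -Ev '^$'
}

# Succeed only if two path lists contain the same lines, ignoring order.
same_path_set() {
    # $1: original list (e.g., input.txt); $2: reordered list
    sorted_a=$(sort "$1") || return 2
    sorted_b=$(sort "$2") || return 2
    [ "$sorted_a" = "$sorted_b" ]
}
```

For the toy tree ((A:0.1,B:0.2):0.05,C:0.3); the first function prints A, B, and C on separate lines. After step 3, same_path_set input.txt phylogenetic_order.txt should succeed; a failure would indicate lost, duplicated, or misnamed genomes. Internal nodes that carry their own labels (for example, support values) would also be reported as leaves by this pipeline.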

4. Run compression

./miniphy2 compress -p 'batchX/' -lfo compressed_genomes.tar.xz phylogenetic_order.txt

These options instruct MiniPhy2 to read the genome paths from phylogenetic_order.txt (-l), compress the genomes in that order using xz -9 -T1, and save the result to compressed_genomes.tar.xz (-o), overwriting the file if it already exists (-f). In addition, -p prepends batchX/ to each file name in the output archive, so all genomes end up in a directory of that name.
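One way to confirm that the archive preserves the phylogenetic order is to list its members and compare their file names with the ordered list. The sketch below uses a hypothetical helper, archive_matches_order, and rests on two assumptions: the local tar can read xz-compressed archives directly (as GNU tar and bsdtar usually can), and each member's file name equals the base name of its input path. Prefixes and directory entries are ignored, so the check does not depend on the exact member paths that MiniPhy2 writes.

```shell
# Succeed only if the file names stored in a TAR archive appear in the same
# order as the base names in an ordered path list.
archive_matches_order() {
    # $1: archive (e.g., compressed_genomes.tar.xz)
    # $2: ordered path list (e.g., phylogenetic_order.txt)
    expected=$(sed 's|.*/||' "$2") || return 2
    # Strip leading directories from member names and drop directory entries,
    # which become empty lines after stripping.
    actual=$(tar -tf "$1" | sed 's|.*/||' | grep -v '^$') || return 2
    [ "$expected" = "$actual" ]
}
```

A call such as archive_matches_order compressed_genomes.tar.xz phylogenetic_order.txt should then succeed for an archive produced in step 4.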

Issues

Please use GitHub issues.

Changelog

See Releases.

License

MIT

Support & Contact
