2.0.0-alpha.8	Feb 11, 2025

MIT license

MiniPhy2

Introduction
Installation
Command-Line Usage
- The compress command
Workflow - Compression of a single batch
Issues
Changelog
License
Support & Contact

Introduction

MiniPhy2 is the second version of the MiniPhy workflow for phylogenetic compression of large bacterial genome collections. It has been rewritten entirely in Rust and minimizes on-disk operations, which makes it much better suited to very large collections. The resulting compression performance should be nearly identical to that of the original MiniPhy.

Installation

Prerequisites:

Rust (latest stable release)
CMake (required by some of the Rust packages)

Installation from git:

git clone https://github.com/karel-brinda/miniphy2
cd miniphy2
make
./miniphy2 -h
#./target/release/miniphy2 -h

Downloading automatically built binaries: Go to https://github.com/karel-brinda/miniphy2/actions?query=CI and find the corresponding artifact.

Command-Line Usage

General Syntax

miniphy2 [command] [options] [arguments]

The `compress` command

Purpose: Compresses a single batch in a provided order (e.g., from AttoTree)

$ ./miniphy2 compress --help
Compress

Usage: miniphy2 compress [OPTIONS] <INPUT>...

Arguments:
  <INPUT>...  Files to include in the tar archive

Options:
  -l, --list           The provided files are lists of files
  -f, --force          Rewrite the output file if it already exists
  -u, --uncompressed   No TAR compression (otherwise compressed by xz -9 -T1 in memory)
  -o, --output <FILE>  Output file, - for stdout [default: -]
  -p, --prefix <STR>   Path prefix for files in the TAR file (e.g, batch1_ or batch1/) [default: ./]
  -h, --help           Print help

Workflow - Compression of a single batch

The input for compression consists of genome batches (each of at most approximately 10k genomes), obtained for instance through MiniPhy 1. The following steps compress a single batch.

1. Prepare input files

Generate a file containing genome paths:

find /batch/directory -name '*.fa' > input.txt

The resulting input.txt is the list of genome file locations.
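
Before building the tree, it can help to sanity-check the list. Step 3 rebuilds each path as /batch/directory/<leaf name>.fa, so, assuming the tree leaves are named after the file basenames (an assumption, not something stated here), two files with the same basename in different subdirectories would collide. A minimal sketch on a toy list (the file names toy_input.txt and toy_duplicates.txt and the paths in them are hypothetical):

```shell
# Toy list standing in for input.txt (hypothetical paths)
printf '%s\n' \
    /batch/directory/g1.fa \
    /batch/directory/g2.fa \
    /batch/directory/sub/g1.fa > toy_input.txt

# Number of listed genomes
wc -l < toy_input.txt

# Strip the directory part, then report basenames that occur more than once
sed 's|.*/||' toy_input.txt | sort | uniq -d > toy_duplicates.txt
cat toy_duplicates.txt
```

An empty toy_duplicates.txt means all basenames are unique. In the toy list above, g1.fa is reported because it appears in two directories.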

2. Compute a compressive phylogeny

Use AttoTree with the default parameters:

attotree -L input.txt -o tree.nw

3. Generate ordered genome paths

cat tree.nw | grep -o '[^,:()]*:' | sed 's/:$//' | grep -Ev ^$ \
   | awk -v d="/batch/directory" '{print d "/" $0 ".fa"}' \
   > phylogenetic_order.txt

Output: phylogenetic_order.txt, an ordered sequence of genome file paths for MiniPhy2 processing.
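
To see what the pipeline above does, it can be run on a small hand-written Newick tree. This is only an illustrative sketch: the leaf names and the files toy_tree.nw and toy_order.txt are hypothetical, and it relies on every leaf name being followed by a colon, that is, on the tree carrying branch lengths.

```shell
# Toy Newick tree with three leaves (hypothetical genome names)
printf '((genomeA:0.1,genomeB:0.2):0.05,genomeC:0.3);\n' > toy_tree.nw

# grep -o emits every "name:" token in left-to-right order; unlabeled
# internal nodes yield a bare ":", which becomes an empty line after sed
# and is dropped by the second grep. awk rebuilds the full file path.
grep -o '[^,:()]*:' toy_tree.nw | sed 's/:$//' | grep -Ev '^$' \
   | awk -v d="/batch/directory" '{print d "/" $0 ".fa"}' \
   > toy_order.txt

cat toy_order.txt
```

The result lists the three leaves in the order they appear in the tree, each turned back into a path under /batch/directory.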

4. Run compression

./miniphy2 compress -p 'batchX/' -lfo compressed_genomes.tar.xz phylogenetic_order.txt

The options instruct MiniPhy2 to read the genome paths from phylogenetic_order.txt (-l), compress the genomes in that order using xz -9 -T1, and save the result to compressed_genomes.tar.xz (-o), overwriting the file if it already exists (-f). Additionally, -p prepends batchX/ to each file name in the output archive, so everything ends up in a directory with this name.
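
One way to inspect the result is to list the archive members with the standard tar tool and check that the prefix and the order look as expected. The sketch below builds a toy archive with plain tar as a stand-in, since the exact member names MiniPhy2 writes are not shown here; all toy_* names and the two small FASTA files are hypothetical.

```shell
# Toy stand-in for a MiniPhy2 output archive, built with plain tar
mkdir -p toy_batch/batchX
printf '>g2\nACGT\n' > toy_batch/batchX/g2.fa
printf '>g1\nACGA\n' > toy_batch/batchX/g1.fa

# tar stores members in the order they are given on the command line
tar -cf toy_genomes.tar -C toy_batch batchX/g2.fa batchX/g1.fa

# List member names in archive order
# (for an xz-compressed archive such as compressed_genomes.tar.xz, use: tar -tJf)
tar -tf toy_genomes.tar > toy_members.txt
cat toy_members.txt
```

The listing shows the members in their stored order, here batchX/g2.fa before batchX/g1.fa, which is the property the phylogenetic ordering relies on.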

Issues

Please use GitHub issues.

Changelog

See Releases.

License

MIT

Support & Contact

Karel Brinda <karel.brinda@inria.fr>
