9 releases; newest: 2.0.0-alpha.8 (Feb 11, 2025)
MiniPhy2
- Introduction
- Installation
- Command-Line Usage
- Workflow - Compression of a single batch
- Issues
- Changelog
- License
- Support & Contact
Introduction
MiniPhy2 is the second version of the MiniPhy workflow for phylogenetic compression of large bacterial genome collections. This version has been entirely rewritten in Rust and minimizes on-disk operations, which makes it much more suitable for very large collections. The resulting compression performance should be nearly identical to that of the original MiniPhy.
Installation
Prerequisites:
- Rust (latest stable release)
- CMake (required by some of the Rust packages)
Installation from git:
git clone https://github.com/karel-brinda/miniphy2
cd miniphy2
make
./miniphy2 -h
#./target/release/miniphy2 -h
Downloading automatically built binaries: Go to https://github.com/karel-brinda/miniphy2/actions?query=CI and find the corresponding artifact.
Command-Line Usage
General Syntax
miniphy2 [command] [options] [arguments]
The compress command
Purpose: Compresses a single batch in a provided order (e.g., from AttoTree)
$ ./miniphy2 compress --help
Compress
Usage: miniphy2 compress [OPTIONS] <INPUT>...
Arguments:
<INPUT>... Files to include in the tar archive
Options:
-l, --list The provided files are lists of files
-f, --force Rewrite the output file if it already exists
-u, --uncompressed No TAR compression (otherwise compressed by xz -9 -T1 in memory)
-o, --output <FILE> Output file, - for stdout [default: -]
-p, --prefix <STR> Path prefix for files in the TAR file (e.g, batch1_ or batch1/) [default: ./]
-h, --help Print help
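As an illustration of how the options above combine, the following sketch wraps one call in a shell function, using the long option names from the help text. The function name compress_batch, the MINIPHY2 variable, and the naming of the output file after the batch are assumptions made for this example only; they are not part of MiniPhy2.

```shell
# Path to the binary; override it if miniphy2 lives elsewhere (assumed variable).
MINIPHY2="${MINIPHY2:-./miniphy2}"

# compress_batch is a hypothetical wrapper, not a MiniPhy2 command.
# $1: batch name, used both as the path prefix and as the archive name
# $2: text file listing genome paths, one per line, in the desired order
compress_batch() {
    "$MINIPHY2" compress --list --force \
        --prefix "$1/" \
        --output "$1.tar.xz" \
        "$2"
}
```

Calling compress_batch batch1 order.txt would then run miniphy2 compress with the list file order.txt and write batch1.tar.xz.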
Workflow - Compression of a single batch
The input for compression consists of genome batches (of at most approximately 10k genomes each), obtained for instance through MiniPhy 1. The following steps compress a single batch.
1. Prepare input files
Generate a file containing genome paths:
find /batch/directory -name '*.fa' > input.txt
The resulting input.txt is the list of genome file locations.
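Before building the tree, it can help to confirm that every path in the list points to a non-empty file. The sketch below is one way to do that; check_list is a hypothetical helper written for this example and is not shipped with MiniPhy2.

```shell
# check_list: report listed paths that are missing or empty.
# $1: text file with one genome path per line
# Returns 0 if all paths are fine, 1 otherwise.
check_list() {
    missing=0
    while IFS= read -r path; do
        if [ ! -s "$path" ]; then
            printf 'missing or empty: %s\n' "$path" >&2
            missing=1
        fi
    done < "$1"
    return "$missing"
}

# Run the check only when the list from step 1 is present.
if [ -f input.txt ]; then
    check_list input.txt && wc -l < input.txt
fi
```

When the check passes, the final command prints the number of genomes in the batch.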
2. Compute a compressive phylogeny
Use AttoTree with the default parameters:
attotree -L input.txt -o tree.nw
3. Generate ordered genome paths
cat tree.nw | grep -o '[^,:()]*:' | sed 's/:$//' | grep -Ev ^$ \
| awk -v d="/batch/directory" '{print d "/" $0 ".fa"}' \
> phylogenetic_order.txt
Output: phylogenetic_order.txt, an ordered sequence of genome file paths for MiniPhy2 processing.
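To see what the pipeline in this step does, it can be run on a tiny hand-written tree. The sketch below wraps the same commands in a function and feeds it a three-leaf Newick string. The function name newick_to_paths and the toy tree are assumptions for illustration. The pattern picks up any label followed by a colon, so a tree with named internal nodes would need extra filtering.

```shell
# newick_to_paths: hypothetical wrapper around the step 3 pipeline.
# $1: directory holding the genomes; the Newick tree is read from stdin.
newick_to_paths() {
    grep -o '[^,:()]*:' \
        | sed 's/:$//' \
        | grep -Ev '^$' \
        | awk -v d="$1" '{print d "/" $0 ".fa"}'
}

# Toy tree with leaves a, b, c and one unnamed internal node.
printf '%s\n' '((a:0.1,b:0.2):0.05,c:0.3);' | newick_to_paths /batch/directory
# prints:
#   /batch/directory/a.fa
#   /batch/directory/b.fa
#   /batch/directory/c.fa
```

The unnamed internal node yields an empty label, which the grep -Ev step drops; the leaves come out in their left-to-right order in the tree.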
4. Run compression
./miniphy2 compress -p 'batchX/' -lfo compressed_genomes.tar.xz phylogenetic_order.txt
The options instruct MiniPhy2 to compress the genomes listed in phylogenetic_order.txt, in that order, using xz -9 -T1, and to save the result into compressed_genomes.tar.xz, overwriting the file if it already exists. Additionally, batchX/ is prepended to each file name in the output archive, so everything ends up in a directory with this name.
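After compression, the archive can be checked with the standard xz and tar tools. The following is a minimal sketch; verify_archive is a hypothetical helper, not a MiniPhy2 command, and it assumes that xz and tar are installed.

```shell
# verify_archive: test the xz stream, then list archive members in stored order.
# $1: path to a .tar.xz archive
verify_archive() {
    xz -t "$1" || return 1       # fails if the compressed stream is corrupt
    xz -dc "$1" | tar -tf -      # prints one member name per line
}

# Run the check only when the archive from step 4 is present.
if [ -f compressed_genomes.tar.xz ]; then
    verify_archive compressed_genomes.tar.xz | head -n 5
fi
```

If the prefix was applied as described above, the listed names should start with batchX/.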
Issues
Please use GitHub issues.
Changelog
See Releases.
License
Support & Contact