Remove human reads from a sequencing run

5 unstable releases

0.3.0 Oct 1, 2024
0.2.1 Sep 23, 2024
0.2.0 Sep 23, 2024
0.1.1 Jul 22, 2024
0.1.0 Dec 14, 2023

NoHuman

License: MIT | DOI: 10.1093/gigascience/giae010

👤🧬🚫 Remove human reads from a sequencing run 👤🧬️🚫

nohuman removes human reads from sequencing data by classifying them with kraken2 against a custom database built from all of the genomes in the Human Pangenome Reference Consortium's (HPRC) first draft human pangenome reference. It accepts reads from any sequencing technology. The development of this method is described in the paper listed under Cite.

Install

Conda

$ conda install -c bioconda nohuman

Precompiled binary

[!IMPORTANT] This install method does not install kraken2; you will need to install it yourself.

curl -sSL nohuman.mbh.sh | sh
# or with wget
wget -nv -O - nohuman.mbh.sh | sh

You can also pass options to the script, like so:

$ curl -sSL nohuman.mbh.sh | sh -s -- --help
install.sh [option]

Fetch and install the latest version of nohuman, if nohuman is already
installed it will be updated to the latest version.

Options
        -V, --verbose
                Enable verbose output for the installer

        -f, -y, --force, --yes
                Skip the confirmation prompt during installation

        -p, --platform
                Override the platform identified by the installer [default: apple-darwin]

        -b, --bin-dir
                Override the bin installation directory [default: /usr/local/bin]

        -a, --arch
                Override the architecture identified by the installer [default: x86_64]

        -B, --base-url
                Override the base URL used for downloading releases [default: https://github.com/mbhall88/nohuman/releases]

        -h, --help
                Display this help message

Cargo

[!IMPORTANT] This install method does not install kraken2; you will need to install it yourself.

$ cargo install nohuman

Container

Docker images are hosted on the GitHub Container registry.

apptainer

Prerequisite: apptainer (previously singularity)

$ URI="docker://ghcr.io/mbhall88/nohuman:latest"
$ apptainer exec "$URI" nohuman --help

The above uses the latest version. To pin a specific version, use its tag:

$ VERSION="0.2.1"
$ URI="docker://ghcr.io/mbhall88/nohuman:${VERSION}"

docker

Prerequisite: docker

$ docker pull ghcr.io/mbhall88/nohuman:latest
$ docker run ghcr.io/mbhall88/nohuman:latest nohuman --help

All available tags are listed on the image's GitHub Container registry page.

Build from source

[!IMPORTANT] This install method does not install kraken2; you will need to install it yourself.

$ git clone https://github.com/mbhall88/nohuman.git
$ cd nohuman
$ cargo build --release
$ target/release/nohuman -h

Usage

Download the database

$ nohuman -d

By default, this places the database in $HOME/.nohuman/db. To download it somewhere else, use the --db option.
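When the database lives outside the default location, later runs need the same --db path. A minimal sketch of a shell wrapper that supplies it every time; the NOHUMAN_DB variable and the nohuman_db function are hypothetical names used only here (nohuman itself is not documented to read any such variable):

```shell
# NOHUMAN_DB is a hypothetical variable used only by this sketch;
# nohuman only learns about the path through its --db option.
NOHUMAN_DB="${NOHUMAN_DB:-$HOME/.nohuman/db}"

# Hypothetical wrapper: forwards all arguments to nohuman, always adding --db.
nohuman_db() {
    nohuman --db "$NOHUMAN_DB" "$@"
}
```

With this in place, `nohuman_db -d` downloads the database to `$NOHUMAN_DB`, and `nohuman_db -t 4 in.fq` reuses it.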

Check dependencies are available

$ nohuman -c
[2023-12-14T04:10:46Z INFO ] All dependencies are available

Remove human reads

$ nohuman -t 4 in.fq

This passes 4 threads to kraken2 and writes the clean reads to in.nohuman.fq.

You can specify where to write the output file with -o:

$ nohuman -t 4 -o clean.fq in.fq

If you have paired-end Illumina reads, pass both files:

$ nohuman -t 4 in_1.fq in_2.fq

To specify different paths for the paired output files:

$ nohuman -t 4 --out1 clean_1.fq --out2 clean_2.fq in_1.fq in_2.fq
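For a directory of paired-end samples, the invocation above can be wrapped in a loop. A minimal sketch, assuming files follow a hypothetical `<sample>_1.fq` / `<sample>_2.fq` naming scheme; `nohuman_pairs` and its arguments are illustrative names, not part of nohuman:

```shell
# Hypothetical helper: run nohuman on every *_1.fq / *_2.fq pair in a directory.
# The _1/_2 naming scheme and the argument order are assumptions of this sketch.
# Usage: nohuman_pairs <input_dir> <output_dir> [threads]
nohuman_pairs() {
    in_dir="$1"
    out_dir="$2"
    threads="${3:-4}"
    mkdir -p "$out_dir" || return 1
    for r1 in "$in_dir"/*_1.fq; do
        [ -e "$r1" ] || continue          # the glob matched nothing
        r2="${r1%_1.fq}_2.fq"             # expected mate file
        if [ ! -e "$r2" ]; then
            echo "skipping $r1: no mate file $r2" >&2
            continue
        fi
        sample="$(basename "${r1%_1.fq}")"
        nohuman -t "$threads" \
            --out1 "$out_dir/${sample}_1.fq" \
            --out2 "$out_dir/${sample}_2.fq" \
            "$r1" "$r2" || return 1
    done
}
```

For example, `nohuman_pairs raw clean 8` would process every pair in `raw/` with 8 threads and write the results to `clean/`. Samples without a mate file are skipped with a warning rather than run as single-end.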

Set a minimum confidence score for kraken2 classifications:

$ nohuman --conf 0.5 in.fq

Write the kraken2 read classification output to a file:

$ nohuman -k kraken.out in.fq
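The file written by -k can be summarised with standard tools. The sketch below assumes kraken2's standard tab-separated output, in which the first column is C (classified) or U (unclassified); `kraken_summary` is a hypothetical helper name. With a human-only database, classified reads are presumably the ones flagged as human, but that mapping is an assumption here.

```shell
# Hypothetical helper: count classified (C) and unclassified (U) records in a
# kraken2 read classification file (tab-separated, status in the first column).
kraken_summary() {
    awk -F '\t' '
        $1 == "C" { c++ }
        $1 == "U" { u++ }
        END {
            total = c + u
            pct = (total > 0) ? 100 * c / total : 0
            printf "classified=%d unclassified=%d pct_classified=%.1f\n", c, u, pct
        }
    ' "$1"
}
```

For example, `kraken_summary kraken.out` prints a single line with the two counts and the classified percentage.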

[!TIP] Output compression is inferred from the specified output path(s). If no output path is provided, the same compression as the input is used. To override the output compression format, use the --output-type option. Supported compression formats are gzip (.gz), zstandard (.zst), bzip2 (.bz2), and xz (.xz). If multiple threads are provided, they are also used to compress the output (where possible).
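To make the inference rule concrete, here is a small sketch that maps an output file name to the corresponding --output-type code (u, b, g, x, z). It mirrors the rule as described in the tip and is not nohuman's own implementation; `output_type_for` is a hypothetical name.

```shell
# Hypothetical helper mirroring the extension rule described in the tip.
# Prints the --output-type code that a given file name would correspond to.
output_type_for() {
    case "$1" in
        *.gz)  echo g ;;   # gzip
        *.zst) echo z ;;   # zstandard
        *.bz2) echo b ;;   # bzip2
        *.xz)  echo x ;;   # xz
        *)     echo u ;;   # anything else: uncompressed
    esac
}
```

This can be handy in a pipeline for checking which compression a planned output path implies before launching a long run.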

Keep human reads

You can invert the functionality of nohuman to keep only the human reads by using the --human/-H flag.
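If both fractions are needed, one option is to run nohuman twice on the same input, once in each mode. A minimal sketch using only the -o and --human options; `split_human` and the output file names are hypothetical:

```shell
# Hypothetical helper: write the non-human and the human reads of one input
# to two separate files. Output names are chosen by the caller.
# Usage: split_human <input> <clean_output> <human_output>
split_human() {
    input="$1"
    clean_out="$2"
    human_out="$3"
    # Default mode: human reads are removed.
    nohuman -o "$clean_out" "$input" || return 1
    # Inverted mode: only the human reads are written.
    nohuman --human -o "$human_out" "$input"
}
```

For example, `split_human in.fq clean.fq human.fq`. This classifies the input twice, so it trades run time for simplicity.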

$ nohuman -h
Remove human reads from a sequencing run

Usage: nohuman [OPTIONS] [INPUT]...

Arguments:
  [INPUT]...  Input file(s) to remove human reads from

Options:
  -o, --out1 <OUTPUT_1>       First output file.
  -O, --out2 <OUTPUT_2>       Second output file.
  -c, --check                 Check that all required dependencies are available and exit
  -d, --download              Download the database
  -D, --db <PATH>             Path to the database [default: ~/.nohuman/db]
  -F, --output-type <FORMAT>  Output compression format. u: uncompressed; b: Bzip2; g: Gzip; x: Xz (Lzma); z: Zstd
  -t, --threads <INT>         Number of threads to use in kraken2 and optional output compression. Cannot be 0 [default: 1]
  -H, --human                 Output human reads instead of removing them
  -C, --conf <[0, 1]>         Kraken2 minimum confidence score [default: 0.0]
  -k, --kraken-output <FILE>  Write the Kraken2 read classification output to a file  
  -v, --verbose               Set the logging level to verbose
  -h, --help                  Print help (see more with '--help')
  -V, --version               Print version

Full usage

$ nohuman --help
Remove human reads from a sequencing run

Usage: nohuman [OPTIONS] [INPUT]...

Arguments:
  [INPUT]...
          Input file(s) to remove human reads from

Options:
  -o, --out1 <OUTPUT_1>
          First output file.

          Defaults to the name of the first input file with the suffix "nohuman" appended.
          e.g. "input_1.fastq" -> "input_1.nohuman.fq".
          Compression of the output file is determined by the file extension of the output file name.
          Or by using the `--output-type` option. If no output path is given, the same compression
          as the input file will be used.

  -O, --out2 <OUTPUT_2>
          Second output file.

          Defaults to the name of the second input file with the suffix "nohuman" appended.
          e.g. "input_2.fastq" -> "input_2.nohuman.fq".
          Compression of the output file is determined by the file extension of the output file name.
          Or by using the `--output-type` option. If no output path is given, the same compression
          as the input file will be used.

  -c, --check
          Check that all required dependencies are available and exit

  -d, --download
          Download the database

  -D, --db <PATH>
          Path to the database

          [default: ~/.nohuman/db]

  -F, --output-type <FORMAT>
          Output compression format. u: uncompressed; b: Bzip2; g: Gzip; x: Xz (Lzma); z: Zstd

          If not provided, the format will be inferred from the given output file name(s), or the
          format of the input file(s) if no output file name(s) are given.

  -t, --threads <INT>
          Number of threads to use in kraken2 and optional output compression. Cannot be 0

          [default: 1]

  -H, --human
          Output human reads instead of removing them
          
  -C, --conf <[0, 1]>
          Kraken2 minimum confidence score

          [default: 0.0]
          
  -k, --kraken-output <FILE>
          Write the Kraken2 read classification output to a file
          
  -v, --verbose
          Set the logging level to verbose

  -h, --help
          Print help (see a summary with '-h')

  -V, --version
          Print version

Alternatives

Hostile is an alignment-based approach that performs well. It takes longer and uses more memory than nohuman's kraken2-based approach, but has slightly better accuracy for Illumina data. See the paper for more details and for other alternative approaches.

Cite

DOI:10.1093/gigascience/giae010

Hall, Michael B., and Lachlan J. M. Coin. “Pangenome databases improve host removal and mycobacteria classification from clinical metagenomic data.” GigaScience, April 4, 2024. https://doi.org/10.1093/gigascience/giae010

@article{hall_pangenome_2024,
	title = {Pangenome databases improve host removal and mycobacteria classification from clinical metagenomic data},
	volume = {13},
	issn = {2047-217X},
	url = {https://doi.org/10.1093/gigascience/giae010},
	doi = {10.1093/gigascience/giae010},
	urldate = {2024-04-07},
	journal = {GigaScience},
	author = {Hall, Michael B and Coin, Lachlan J M},
	month = jan,
	year = {2024},
	pages = {giae010},
}
