1 stable release

new 1.2.3 Mar 14, 2025

CC-BY-NC-SA-4.0

raxtax - raxtax Accelerates Taxonomic Classification

This project is heavily inspired by the SINTAX algorithm [1].

Usage

For maximum performance, build the program with cargo build --profile=ultra.

Usage: raxtax [OPTIONS] --database-path <DATABASE_PATH> --query-file <QUERY_FILE>

Options:
  -d, --database-path <DATABASE_PATH>  Path to the database fasta or bin file
  -i, --query-file <QUERY_FILE>        Path to the query file
      --skip-exact-matches             If used for mislabeling analysis, you want to skip exact sequence matches
      --tsv                            Output primary result file in tsv format
      --make-db                        Create a binary database to load instead of a fasta file for repeated execution
  -t, --threads <THREADS>              Number of threads
                                       If 0, uses all available threads [default: 0]
  -o, --prefix <PREFIX>                Output prefix
      --redo                           Force override of existing output files
      --pin                            Use thread pinning
  -v, --verbose...                     Increase logging verbosity
  -q, --quiet...                       Decrease logging verbosity
  -h, --help                           Print help
  -V, --version                        Print version

Format

Input files may be provided as gzip-compressed files.

Input Database

The input format for the database file is FASTA. Each sequence identifier should contain a field of the form tax=<lineage>;. Everything after tax= is parsed as a comma-separated list of lineage nodes, terminated by a semicolon. Lineages may differ in depth; the only requirement is that they can be parsed into a multifurcating tree. To aid readability, the examples in this README use lineages that run from phylum to sequence. For example, an entry may look like this:

# example sequence
>metadata;tax=Arthropoda,Insecta,Diptera,Muscidae,Musca,Musca_domestica;
ACTCGATAC
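
The header rule above can be sketched in a few lines of Rust. This is an illustration only, not raxtax's actual parser: the function name parse_lineage is hypothetical, and rejecting headers without a terminating semicolon is an assumption about how strictly the rule is enforced.

```rust
/// Extracts the lineage nodes from a FASTA header such as
/// `>metadata;tax=Arthropoda,Insecta,Diptera,Muscidae,Musca,Musca_domestica;`.
/// Hypothetical helper: returns `None` if the header has no `tax=` field,
/// no terminating `;`, or an empty lineage.
fn parse_lineage(header: &str) -> Option<Vec<&str>> {
    // Everything after the first `tax=` belongs to the lineage field.
    let (_, rest) = header.split_once("tax=")?;
    // The lineage is terminated by the next semicolon.
    let (lineage, _) = rest.split_once(';')?;
    if lineage.is_empty() {
        return None;
    }
    // Lineage nodes are comma-separated, ordered from root to leaf.
    Some(lineage.split(',').collect())
}
```

Calling parse_lineage on the example header yields the six nodes from Arthropoda to Musca_domestica, while a relaxed query header such as >query1 yields None.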

Input Query

The format for query sequences is also FASTA, but more relaxed than the database format:

# example sequence
>query1
ACTCGATAC

Output

raxtax will produce 2 primary output files under the prefix specified with -o (defaults to name_of_query_file.out/).

  1. <PREFIX>/raxtax.out is the full result of the analysis. It contains for each query sequence a line for each database sequence where the confidence value is above 0.01 (confidence values are between 0 and 1). If no database sequence fulfills this criterion, a single line containing the best match is printed. In this case, values are rounded up to 0.01. The format is (tab separated):
query1    Arthropoda,Insecta,Diptera,Muscidae,Musca,Musca_domestica    1.0,1.0,0.8,0.68,0.52,0.31  0.67456  0.71234

The first part is simply the query label. The second part is the taxonomic lineage of the respective database sequence. The third part contains the confidence values for each level of the taxonomic lineage. It is important to understand that these values are always relative to the sequences in the database and should therefore be interpreted carefully. To this end, we include a fourth and a fifth value, which indicate the confidence in the reported lineage (local signal) and the confidence in the confidence values themselves at the sequence level (global signal). Both are again between 0 and 1, where 1 indicates high confidence. For more information, see the manuscript.

  2. <PREFIX>/raxtax.log is the log file where more or less useful information accumulates. With the default command line parameters, only warnings and errors are collected. With -v, additional information about the runtime and the size of the database is printed. With -vv, debug messages are also included. Generally, if a warning or error occurs, the program will inform you through stderr and refer you to the log file if needed. This file also contains information about exact matches and inconsistent lineages (possible mislabeling).

  3. (optional via --tsv) <PREFIX>/raxtax.tsv is pretty much the same as the first output file but slightly more convenient for viewing in your favorite spreadsheet editor. In this file, the taxonomic lineage and confidence values are interleaved, and the query sequence is also printed at the end:

query1  Arthropoda  1.0 Insecta 1.0 Diptera 0.8 Muscidae    0.68    Musca   0.52    Musca_domestica 0.31    0.67456 0.71234 ACTCGATAC
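
For downstream processing, a line of raxtax.out can be split into its five tab-separated parts. The following is a minimal sketch under the format described above; the OutRecord struct and the parse_out_line function are hypothetical names and are not part of raxtax.

```rust
/// One line of `raxtax.out` (hypothetical type for illustration).
#[derive(Debug, PartialEq)]
struct OutRecord {
    query: String,
    lineage: Vec<String>,
    confidences: Vec<f64>,
    local_signal: f64,
    global_signal: f64,
}

/// Parses a tab-separated result line. Returns `None` if the line does not
/// have exactly five fields, a number fails to parse, or the lineage and
/// the confidence list differ in length.
fn parse_out_line(line: &str) -> Option<OutRecord> {
    let fields: Vec<&str> = line.trim_end().split('\t').collect();
    if fields.len() != 5 {
        return None;
    }
    let lineage: Vec<String> = fields[1].split(',').map(str::to_string).collect();
    // Collecting into Option<Vec<_>> fails as a whole if any value is malformed.
    let confidences: Vec<f64> = fields[2]
        .split(',')
        .map(|v| v.parse::<f64>().ok())
        .collect::<Option<Vec<f64>>>()?;
    // There must be one confidence value per lineage level.
    if lineage.len() != confidences.len() {
        return None;
    }
    Some(OutRecord {
        query: fields[0].to_string(),
        lineage,
        confidences,
        local_signal: fields[3].parse::<f64>().ok()?,
        global_signal: fields[4].parse::<f64>().ok()?,
    })
}
```

The same approach carries over to raxtax.tsv, where lineage nodes and confidence values alternate and the query sequence is the last field.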

Other Options

--skip-exact-matches may be useful when running the database against itself to identify mislabeled sequences. By default, if a query has exactly one exact sequence match, raxtax skips the analysis and outputs a confidence of 1.0 for that match. With this option, exact matches are not considered at all in the analysis of a query sequence.

--make-db can be used if you want to run the program with the same reference database for many different query files. If the reference database is large this will save significant time on repeat execution.

--threads may be omitted most of the time, and raxtax will use as many cores as your system has available. Because the analysis is embarrassingly parallel, this is a sensible default. However, if you experience problems due to hyper-threading, you might want to reduce the number of threads to increase parallel efficiency.

--redo will enable overwriting of existing output files. Use at your own risk!

--pin enables thread pinning. On Linux, this will try to avoid hyper-threading and crossing sockets whenever possible. On other platforms, threads will still be pinned, but in order of their IDs, which might affect performance negatively.

Important Implementation Details

We suggest a threshold of 0.01 for confidence values to be considered (F64_OUTPUT_ACCURACY in src/utils.rs). For technical reasons, this constant is the number of digits after the decimal point, so it is currently 2.
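
The relationship between the number of digits and the threshold can be illustrated as follows. This is a sketch, not the code in src/utils.rs: only the constant name F64_OUTPUT_ACCURACY comes from this README, while its type and the two helper functions are assumptions made for the example.

```rust
/// Number of digits after the decimal point. The name is taken from the
/// README; the type `u32` is an assumption.
const F64_OUTPUT_ACCURACY: u32 = 2;

/// Smallest confidence value representable with `digits` decimal places,
/// i.e. 10^(-digits). For 2 digits this is approximately 0.01.
/// Hypothetical helper for illustration.
fn confidence_threshold(digits: u32) -> f64 {
    10f64.powi(-(digits as i32))
}

/// Formats a confidence value with `digits` decimal places.
/// Hypothetical helper for illustration.
fn format_confidence(value: f64, digits: u32) -> String {
    format!("{:.*}", digits as usize, value)
}
```

With F64_OUTPUT_ACCURACY set to 2, confidence_threshold gives roughly 0.01, and values below that would print as 0.00 at this precision, which is presumably why the two settings are tied together.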

If the database contains duplicate sequences that have different lineages above the lowest taxonomic level, a warning will be emitted.

Why does this exist if there are so many taxonomic classifiers already, and how does it work?

We will soon publish a manuscript about this method and what we use it for.

Gigantic Databases

By default, raxtax uses 32-bit indices for indexing reference sequences. This makes things a lot faster, but trying to run it with more than $2^{32}$ (about 4.3 billion) reference sequences will fail. In this case, compile it with --features huge_db to use 64-bit indices (on 64-bit systems). An error message will be displayed if too many reference sequences are used with the 32-bit indices version.
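
One common way to implement such a switch in Rust is a feature-gated type alias combined with a checked conversion. The sketch below shows the idea only; apart from the feature name huge_db, every name in it (Index, check_database_size) is hypothetical, and the real implementation may look different.

```rust
// Index type selected at compile time by the `huge_db` feature.
#[cfg(feature = "huge_db")]
type Index = u64;
#[cfg(not(feature = "huge_db"))]
type Index = u32;

/// Checks whether every reference sequence can be addressed with `Index`.
/// Hypothetical helper: the largest index in use is `num_sequences - 1`.
fn check_database_size(num_sequences: u64) -> Result<(), String> {
    match num_sequences.checked_sub(1) {
        // An empty database needs no index at all.
        None => Ok(()),
        // The conversion fails if the largest index does not fit.
        Some(max_index) => Index::try_from(max_index).map(|_| ()).map_err(|_| {
            format!(
                "{} reference sequences exceed the index range; rebuild with --features huge_db",
                num_sequences
            )
        }),
    }
}
```

Without the feature, a database of exactly $2^{32}$ sequences still passes this check, and one more sequence produces the error.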

References

[1] Edgar, Robert C. "SINTAX: a simple non-Bayesian taxonomy classifier for 16S and ITS sequences." bioRxiv (2016): 074161.

This work is licensed under CC BY-NC-SA 4.0. To view a copy of this license, visit https://creativecommons.org/licenses/by-nc-sa/4.0/
