2 stable releases

2.0.1	Dec 4, 2024

#293 in Machine learning

GPL-3.0-or-later

140KB
1.5K SLoC

Dual Threshold Optimization

This library provides a comprehensive toolkit for performing Dual Threshold Optimization (DTO) originally proposed by Kang et al.

DTO compares two ranked lists of features (e.g. genes) to determine the rank threshold for each list that minimizes the hypergeometric p-value of the overlap of features. This method was originally applied to comparing the results of a paired set of binding assay results and perturbation assay results for a given transcription factor, but could be used with any two ranked lists of features.

DTO provides the following statistics on the optimal threshold pair and overlap:

An empirical p-value of the optimal overlap which is derived from a permutation-based null distribution.
An FDR estimation based on the derivations detailed in the paper linked above.

This crate offers both a library to incorporate DTO into other workflows, and a command-line binary for standalone use.

Note: Starting with version 2.0.0, the method was fully re-implemented in Rust. For the original implementation, which is the version used in the paper linked above, see version 1.0.0, implemented by Yiming Kang, in the releases.

Getting Started
Algorithmic Details
Troubleshooting
Acknowledgements

Getting started

Binaries for Linux, MacOS and Windows are provided in in the release tab. There are two flavors of releases for each OS:

The standard release: this is suitable for almost every user
An MPI enabled version: only if you want to parallelize across multiple machines. NOTE: For this version, MPI must be installed on the host system.

You can download pre-compiled binary for your operating system in the releases tab. MPI versions only exist for ubuntu. If one of the provided binaries doesn't work for your operating system, you can use the developer instructions to download the rust toolchain and compile a binary. Alternatively, open an Issue and we will help.

Installation

If you are on a Mac, for example, and you do not need MPI (most users), then you would download the binary called dual_threshold_optimization-macos-latest-default from the releases tab. There is also a windows executable, and both a default (non-mpi) and mpi version for ubuntu (which will work on most linux OS).

After downloading to your computer, you will need to make this executable by entering

chmod +x dual_threshold_optimization-macos-latest-default

in your terminal. For windows, if you are not using the terminal, consult the internet for the equivalent.

You may also want to rename the executable to something more manageable, e.g.

mv dual_threshold_optimization-macos-latest-default dual_threshold_optimization

to rename the executable to simply dual_threshold_optimization.

Using the cmd line

With the correct binary, you can print the help message like so:

dual_threshold_optimization --help

Dual Threshold Optimization CLI

Usage: dual_threshold_optimization [OPTIONS] --ranked-list1 <FILE> --ranked-list2 <FILE>

Options:
  -1, --ranked-list1 <FILE>
          Path to the first ranked feature list (CSV format).
          
          This should have two columns: "feature" and "rank". There should be **NO HEADER**.
          
          Rank is expected to be an integer. It is recommended that ties are handed with the `min` or `max` method.

  -2, --ranked-list2 <FILE>
          Path to the second ranked feature list (CSV format)
          
          This should have two columns: "feature" and "rank". There should be **NO HEADER**.
          
          Rank is expected to be an integer. It is recommended that ties are handed with the `min` or `max` method.

  -b, --background <FILE>
          Path to the background feature list (one feature per line, optional)

  -p, --permutations <PERMUTATIONS>
          Number of permutations to perform
          
          [default: 1000]

  -t, --threads <THREADS>
          Number of threads to use per task. For single-node parallelization, this will be the number of threads available on the machine.
          
          For multi-node parallelization, this will be the number of threads per task.
          
          Example: If you submit via Slurm with `-ntasks 4 --cpus-per-task 10`, then this value should be set to 10. This configuration will run 40 permutations in parallel.
          
          [default: 1]

  -m, --multi-node
          Enable multi-node mode using MPI. This requires that the program has been built with the `mpi` feature enabled

  -h, --help
          Print help (see a summary with '-h')

  -V, --version
          Print version

You can run this with the following minimal test data:

input list examples: list1, list2
background example: background

like this

# download list1
wget https://raw.githubusercontent.com/cmatKhan/Dual_Threshold_Optimization/refs/heads/rust_implementation/test_data/ranklist1.csv
# download list2
wget https://raw.githubusercontent.com/cmatKhan/Dual_Threshold_Optimization/refs/heads/rust_implementation/test_data/ranklist2.csv

# run the binary
dual_threshold_optimization -1 ranklist1.csv -2 ranklist2.csv -p 5 -t 1

This will output some run information to stderr, and a json to stdout. The json in the stdout is the output of the program. This is important because it means that you can re-direct the stdout to a file (see below) without saving the run metadata.

Output

The output is a json format string to stdout. To redirect this to a file, you would do the following:


dual_threshold_optimization -1 ranklist1.csv -2 ranklist2.csv -p 5 -t 1
 > output.json

This is what the output will look like:

{
  "empirical_pvalue": 0.8,
  "fdr": 0.0,
  "population_size": 30,
  "rank1": 24,
  "rank2": 13,
  "set1_len": 24,
  "set2_len": 13,
  "unpermuted_intersection_size": 12,
  "unpermuted_pvalue": 0.15632183908046102
}

Where the fields are the following:

empirical_pvalue: The quantile of the unpermuted minimum p-value in relation to the series of permuted minimal p-values
fdr: The false discovery rate where the sensitivity is set to 0.8. See the DTO paper for more details
population_size: The size of the background. If no background is explicity provided, this is the length of the input lists (when no background is provided, the lists must contain the same set of features)
rank1: The optimal rank for the unpermuted minimum p-value for list 1
rank2: The optimal rank for the unpermuted minimum p-value for list 2
set1_len: The number of features with rank less than or equal to rank1 in list1
set2_len: The number of features with rank less than or equal to rank2 in list2
unpermuted_intersection_size: The number of genes in the intersection of list1 and list2 with rank less than or equal to their respective optimal ranks
unpermuted p-value the optimal p-value of the unpermuted lists. For analysis purposes, the empirical p-value should be used.

Using the library

To use the library in your own Rust program, you can cargo add dual_threshold_optimization in your rust project. See the crates.io documentation for more information about what is provided in each of the submodules.

Developer installation and usage

It is assumed that you have the rust toolchain already installed.

git clone this repository
cd into Dual_Threshold_Optimization

For any of the commands below, you can add --features mpi to include the MPI feature. But, remember that this requires that MPI exist in your environment (e.g. openMPI)

You can build an optimized binary with

cargo build --release

You can run the tests with:

cargo test

and you can run the debug binary with

cargo run -- --help

Note that there is a build profile for time and memory performance profiling which will build a release version with the debug flags on:

cargo build --profile release-debug

Test data

Minimal test data can be found in the test_data subdirectory

Performance Profiling

I recommend profiling with hyperfine for runtime and heaptrack for memory. The results of profiling on the test data are in the /profiling subdirectory. Use the release-debug profile to build an executable for performance profiling.

Pre-commit and CI

Pre-commit is set up to run cargo fmt and clippy when you commit changes. There is also github actions CI set up to run the test suite, the linters (fmt and clippy), and on pulls to main, to create a release. In order for the release workflow to succeed, the version in Cargo.toml must not be the same as the current state of main. The release CI will build the binaries and add them to the release. You are responsible for updating the release notes after the workflow completes.

Algorithmic details

The following provides details on the DTO algorithm, step by step.

Initialize two ranked feature lists

Begin with two ranked lists of features, e.g. genes, where each feature has an id, e.g. a unique identifier for the gene, and a rank. The rank must be an integer and is expected to have ties handled with a method such as "min" or "max" where ties all are assigned the same rank.
Create a series of thresholds for each list based on the ranks

For each list, generate a series of thresholds T1, T2, ... . These thresholds are used to generate sets of features from each list to compare the overlap. The thresholds are calculated by the recurrence relation

T₁ = 1
T_n = ⌊ T_n-1 * 1.01 + 1 ⌋

The stopping condition is when the threshold meets or exceeds the largest rank. The final threshold is always set to the max rank. This series provides finer spacing at higher ranks, allowing more granular selection among top-ranked genes.

The effect of this equation is that for the first 100 ranks, the thresholds increment at the same rate as the ranks, so we have 1, 2, 3, ... . At 100, the resolution decreases by 2, eg 100, 102, 104, ... . For every additional 100 ranks after this, the resolution decreases by 1, so for instance: 200, 203, 206, ..., 402, 407, ..., 1705, 1723, 1741
Conduct a brute force search of the threshold pairs to find an optimal overlap

For each possible pair of thresholds, select the genes from each list with rank less than or equal to the respective threshold. Calculate the hypergeometric p-value by intersecting the feature sets. This is the core of the algorithm with a complexity of O(n^2) where n is the length of the threshold lists.
Report the optimal threshold pair

Return the threshold pair that describes the respective rank of each list that produces the feature sets that result in the minimum hypergeometric p-value (one-sided, upper only) across all tested threshold pairs. This threshold pair is considered optimal for identifying significant overlap between two ranked feature lists.

CAVEAT: Though infrequent, due to the interplay between parameters of the hypergeometric distribution, it is possible that multiple sets yield the same p-value, including the minimal p-value. When this occurs on the minimal p-value, the threshold pair that yields the largest overlap is selected as optimal. When there are multiple threshold pairs that have the same p-value and the same intersect size, the first in the set is chosen arbitrarily.
Use permutations to generate a null distribution for the minimal p-value

To assess the statistical significance of the identified threshold pair, run steps 3 and 4 multiple times (e.g., 1000 times) on randomized versions of the ranked lists (features assigned to ranks arbitrarily). This creates a null distribution of the minimal p-value and allows calculation of an empirical p-value of observing the previously identified optimal threshold pair by chance.
Calculate false discovery rate (FDR)

In the DTO paper, an FDR is derived. This FDR is estimated for the optimal threshold pair.

Troubleshooting

If you are using the MPI binary, then you must have MPI in your environment. If you do have MPI installed, but you get an error similar to the one below:

./dual_threshold_optimization: error while loading shared libraries: libmpi.so.40: cannot open shared object file: No such file or directory

Then you need to find where the libmpi.so.40 file lives and add it to your LD_LIBRARY_PATH manually. E.g.

export LD_LIBRARY_PATH=/ref/mblab/software/spack-0.22.2/opt/spack/linux-rocky9-x86_64/gcc-11.4.1/openmpi-5.0.3-vjscapwoywmullqs3lj2mmdf7vyge4rk/lib:$LD_LIBRARY_PATH

Acknowledgements

DTO was originally implemented by Yiming Kang. See version 1.0.0 for that implementation. That work was described in Dual threshold optimization and network inference reveal convergent evidence from TF binding locations and TF perturbation responses

Dependencies

~9MB
~172K SLoC

bin+lib dual_threshold_optimization