2 stable releases
2.0.1 | Dec 4, 2024 |
---|
#723 in Algorithms
140KB
1.5K
SLoC
Dual Threshold Optimization
This library provides a comprehensive toolkit for performing Dual Threshold Optimization (DTO) originally proposed by Kang et al.
DTO compares two ranked lists of features (e.g. genes) to determine the rank threshold for each list that minimizes the hypergeometric p-value of the overlap of features. This method was originally applied to comparing the results of a paired set of binding assay results and perturbation assay results for a given transcription factor, but could be used with any two ranked lists of features.
DTO provides the following statistics on the optimal threshold pair and overlap:
- An empirical p-value of the optimal overlap which is derived from a permutation-based null distribution.
- An FDR estimation based on the derivations detailed in the paper linked above.
This crate offers both a library to incorporate DTO into other workflows, and a command-line binary for standalone use.
Note: Starting with version 2.0.0, the method was fully re-implemented in Rust. For the original implementation, which is the version used in the paper linked above, see version 1.0.0, implemented by Yiming Kang, in the releases.
Table of Contents
Getting started
Binaries for Linux, MacOS and Windows are provided in in the release
tab. There are
two flavors of releases for each OS:
-
The standard release: this is suitable for almost every user
-
An MPI enabled version: only if you want to parallelize across multiple machines. NOTE: For this version, MPI must be installed on the host system.
You can download pre-compiled binary for your operating system in the releases
tab.
MPI versions only exist for ubuntu. If one of the provided binaries doesn't work for
your operating system, you can use the
developer instructions to download the rust
toolchain and compile a binary. Alternatively, open an Issue and we will help.
Installation
If you are on a Mac, for example, and you do not need MPI (most users), then you would
download the binary called dual_threshold_optimization-macos-latest-default
from the
releases tab. There is also a windows executable, and both a default (non-mpi) and mpi
version for ubuntu (which will work on most linux OS).
After downloading to your computer, you will need to make this executable by entering
chmod +x dual_threshold_optimization-macos-latest-default
in your terminal. For windows, if you are not using the terminal, consult the internet for the equivalent.
You may also want to rename the executable to something more manageable, e.g.
mv dual_threshold_optimization-macos-latest-default dual_threshold_optimization
to rename the executable to simply dual_threshold_optimization
.
Using the cmd line
With the correct binary, you can print the help message like so:
dual_threshold_optimization --help
Dual Threshold Optimization CLI
Usage: dual_threshold_optimization [OPTIONS] --ranked-list1 <FILE> --ranked-list2 <FILE>
Options:
-1, --ranked-list1 <FILE>
Path to the first ranked feature list (CSV format).
This should have two columns: "feature" and "rank". There should be **NO HEADER**.
Rank is expected to be an integer. It is recommended that ties are handed with the `min` or `max` method.
-2, --ranked-list2 <FILE>
Path to the second ranked feature list (CSV format)
This should have two columns: "feature" and "rank". There should be **NO HEADER**.
Rank is expected to be an integer. It is recommended that ties are handed with the `min` or `max` method.
-b, --background <FILE>
Path to the background feature list (one feature per line, optional)
-p, --permutations <PERMUTATIONS>
Number of permutations to perform
[default: 1000]
-t, --threads <THREADS>
Number of threads to use per task. For single-node parallelization, this will be the number of threads available on the machine.
For multi-node parallelization, this will be the number of threads per task.
Example: If you submit via Slurm with `-ntasks 4 --cpus-per-task 10`, then this value should be set to 10. This configuration will run 40 permutations in parallel.
[default: 1]
-m, --multi-node
Enable multi-node mode using MPI. This requires that the program has been built with the `mpi` feature enabled
-h, --help
Print help (see a summary with '-h')
-V, --version
Print version
You can run this with the following minimal test data:
- input list examples: list1, list2
- background example: background
like this
# download list1
wget https://raw.githubusercontent.com/cmatKhan/Dual_Threshold_Optimization/refs/heads/rust_implementation/test_data/ranklist1.csv
# download list2
wget https://raw.githubusercontent.com/cmatKhan/Dual_Threshold_Optimization/refs/heads/rust_implementation/test_data/ranklist2.csv
# run the binary
dual_threshold_optimization -1 ranklist1.csv -2 ranklist2.csv -p 5 -t 1
This will output some run information to stderr, and a json to stdout. The json in the stdout is the output of the program. This is important because it means that you can re-direct the stdout to a file (see below) without saving the run metadata.
Output
The output is a json format string to stdout. To redirect this to a file, you would do the following:
dual_threshold_optimization -1 ranklist1.csv -2 ranklist2.csv -p 5 -t 1
> output.json
This is what the output will look like:
{
"empirical_pvalue": 0.8,
"fdr": 0.0,
"population_size": 30,
"rank1": 24,
"rank2": 13,
"set1_len": 24,
"set2_len": 13,
"unpermuted_intersection_size": 12,
"unpermuted_pvalue": 0.15632183908046102
}
Where the fields are the following:
- empirical_pvalue: The quantile of the unpermuted minimum p-value in relation to the series of permuted minimal p-values
- fdr: The false discovery rate where the sensitivity is set to 0.8. See the DTO paper for more details
- population_size: The size of the background. If no background is explicity provided, this is the length of the input lists (when no background is provided, the lists must contain the same set of features)
- rank1: The optimal rank for the unpermuted minimum p-value for list 1
- rank2: The optimal rank for the unpermuted minimum p-value for list 2
- set1_len: The number of features with rank less than or equal to rank1 in list1
- set2_len: The number of features with rank less than or equal to rank2 in list2
- unpermuted_intersection_size: The number of genes in the intersection of list1 and list2 with rank less than or equal to their respective optimal ranks
- unpermuted p-value the optimal p-value of the unpermuted lists. For analysis purposes, the empirical p-value should be used.
Using the library
To use the library in your own Rust program, you can
cargo add dual_threshold_optimization
in your rust project. See the crates.io
documentation for more information about what is provided in each of the submodules.
Developer installation and usage
It is assumed that you have the rust toolchain already installed.
- git clone this repository
cd
intoDual_Threshold_Optimization
For any of the commands below, you can add --features mpi
to include the MPI
feature. But, remember that this requires that MPI exist in your environment
(e.g. openMPI)
You can build an optimized binary with
cargo build --release
You can run the tests with:
cargo test
and you can run the debug binary with
cargo run -- --help
Note that there is a build profile for time and memory performance profiling which will build a release version with the debug flags on:
cargo build --profile release-debug
Test data
Minimal test data can be found in the test_data
subdirectory
Performance Profiling
I recommend profiling with hyperfine
for runtime and heaptrack for memory.
The results of profiling on the test data are in the /profiling
subdirectory. Use the
release-debug
profile to build an executable for performance profiling.
Pre-commit and CI
Pre-commit is set up to run cargo fmt
and clippy
when you commit changes. There is
also github actions CI set up to run the test suite, the linters (fmt
and clippy
),
and on pulls to main
, to create a release. In order for the release workflow to
succeed, the version in Cargo.toml
must not be the same as the current state of
main
. The release CI will build the binaries and add them to the release. You are
responsible for updating the release notes after the workflow completes.
Algorithmic details
The following provides details on the DTO algorithm, step by step.
-
Initialize two ranked feature lists
Begin with two ranked lists of features, e.g. genes, where each feature has an id, e.g. a unique identifier for the gene, and a rank. The rank must be an integer and is expected to have ties handled with a method such as "min" or "max" where ties all are assigned the same rank.
-
Create a series of thresholds for each list based on the ranks
For each list, generate a series of thresholds T1, T2, ... . These thresholds are used to generate sets of features from each list to compare the overlap. The thresholds are calculated by the recurrence relation
T1 = 1
Tn = ⌊ Tn-1 * 1.01 + 1 ⌋The stopping condition is when the threshold meets or exceeds the largest rank. The final threshold is always set to the max rank. This series provides finer spacing at higher ranks, allowing more granular selection among top-ranked genes.
The effect of this equation is that for the first 100 ranks, the thresholds increment at the same rate as the ranks, so we have 1, 2, 3, ... . At 100, the resolution decreases by 2, eg 100, 102, 104, ... . For every additional 100 ranks after this, the resolution decreases by 1, so for instance: 200, 203, 206, ..., 402, 407, ..., 1705, 1723, 1741
-
Conduct a brute force search of the threshold pairs to find an optimal overlap
For each possible pair of thresholds, select the genes from each list with rank less than or equal to the respective threshold. Calculate the hypergeometric p-value by intersecting the feature sets. This is the core of the algorithm with a complexity of O(n^2) where n is the length of the threshold lists.
-
Report the optimal threshold pair
Return the threshold pair that describes the respective rank of each list that produces the feature sets that result in the minimum hypergeometric p-value (one-sided, upper only) across all tested threshold pairs. This threshold pair is considered optimal for identifying significant overlap between two ranked feature lists.
CAVEAT: Though infrequent, due to the interplay between parameters of the hypergeometric distribution, it is possible that multiple sets yield the same p-value, including the minimal p-value. When this occurs on the minimal p-value, the threshold pair that yields the largest overlap is selected as optimal. When there are multiple threshold pairs that have the same p-value and the same intersect size, the first in the set is chosen arbitrarily.
-
Use permutations to generate a null distribution for the minimal p-value
To assess the statistical significance of the identified threshold pair, run steps 3 and 4 multiple times (e.g., 1000 times) on randomized versions of the ranked lists (features assigned to ranks arbitrarily). This creates a null distribution of the minimal p-value and allows calculation of an empirical p-value of observing the previously identified optimal threshold pair by chance.
-
Calculate false discovery rate (FDR)
In the DTO paper, an FDR is derived. This FDR is estimated for the optimal threshold pair.
Troubleshooting
If you are using the MPI binary, then you must have MPI in your environment. If you do have MPI installed, but you get an error similar to the one below:
./dual_threshold_optimization: error while loading shared libraries: libmpi.so.40: cannot open shared object file: No such file or directory
Then you need to find where the libmpi.so.40
file lives and add
it to your LD_LIBRARY_PATH
manually. E.g.
export LD_LIBRARY_PATH=/ref/mblab/software/spack-0.22.2/opt/spack/linux-rocky9-x86_64/gcc-11.4.1/openmpi-5.0.3-vjscapwoywmullqs3lj2mmdf7vyge4rk/lib:$LD_LIBRARY_PATH
Acknowledgements
DTO was originally implemented by Yiming Kang. See version 1.0.0 for that implementation. That work was described in Dual threshold optimization and network inference reveal convergent evidence from TF binding locations and TF perturbation responses
Dependencies
~9MB
~166K SLoC