5 releases (3 stable)
1.0.2 | Feb 6, 2024 |
---|---|
1.0.1 | May 25, 2022 |
1.0.0 | May 24, 2022 |
0.2.1 | May 22, 2022 |
0.1.1 | May 21, 2022 |
#148 in Science
48KB
672 lines
pseqsid
Calculates pairwise sequence identity, similarity and normalized similarity score of proteins in a multiple sequence alignment.
Usage
USAGE:
pseqsid [OPTIONS] <MSA>
ARGS:
<MSA> Multiple Sequence Alignment file
OPTIONS:
-i, --identity Calculate pairwise sequence identity
-s, --similarity Calculate pairwise sequence similarity
-n, --nss Calculate pairwise sequence Normalized Similarity Score
-l, --length <LENGTH> Sequence length to be use for identity and similarity calculations
[default: smallest] [possible values: smallest, mean, largest,
alignment]
-g, --grouping <GROUPING> Similarity amino acid grouping definition file. A default one is
created if required and not provided
-m, --matrix <MATRIX> Type of matrix to be used for Normalized Similarity Score [default:
blosum62] [possible values: blosum62, pam250, gonnet]
-p, --po <PO> Gap opening penalty (Po) [default: 10.0]
-e, --pe <PE> Gap extending penalty (Pe) [default: 0.5]
-t, --threads <THREADS> Number of threads to use. 0 use all available threads [default: 0]
-h, --help Print help information
-V, --version Print version information
About the argument and options
<MSA>
is the filepath of a protein multiple sequence alignment in FASTA format. Only standard amino acids and gap (-
) are accepted. An alignment of DNA sequences will be interpreted as proteins with a bunch of Ala, Gly, Thr and Cys residues, giving meaningless similarity and Normalized Similarity Score (NSS) values.
-i
--identity
activates the calculation of pairwise identity between the entries in the alignment. Sequence identity is defined as the percentage of identical residues in equivalent positions with respect to the sequence length. Which sequence length type is used is defined by -l
option. Gaps-only columns are ignored for all sequence length types, except for alignment
.
-l
--length
sequence length to be used. If the given sequence pair has gaps there are four possible different sequence length values to be used: the smallest, the average (mean), largest and the alignment length (this last considers gap-only columns, that are ignored for the previous three). Which type of length to select will depend on the intended use of your analysis. For example, if you have a hypothetical protein prot_short
and you are using a template protein template_long
to model prot_short
structure, and prot_short
has shorter sequence than template_long
, then using -l
smallest
will tell you the identity value you need to asses how feasible you homology model could be. In other situation, if you want to use the identity value to know how similar any given pair of sequence is in your alignment, you should probably use the -l
mean
or alignment
.
-s
--similarity
activates the calculation of pairwise similarity between the entries in the alignment. Sequence similarity is defined as the percentage of identical or similar residues in equivalent positions with respect to the sequence length. Most of what was explained above for identity applies to similarity too. Amino acid similarity groups can be defined in a file and provided with the -g
option.
-g
--grouping
filepath to the similarity amino acid grouping file. If none is provided and -s
option is given, then a default one, named default_aa_similarity_groups.txt is created and used. This file format is simple: aa group name
: single_letter_aa_names
. #
symbol comments out the rest of the line. Each group name can be defined just once and any amino acid can belong to only one group (or none). Only standard amino acids are accepted. Each group must have at least two amino acids. Use default file default_aa_similarity_groups.txt as a template for custom grouping definitions if needed.
-n
--nss
activates the calculation of pairwise Normalized Similarity Score, using the formula:
$S = \frac{(\sum M_{ij} - oP_o -eP_e)(\sum M_{ii} + \sum M_{jj})}{2\sum M_{ii}\sum M_{jj}}$
Where:
$M_{ij}$ are log-odds values for each pair of aligned amino acids ij obtained from substitution matrices. Which substitution matrix to use is defined by the -m
option.
$M_{ii}$ and $M_{jj}$ are the log-odds values corresponding to conserving the given residue for the first and the second sequence in the pair, respectively.
$o$ is the number of gap openings in the sequence pair.
$P_o$ is the gap opening penalty, as defined by -p
option.
$e$ is the total number of gaps in the sequence pair. Gap-only columns are ignored.
$P_e$ gap extension penalty, as defined by -e
option.
-m
--matrix
substitution matrix to be used for normalized similarity score calculation. Values corresponding to ambiguous amino acid definitions (such as B
, Z
, X
) are ignored and a new column and row is added for gaps -
with value zero. All three matrices are public, and available from multiple sources, in case you want to inspect them.
-p
--po
gap opening penalty to be used for normalized similarity score calculation.
-e
--pe
gap extension penalty to be used for normalized similarity score calculation.
Output
Identity, similarity and NSS matrices are saved as CSV files, which filenames are derived from <MSA>
, appending a proper substring. As these are symmetric matrices only the lower half is saved. Diagonals are ignored as identity and similarity of each protein sequence with itself is 100%, and NSS is 1.
Installation
Using snap:
snap install pseqsid
Using cargo:
cargo install pseqsid
Or you can download and build the crate yourself.
Dependencies
~4–11MB
~122K SLoC