Latest stable release: 1.0.2 (Dec 29, 2022). Earlier versions: 1.0.1, 1.0.0.
Powierża coefficient
The Powierża coefficient is a statistic on strings for gauging whether one string is an "abbreviation" of another. The function is not symmetric, so it is not a metric.
- Let `T` (text) be a non-empty string.
- Let `P` (pattern) be a non-empty subsequence of `T`.
- Let `p` be a partition of `P` and `p_i` be its elements, where:
  - every `p_i` is equal to some substring of `T`, `t_i`.
  - the substrings `t_i` do not overlap.
  - `t_i` are in the same order as `p_i`.

The Powierża coefficient is the number of elements of the shortest partition `p`, less one. Alternatively, it is the number of gaps between the substrings `t_i`.
Used terms:
- A substring is a subsequence made of consecutive elements only. A subsequence doesn't have to be a substring. For example, `xz` is a subsequence of `xyz` but it is not its substring.
- A partition of a sequence is a sequence of pairwise disjoint subsequences that, when concatenated, are equal to the entire original sequence.
Intuitive explanation
Take all characters from the pattern and, while preserving the original order, align them with the same characters in the text so that there are as few groups of characters as possible. The coefficient is the number of gaps between these groups.
Examples
| `P` | `T` | `p` | Powierża coefficient |
|---|---|---|---|
| `powcoeff` | `powierża coefficient` | `pow`, `coeff` | 1 |
| `abc` | `a_b_c` | `a`, `b`, `c` | 2 |
| `abc` | `abc` | `abc` | 0 |
| `abc` | `xyz` | (none) | not defined |
For more examples, see tests.
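The rows of the table can be checked with an exhaustive search that follows the intuitive explanation directly: try every way of placing the pattern's characters in the text, in order, and count a gap whenever two consecutive pattern characters do not land on adjacent text positions. This is only a sketch for checking small inputs; the function names `gaps_by_exhaustive_search` and `fewest_gaps_after` are hypothetical and are not the crate's API. It compares `char`s, which is an assumption about what counts as one element, and it takes exponential time in the worst case.

```rust
/// Fewest gaps over all in-order placements of `pattern` inside `text`.
/// Returns `None` when the pattern is empty or is not a subsequence of the text.
/// Hypothetical helper for illustration; not the crate's API.
pub fn gaps_by_exhaustive_search(pattern: &str, text: &str) -> Option<u32> {
    let p: Vec<char> = pattern.chars().collect();
    let t: Vec<char> = text.chars().collect();
    // The definition requires a non-empty pattern.
    let (first, rest) = p.split_first()?;
    // Text before the first matched character is not a gap,
    // so every starting position costs nothing.
    (0..t.len())
        .filter(|&j| t[j] == *first)
        .filter_map(|j| fewest_gaps_after(rest, &t, j))
        .min()
}

/// `prev` is the text index where the previous pattern character was placed.
/// Returns the fewest gaps needed to place `rest` after that index.
fn fewest_gaps_after(rest: &[char], text: &[char], prev: usize) -> Option<u32> {
    let (next, tail) = match rest.split_first() {
        Some((next, tail)) => (*next, tail),
        // Nothing left to place: trailing text is not a gap either.
        None => return Some(0),
    };
    ((prev + 1)..text.len())
        .filter(|&j| text[j] == next)
        .filter_map(|j| {
            // Adjacent placement continues the current group; anything else opens one gap.
            let gap = if j == prev + 1 { 0 } else { 1 };
            fewest_gaps_after(tail, text, j).map(|g| g + gap)
        })
        .min()
}
```

Under these assumptions, `gaps_by_exhaustive_search("abc", "a_b_c")` evaluates to `Some(2)` and `gaps_by_exhaustive_search("abc", "xyz")` to `None`, matching the table.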
Use case
The Powierża coefficient is used in `kn` and in `nushell` to determine which directory name better matches an abbreviation. Many other string coefficients and metrics were found unsuitable, including Levenshtein distance. Levenshtein distance is biased in favour of short strings. For example, the Levenshtein distance from `gra` to `programming` is greater than to `gorgia`, even though `gorgia` does not "resemble" the abbreviation. The Powierża coefficient for these pairs of strings is 0 and 2, so `programming` would be chosen (correctly).
Powierża algorithm
The algorithm was inspired by the Wagner–Fischer algorithm. It is also very similar to a solution to the Longest Common Subsequence Problem. All of these algorithms are based on a matrix. Whereas in the Wagner–Fischer algorithm (WF) there are 3 types of moves (horizontal, diagonal and vertical), in my algorithm there are only two: horizontal and diagonal. The main idea is that the 'cost' of a gap is always 1, no matter how long. (In WF the cost of a gap is its length.)
That means the algorithm must differentiate between cells that were filled in horizontal moves and cells that were filled in diagonal moves. Cells of the first type contain `Gap(score)`; cells of the second type contain `Continuation(score)`. A horizontal move results in `Gap(score)` if the original cell contains `Gap(score)` and in `Gap(score + 1)` if the original cell contains `Continuation(score)`. The algorithm prefers moves that result in a lower score; if both moves result in the same score, it prefers the horizontal move, because a `Gap` cell can later be extended horizontally at no extra cost, whereas extending a `Continuation` cell horizontally costs 1.
- Create a matrix with `n` rows and `m` columns, where `m` is the length of `T` and `n` is the length of `P`, so that each row corresponds to an element of the pattern and each column to an element of the text. `n` must be less than or equal to `m`. Each cell can either be empty (the initial state) or contain either `Gap(score)` or `Continuation(score)`.

- Begin filling the matrix from left to right and from top to bottom. The first row is special: the cell in column `x` and row `y` is set to `Continuation(0)` if the `x`th element of `T` and the `y`th element of `P` are equal. Otherwise, it is set to `Gap(score + cost)`, where `score` is the score of its left neighbor and `cost` is the cost of a horizontal move from that neighbor (0 if it contains `Gap`, 1 if it contains `Continuation`). If its left neighbor is empty, the cell is left empty as well.

- Other cells are filled according to these rules:

  Let `x` be `a`'s upper-left neighbor and `y` be its left neighbor (layout: `x _` in the upper row, `y a` in the lower row).

  The cost of a diagonal move is 0, but such a move is only possible if the text element for `a`'s column and the pattern element for `a`'s row are equal and if `x` isn't empty. After the move, `a` is set to `Continuation(score)`, where `score` is `x`'s score.

  The cost of a horizontal move is 0 if `y` contains `Gap` and 1 if `y` contains `Continuation`. Such a move is only possible if `y` isn't empty. After the move, `a` is set to `Gap(score + cost)`, where `score` is `y`'s score.

  - If there are no available moves, leave `a` empty.
  - If there's only one available move, make it.
  - If there are two available moves and their scores are equal, make the horizontal move.
  - If there are two available moves and their scores aren't equal, make the move with the lower score.

- The Powierża coefficient is the least value in the last row. In some cases there are no values in the last row and the coefficient is not defined.
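The steps above can be sketched in Rust as follows. This is an illustration under several assumptions, not the crate's source: the names `Cell` and `coefficient_by_matrix` are hypothetical, rows are taken to correspond to pattern elements and columns to text elements, elements are compared as `char`s, only the previous and current rows are kept instead of the full matrix, and no cell-skipping optimization is attempted. Ties between a diagonal and a horizontal move are resolved in favour of the horizontal move, as in the rule list.

```rust
/// State of a filled cell; an empty cell is represented as `None`.
#[derive(Clone, Copy, Debug, PartialEq)]
enum Cell {
    /// Filled by a horizontal move.
    Gap(u32),
    /// Filled by a diagonal move (or a match in the first row).
    Continuation(u32),
}

impl Cell {
    fn score(self) -> u32 {
        match self {
            Cell::Gap(s) | Cell::Continuation(s) => s,
        }
    }

    /// Result of a horizontal move out of this cell:
    /// free from a `Gap`, cost 1 from a `Continuation`.
    fn horizontal(self) -> Cell {
        match self {
            Cell::Gap(s) => Cell::Gap(s),
            Cell::Continuation(s) => Cell::Gap(s + 1),
        }
    }
}

/// Hypothetical sketch of the matrix algorithm; not the crate's API.
pub fn coefficient_by_matrix(pattern: &str, text: &str) -> Option<u32> {
    let p: Vec<char> = pattern.chars().collect();
    let t: Vec<char> = text.chars().collect();
    if p.is_empty() || p.len() > t.len() {
        return None;
    }

    // First row: a match starts at score 0, otherwise extend the left neighbor.
    let mut prev: Vec<Option<Cell>> = Vec::with_capacity(t.len());
    for (col, &tc) in t.iter().enumerate() {
        let left = if col == 0 { None } else { prev[col - 1] };
        let cell = if tc == p[0] {
            Some(Cell::Continuation(0))
        } else {
            left.map(Cell::horizontal)
        };
        prev.push(cell);
    }

    // Remaining rows, one per further pattern element.
    for &pc in &p[1..] {
        let mut cur: Vec<Option<Cell>> = Vec::with_capacity(t.len());
        for (col, &tc) in t.iter().enumerate() {
            let (upper_left, left) = if col == 0 {
                (None, None)
            } else {
                (prev[col - 1], cur[col - 1])
            };
            let diagonal = if tc == pc {
                upper_left.map(|x| Cell::Continuation(x.score()))
            } else {
                None
            };
            let horizontal = left.map(Cell::horizontal);
            let cell = match (diagonal, horizontal) {
                // Two moves available: lower score wins, horizontal wins ties.
                (Some(d), Some(h)) => Some(if h.score() <= d.score() { h } else { d }),
                // Zero or one move available.
                (d, h) => d.or(h),
            };
            cur.push(cell);
        }
        prev = cur;
    }

    // Least score in the last row; `None` if the row has no filled cells.
    prev.iter().flatten().map(|c| c.score()).min()
}
```

The tie-breaking rule matters for inputs such as pattern `abc` and text `abxbyc`: at the second `b`, the diagonal and horizontal moves both score 1, and keeping `Gap(1)` lets the gap run on to `c` for a final coefficient of 1 (groups `ab` and `c`).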
Illustration
Cells with G's were filled in horizontal moves and those with C's were filled in diagonal moves. The numbers next to the letters are cells' scores. Red cells were skipped because of an optimization. Yellow cells were left empty. The coefficient is 2.
Benchmarks
The algorithm was compared with strsim's `levenshtein` in a benchmark run on the author's computer:
- Levenshtein distance: `[1.2908 µs 1.2946 µs 1.2987 µs]`
- Powierża coefficient: `[1.7718 µs 1.7748 µs 1.7778 µs]`