### 7 releases

Uses old Rust 2015

0.1.6 | Sep 3, 2017 |
---|---|

0.1.5 | Mar 31, 2017 |

0.1.3 | Feb 20, 2017 |

#**2025** in Command line utilities

**GPL-3.0**license

580KB

2K
SLoC

# Introduction

is a Rust library and command-line tool that corrects a subtitle given
a second "correct" subtitle. It will figure out offsets and where to
introduce or remove advertisement breaks to get the best alignment possible. It does not use
any language information so it even works for image based subtitles like VobSub.`aligner`

## Usage

The most basic command is:

`$`` aligner reference_subtitle.ssa incorrect_subtitle.srt output.srt`

You can additionally adjust how much the algorithm tries to avoid introducing or removing a break:

`#`` split-penalty is a value between 0 and 100
$ aligner reference_subtitle.ssa incorrect_subtitle.srt output.srt --split-penalty 2.6
`

Currently supported are

, `.`srt

/`.`ssa

and `.`ass

files.`.`idx

## How to compile the binary

Install Rust and Cargo then run:

`#`` this will create ~/.cargo/bin/aligner
$ cargo install aligner
`

## How to use the library

Add this to your

:`Cargo .toml`

`[``dependencies``]`
`aligner ``=` `"`~0.1.6`"`

## Algorithm

At the core of the algorithm is the *rating* of an alignment. For each pair of subtitles (one from the reference subtitle on from the incorrect subtitle) the rating is

`overlapping_time``(`refsub`,` incsub`)` `/` `max``(``length``(`refsub`)``,` `length``(`incsub`)``)`

The maximum of this rating is 1, if and only if

. The total rating of an alignment is (for the most part) the sum of all ratings of all possible pairs. By moving the `refsub = incsub`

`incsubs`

around, we might get a better alignment. As a basic constraint, the order of the `incsubs`

will not be changed. So if we have to consecutive subtitles `start``(`incsubN`)` `<=` `start``(`incsubN`+``1``)`

, the corrected `'incsubs`

will still have `start``(``'incsubN``)` `<=` `start``(``'incsubN``+``1``)`

.If only this formula was used, the algorithm will probably create different offsets for each subtitle line. To avoid that, we have to use

value. For each consecutive subtitles where `split_penalty`

we add another `start``(`incsubN`)` `-` `start``(`incsubN`+``1``)` `=` `start``(``'incsubN``)` `-` `start``(``'incsubN``+``1``)`

to the total rating. That way, with every extra split we lose `split_penalty`

of rating.`split_penalty`

This algorithm computes the alignment which yields the maximum of all possible ratings. The algorithm is powered by the principle of dynamic programming.

To simplify the problem, we assume

is the start timestamp in milliseconds of a subtitle line `start``(`sub`)`

and `sub`

is true. Let's say `0` `<=` `start``(`sub`)`

computes the best rating/alignment for the first `get_rating``(`t`,` n`)`

incorrect subtitle lines with the additional constraint `n`

for each of these `0` `<=` `start``(`sub`)` `<=` t

subtitles.`n`

Of course we can now simply set

because if we have no incorrect subtitles to align, we have a rating of zero (independent of `get_rating``(`t`,` `0``)` `=` `0`

).`t`

Now we handle the case

. We can simply compute the overlapping rating (where the first incorrect subtitle starts at the "zero timepoint") with every reference subtitle and add up these values. With `get_rating``(``0``,` `1``)`

things get interesting. We can either have `get_rating``(``1``,` `1``)`

or `start``(`sub`)` `=` `0`

. Fortunaly we already have `start``(`sub`)` `=` `1`

, so we only need the rating where `get_rating``(``0``,` `1``)`

. This can be computed by adding up all overlapping ratings. Similarly we can compute `start``(`sub`)` `=` `1`

by taking the maximum of `get_rating``(``2``,` `1``)`

and the rating where `get_rating``(``1``,` `1``)`

. In this vein we can create `start``(`sub`)` `=` `2`

from `get_rating``(`t`+``1``,` `1``)`

. We can also speed up computing the overlapping rating, because the subtitle line will only be shifted by 1ms from `get_rating``(`t`,` `1``)`

to `start``(`sub`)` `=` t

. The subtitle will lose the rating for the segment `start``(`sub`)` `=` t `+` `1`

and gain the overlapping rating for the segment `[`t`,` t `+` `1``]`

on the other side. By creating a lookup-table for reference subtitles for every `[`t `+` `length``(`sub`)``,` t `+` `1` `+` `length``(`sub`)``]`

this process has a runtime of `t`

.`O (1)`

When do we stop? Well, the rating won't change anymore if

gets big. At the latest when `t`

is be greater than any of the reference subtitle lines, because after that the overlapping rating will always be zero. Let's call `start``(`sub`)`

the timepoint where all incorrect subtitles have been moved behind the reference subtitles. The best total rating is then `max_t`

.`get_rating``(``max_t``,` number_of_incorrect_subtitles`)`

Now we have all

to `get_rating``(``0``,` `1``)`

. To compute `get_rating``(``max_t``,` `1``)`

, which means that `get_rating``(``0``,` `2``)`

and `0` `<=` `start``(`sub0`)` `<=` `0`

. We already have the rating for `0` `<=` `start``(`sub1`)` `<=` `0`

in form of `sub0`

. We only need to add the overlapping rating for `get_rating``(``0``,` `1``)`

. To get `sub1`

we can either use `get_rating``(``1``,` `2``)`

(we leave `get_rating``(``0``,` `2``)`

where it is), or move `sub1`

to 1, which allows `start``(`sub1`)`

to be in `start``(`sub0`)`

. The best rating for `0` `<=` `start``(`sub0`)` `<=` `start``(`sub1`)` `=` `1`

for that range has been computed with `sub0`

, we only need to add the overlapping rating for `get_rating``(``1``,` `1``)`

. We proceed similarly to get `start``(`sub1`)` `==` `1`

: leave the `get_rating``(`t`+``1``,` `2``)`

like it was for `sub1`

or reposition the subtitle to `get_rating``(`t`,``2``)`

and use `start``(`sub1`)` `==` t`+``1`

.`get_rating``(`t`+``1``,``1``)` `+` `overlapping_rating``(`sub1`,`t`+``1``)`

With the same principle we proceed with

:`subN`

- initialize
`get_rating``(``0``,`n`)``=``get_rating``(``0``,`n`-``1``)``+``overlapping_rating``(`subN`,``0``)` - choose for

the maximum of`get_rating``(`t`+``1``,`n`)`

which means "leaving`get_rating``(`t`,`n`)`

" and`subN`

which means repositioning the`get_rating``(`t`+``1``,`n`-``1``)``+``overlapping_rating``(`t`+``1``,`subN`)``subN`

Until now we didn't use the

. We need to add the split penalty when `split_penalty`

is a specific value (the original distance `start``(`subN`)` `-` `start``(`subN`+``1``)`

). The trick here is seeing that we only need to consider the "repostion choice". The only time `diff``(`N`)`

is consulted after the inital phase is when `get_rating``(``_``,` n`-``1``)`

gets repositioned. `subN`

will then start at `subN`

and we consult `t +1`

`get_rating``(`t`+``1``,` n`-``1``)`

. So if `subN``-``1`

were positioned at `t``+``1``-``diff``(`N`-``1``)`

for `get_rating``(`t`+``1``-``diff``(`N`-``1``)``,` n`-``1``)`

we'd be able to get the `split_penalty`

. This is exactly the thing we will do when we are in a phase `n`

: We will not only have the "leave choice" or "reposition choice" but also the "nosplit choice". If we compute `get_rating``(`t`,` n`)`

, we can also compare the two other values with `get_rating``(`t`-`diffN`,` n`-``1``)` `+` `overlapping_rating``(`t`-`diffN`,` subN`)` `+` split_penalty

. The `get_rating``(`t`-`diffN`,` n`-``1``)` `+` `overlapping_rating``(`t`-`diffN`,` subN`)`

is again the best rating where `start``(`subN`)` `=` t`-`diffN

. We are allowed to add the `split_penalty`

because in the next phase `n``+``1`

, `subN``+``1`

will start at `t`

when `get_rating``(`t`-`diffN`,` n`)`

is looked up. So the final rating algorithm is:- initialize

with 0`get_rating``(`t`,``0``)` - initialize
`get_rating``(``0``,`n`)``=``get_rating``(``0``,`n`-``1``)``+``overlapping_rating``(`subN`,``0``)` - choose for

the maximum of`get_rating``(`t`+``1``,`n`)`

which means "leaving`get_rating``(`t`,`n`)`

" and`subN`

which means repositioning the`get_rating``(`t`+``1``,`n`-``1``)``+``overlapping_rating``(`t`+``1``,`subN`)`

and`subN`

which means doing a nosplit-repositioning for`get_rating``(`t`+``1``-`diffN`,`n`-``1``)``+``overlapping_rating``(`t`+``1``-`diffN`,`subN`)``+`split_penalty`subN`

To get the final alignment, we save for each phase

and `n`

where `t +1`

`subN`

was positioned (can be `t``+``1`

, `t``+``1``-`diffN

or the previous position). If we look up that value for `n ``=` number_of_incorrect_subtitles

and `t ``=` `max_t`

, we know where the last subtitles `subN`

has to be. We then know `start``(`subN`)`

. The best alignment of all previous subtitles is then computed with `get_rating``(``start``(`subN`)``,` n`-``1``)`

. So we look up the position for `subN``-``1`

in that table with `n``'` `=` n`-``1`

and `t``'` `=` `start``(`subN`)`

. That way we get all corrected positions of all incorrect subtitles and are done!Though this algorithm works (and was implemented in one of the early versions of

) it is neither fast nor space-efficient. Let's take a `aligner`

subtitle file which has about `45` minutes `=` `2700000` milliseconds `<` `max_t`

(these are realistic values). We build a table of `n = 900 subtitles`

`max_t` `*` n `=` `2430000000`

entries. We can discard the ratings of the phase `n-1`

after phase `n`

, but we always need to store the positions of the subtitle `subN`

. Let's assume we need 4 bytes to store them: we then have a table of `2430000000` `*` `4` bytes `=` `9720000000` bytes `=` `9` `GB`

of data in RAM!!! Even filling the table with zeros might take some noticeable time. But as it turns out we can compress that table in under 2 MB (most of the time; probably the best compression I've ever seen) with `delta encoding`

. The empirical foundation is that the choices almost never change from one `t`

to `t``+``1`

(about 10 to 1000 times for one phase). If we always take the- "leave choice", the position will always be

for every`t``+``1`

(rise by 1)`t``+``1` - "nosplit-reposition choice", the position will always be

(rise by 1)`t``+``1``-`diffN - "reposition choice", the position won't change from

(constant)`t``-``1`

So if we store values in a

tuple, where the first uncompressed value is `(`start`,` delta`,` length`)`

, the second is `start + 0 * delta`

`start ``+` `1` `*` delta

, the third is `start ``+` `2` `*` delta

, ..., the last is `start ``+` length `*` delta

, we can compress an entire phase into a few bytes. The same thing is applicable to the ratings. Without going into details: if we take the overlapping rating of a incorrect subtitle to a reference subtitle and "move" the incorrect subtitle from the far left to the far right we will have five segments of compressed values (first the rating will be zero, then rise linearly, then be constant, then fall linearly, then be zero again). The comparisons/choices can then be done for rating segments instead of single `t`

. This yields a speedup of at least one order of magnitude.#### Dependencies

~13MB

~260K SLoC