6 releases

0.1.5 Nov 4, 2022
0.1.4 Nov 4, 2022
0.1.3 Oct 26, 2022

#161 in Command-line interface

Download history 18750/week @ 2023-11-21 24537/week @ 2023-11-28 22312/week @ 2023-12-05 22809/week @ 2023-12-12 15062/week @ 2023-12-19 9765/week @ 2023-12-26 16406/week @ 2024-01-02 16285/week @ 2024-01-09 18164/week @ 2024-01-16 17899/week @ 2024-01-23 15773/week @ 2024-01-30 13850/week @ 2024-02-06 14692/week @ 2024-02-13 15183/week @ 2024-02-20 15096/week @ 2024-02-27 12311/week @ 2024-03-05

59,358 downloads per month
Used in 56 crates (7 directly)

Apache-2.0

155KB
1.5K SLoC

imara-diff

crates.io crates.io crates.io

Imara-diff is a solid (imara in swahili) diff library for rust. Solid refers to the fact that imara-diff provides very good runtime performance even in pathologic cases so that your application never appears to freeze while waiting on a diff. The performance improvements are achieved using battle tested heuristics used in gnu-diff and git that are known to perform well while still providing good results.

Imara-diff is also designed to be flexible so that it can be used with arbitrary collections and not just lists and strings and even allows reusing large parts of the computation when comparing the same file to multiple different files.

Imara-diff provides two diff algorithms:

  • The linear-space variant of the well known Myers algorithm
  • The Histogram algorithm which variant of the patience diff algorithm.

Myers algorithm has been enhanced with preprocessing and multiple heuristics to ensure fast runtime in pathological cases to avoid quadratic time complexity and closely matches the behavior of gnu-diff and git. The histogram algorithm was originally ported from git but has been heavily optimized. The Histogram algorithm outperforms Myers algorithm by 10% - 100% across a wide variety of workloads.

Limitations

Even with the optimizations in this crate, performing a large diff without any tokenization (like character diff for a string) does not perform well. To work around this problem a diff of the entire file with large tokens (like lines for a string) can be performed first. The Sink implementation can then perform fine-grained diff on changed regions. Note that this fine-grained diff should not be performed for pure insertions, pure deletions and very large changes.

In an effort to improve performance, imara-diff makes heavy use of pointer compression. That means that it can only support files with at most 2^31 - 2 tokens. This should be rarely an issue in practice for textual diffs, because most (large) real-world files have an average line-length of at least 8. That means that this limitation only becomes a problem for files above 16GB while performing line-diffs.

Benchmarks

The most used diffing libraries in the rust ecosystem are similar and dissimilar. The fastest diff implementation both of these offer is a simple implementation of Myers algorithm without preprocessing or additional heuristics. As these implementations are very similar only similar was included in the benchmark.

To provide a benchmark to reflects real-world workloads, the git history of different open source projects were used. For each repo two (fairly different) tags were chosen. A tree diff is performed with gitoxide and the pairs of files that should be saved are stored in memory. The diffs collected using this method are often fairly large, because the repositories are compared over a large span of time. Therefore, the tree diff of the last 30 commit before the tag (equivalent of git diff TAG^ TAG, git diff TAG^^ TAG^^) were also used to also include smaller diffs.

The benchmark measures the runtime of performing a line diff between the collected files. As a measure of complexity for each change (M + N) D was used where M and N are the lengths of the two compared files and D is the length of the edit script required to transform these files into each other (determined with Myers algorithm). This complexity measure is used to divide the changes into 10 badges. The time to compute the line diffs in each badge was benchmarked.

The plots below show the runtime for each average complexity (runtime is normalized by the number of diffs). Note that these plots are shown in logarithmic scale due to the large runtime of similar for complex diffs. Furthermore, to better highlight the performance of the Histogram algorithm, the speedup of the Histogram algorithm compared to the Myers algorithm is shown separately.

Linux

The sourcecode of the linux kernel.

Rust

The sourcecode of the rust compiler, standard library and various related tooling.

VScode

The sourcecode of the vscode editor.

Helix

The sourcecode of the helix editor.

Dependencies

~1.5MB
~23K SLoC