gawires-diff

gawires-diff is the diff system used in the gawires version control system. It is a fork of bidiff and will expand upon it!

gawires-diff will keep the Apache-2.0 OR MIT dual license.

Version 1.0.0 is a copy of bidiff; starting with 1.x.x, it will diverge.

bidiff


bidiff is a set of Rust crates that generate and apply patches for arbitrary binary files.

It is particularly well-suited to deploying software updates, and:

  • Is written in 100% safe Rust
  • Has been fuzzed thoroughly
  • Has reasonable memory requirements
  • Supports large (>2GiB) inputs
  • Supports multiple compression formats (brotli, zstd, etc.) via comde

License

This project (bidiff, bipatch and bic) is licensed under either of

  • Apache License, Version 2.0
  • MIT License

at your option.

Benchmarks

In this section, we compare using bidiff to generate patches against two versions of well-known software, versus just compressing the newer version with zstd (hereafter referred to as "naive").

All measurements are done on an Intel(R) Core(TM) i7-8750H CPU @ 2.20GHz, with 12 partitions and 1 MiB chunks.

zstd compression at level 21 is used in both "naive" and "bidiff" runs:

older                          newer         newer size  naive size  naive time  bidiff size  bidiff time
Wine 4.18 sources              4.19 sources  201 MiB     21.96 MiB   81.300s     182.26 KiB   5.220s
Linux 5.3.13 sources           5.4 sources   894.68 MiB  106.73 MiB  448.145s    6.14 MiB     57.799s
Chromium 78 binary (build 97)  (build 108)   257.04 MiB  79.73 MiB   116.975s    12.81 MiB    158.867s
Firefox 71 binary (build 11)   (build 12)    192.49 MiB  54.64 MiB   78.039s     6.56 MiB     81.847s

Repository organization

This repository contains three crates:

  • crates/bidiff contains the diff algorithm, and has an optional enc feature for serialization.
  • crates/bipatch contains code that reads and applies patches generated by bidiff's enc feature.
  • crates/bic is a demonstration command-line interface to the above two crates, which uses comde for compression.

The essential part is contained in the bidiff crate itself. The serialization/deserialization code is provided as a (practical) example of how to store patch files. It is very simplistic: a magic number, a version number, and then a series of ADD, COPY and SEEK instructions, with variable-length integer encoding.
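
As an illustration, here is roughly what such an instruction stream can look like in Rust. This is a hypothetical sketch, not bidiff's actual enc types; the real format should be read from the crate's source.

```rust
// Hypothetical sketch of a bsdiff-style instruction stream; the names and
// semantics here are illustrative, not bidiff's actual `enc` format.
enum Instruction {
    /// Emit `len` bytes: each output byte is the wrapping sum of a patch
    /// (delta) byte and the byte at the current cursor in "older".
    Add { len: u64 },
    /// Emit `len` bytes taken verbatim from the patch stream.
    Copy { len: u64 },
    /// Move the read cursor in "older" by a signed offset.
    Seek { offset: i64 },
}

/// LEB128-style variable-length integer encoding: 7 bits per output byte,
/// with the high bit set while more bytes follow.
fn write_varint(mut value: u64, out: &mut Vec<u8>) {
    loop {
        let byte = (value & 0x7f) as u8;
        value >>= 7;
        if value == 0 {
            out.push(byte);
            return;
        }
        out.push(byte | 0x80);
    }
}
```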

Note: bidiff and bipatch do not concern themselves with compression, but patch files MUST be compressed. Uncompressed, they are slightly larger than the "newer" file (but lower-entropy).
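
For instance, piping the patch through the zstd crate's streaming encoder costs only a few lines (any Write-based compressor works; bic uses comde to abstract over several formats):

```rust
use std::fs::File;
use std::io::{self, Write};

/// Returns a sink that zstd-compresses everything written to it.
/// Level 21 matches the benchmark runs above.
fn compressed_patch_writer(path: &str) -> io::Result<impl Write> {
    let file = File::create(path)?;
    Ok(zstd::Encoder::new(file, 21)?.auto_finish())
}
```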

Diffing is a memory- and CPU-intensive task. The requirements are discussed below. Patching, however, can be done in streaming fashion and requires very little memory (only small buffers to perform addition between the "older" file and parts of the "patch" file).
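
A minimal sketch of the streaming ADD step (a hypothetical function, not bipatch's API) shows why memory stays small: only fixed-size scratch buffers are needed.

```rust
use std::io::{self, Read, Write};

/// Hypothetical streaming ADD: read `len` delta bytes from `patch`, add them
/// (wrapping) to the corresponding bytes of `older`, and write the result.
/// Memory use is bounded by the two 4 KiB scratch buffers.
fn apply_add(
    older: &mut impl Read,
    patch: &mut impl Read,
    out: &mut impl Write,
    mut len: u64,
) -> io::Result<()> {
    let mut old_buf = [0u8; 4096];
    let mut delta_buf = [0u8; 4096];
    while len > 0 {
        let n = len.min(4096) as usize;
        older.read_exact(&mut old_buf[..n])?;
        patch.read_exact(&mut delta_buf[..n])?;
        for i in 0..n {
            old_buf[i] = old_buf[i].wrapping_add(delta_buf[i]);
        }
        out.write_all(&old_buf[..n])?;
        len -= n as u64;
    }
    Ok(())
}
```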


Prior art: bsdiff

bidiff is based on the bsdiff algorithm as described in Naïve Differences of Executable Code (2003) by Colin Percival, but the implementation differs in a few key areas.

The key idea behind bsdiff is to find approximate matches between an "older" and a "newer" file, and then store the difference between them, which ends up having lower entropy than the newer data itself. After compression, the resulting patch file is usually much smaller than the newer file would be if compressed.
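
A sketch of that core trick (illustrative, not bidiff's code): the wrapping byte-wise difference of two near-identical regions is mostly zeros, which any compressor handles extremely well.

```rust
/// Byte-wise wrapping difference between a region of "newer" and an
/// approximately matching region of "older".
fn delta(older: &[u8], newer: &[u8]) -> Vec<u8> {
    newer
        .iter()
        .zip(older)
        .map(|(n, o)| n.wrapping_sub(*o))
        .collect()
}

// delta(b"hello world", b"hello_world") is zero in every position but one.
```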

The algorithm works in two phases:

  • Sorting, in which all suffixes of the "older" file are sorted lexicographically, and their starting positions are stored in a suffix array. This is done using a SACA (Suffix Array Construction Algorithm).
  • Scanning, in which the suffix array is used to find approximate matches. The most promising approximate matches are selected using a simple scoring system. (Both phases are sketched just below.)
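
The naive suffix sort below is quadratic and for illustration only (real SACAs such as divsufsort are near-linear); longest_match shows how the sorted array turns match-finding into binary search. Both functions are sketches, not bidiff's code.

```rust
/// Naive suffix sort, for illustration only; real SACAs such as
/// divsufsort run in (near-)linear time.
fn suffix_array(text: &[u8]) -> Vec<usize> {
    let mut sa: Vec<usize> = (0..text.len()).collect();
    sa.sort_by(|&a, &b| text[a..].cmp(&text[b..]));
    sa
}

/// Find the suffix of `older` sharing the longest common prefix with
/// `needle`, via binary search (`partition_point`) over the suffix array.
/// Returns (position in older, match length).
fn longest_match(older: &[u8], sa: &[usize], needle: &[u8]) -> (usize, usize) {
    let idx = sa.partition_point(|&s| older[s..] < *needle);
    // The best match borders the insertion point: check both neighbours.
    [idx.wrapping_sub(1), idx]
        .iter()
        .filter_map(|&i| sa.get(i))
        .map(|&s| (s, common_prefix(&older[s..], needle)))
        .max_by_key(|&(_, len)| len)
        .unwrap_or((0, 0))
}

fn common_prefix(a: &[u8], b: &[u8]) -> usize {
    a.iter().zip(b).take_while(|(x, y)| x == y).count()
}
```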

The original bsdiff implementation tends to be quite time and memory-hungry.

In particular:

  • It requires max(17*n, 9*n+m)+O(1) bytes of memory (where n and m are the sizes of the older and newer files, respectively)
  • It stores "control", "difference" and "extra" data separately for "better compression"
  • It uses bzip2 as a compression method.
  • It uses a rather old SACA, described in Faster Suffix Sorting (1999) by Larsson and Sadakane.

What makes bidiff different

bidiff generates patches of similar size compared to bsdiff, but is generally faster and requires less memory, thanks to the following changes.

bidiff does suffix sorting using a hand-ported Rust version of divsufsort by Yuta Mori, a highly-optimized suffix sorting algorithm that has been repeatedly mentioned in research, despite the emergence of newer sorting methods.

Whilst not "the latest" in suffix sorting, it performs surprisingly well in practice (in near-linear time) and has modest memory requirements of O(5*n) (where n is the size of the "older" file).

Switching to divsufsort alone is already a significant improvement on the original bsdiff in terms of memory usage and time efficiency, but bidiff does not stop there.

For large file support, bidiff does not use the divsufsort64 variant, which would require O(9*n) memory (plus a larger constant). Instead, it introduces "partitioned suffix arrays". The "older" file is split into multiple partitions, each suffix sorted separately.

This achieves two things:

  • The maximum input size is no longer limited to 2**31-1 bytes (a limit that stems from using i32 as the index type)
  • Suffix sorting can be performed in parallel

In practice, suffix sorting in bidiff is performed in parallel using rayon, as long as the number of partitions is greater than one.

Note: switching to u32 as an index type to raise the original limit from 2GiB to 4GiB is not an option. divsufsort uses negative numbers abundantly while sorting, so the only options are i32 and i64.
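
The idea looks roughly like this (a sketch: suffix_sort is a naive stand-in for a real SACA such as divsufsort):

```rust
use rayon::prelude::*;

/// Naive stand-in for a real SACA (this is where divsufsort would go).
/// `i32` indices cap each partition at 2**31-1 bytes, but since every
/// partition is indexed independently, the whole file can be larger.
fn suffix_sort(text: &[u8]) -> Vec<i32> {
    let mut sa: Vec<i32> = (0..text.len() as i32).collect();
    sa.sort_by(|&a, &b| text[a as usize..].cmp(&text[b as usize..]));
    sa
}

/// Partitioned suffix sort: split "older" into `p` partitions and sort
/// each one independently, in parallel via rayon.
fn partitioned_suffix_arrays(older: &[u8], p: usize) -> Vec<Vec<i32>> {
    let part_len = ((older.len() + p - 1) / p).max(1);
    older.par_chunks(part_len).map(suffix_sort).collect()
}
```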

Using partitioned suffix arrays has a downside: the cost of searching for the "longest matching substring" in the original text is no longer O(log n), but O(p*log n), where p is the number of partitions.

Since that operation is performed a great number of times while diffing, using a large number of partitions slows the scanning phase more than it speeds up the sorting phase.

For a sample pair of 20 MiB Windows executable files:

partitions  sort time   scan time   total diff time
1           1.2965049s  726.9464ms  2.031168s
2           739.773ms   1.0506574s  1.798709s
3           480.5651ms  1.4029111s  1.8912619s
4           369.3306ms  1.7776965s  2.1549498s
5           302.6711ms  2.3324006s  2.6430686s
6           280.6257ms  2.69022s    2.9789148s

Another downside is that, due to partitioning, slightly worse matches are occasionally found for substrings that cross the border between two partitions. In practice, this affects patch size only very slightly: the rest of the match is found immediately thereafter, resulting in two "ADD" operations rather than one, which still compresses nicely.

Seeing these results, one may consider performing binary searches in each suffix array in parallel. However, experimentation has shown that, since those are very short operations repeated many times, the cost of synchronization overshadows any gains one might have hoped for.

bidiff addresses this another way, by parallelizing the scanning phase. The "newer" file is split into "chunks", each of which is diffed in parallel (by as many workers as is efficient given the number of cores available). The result is a series of approximate matches, which are reordered and translated into a patch file by another worker.
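
In sketch form (hypothetical Match and scan_chunk types, standing in for bidiff's real match finder and scoring):

```rust
use rayon::prelude::*;

/// An approximate match of a region of "newer" against "older".
struct Match {
    newer_offset: usize,
    older_offset: usize,
    len: usize,
}

/// Stand-in for the real per-chunk match finder (suffix-array lookups
/// plus bsdiff-style scoring would happen here).
fn scan_chunk(_chunk: &[u8], _newer_base: usize) -> Vec<Match> {
    Vec::new()
}

/// Diff each chunk of "newer" in parallel, then stitch the per-chunk
/// results back together in file order.
fn parallel_scan(newer: &[u8], chunk_size: usize) -> Vec<Match> {
    let per_chunk: Vec<Vec<Match>> = newer
        .par_chunks(chunk_size)
        .enumerate()
        .map(|(i, chunk)| scan_chunk(chunk, i * chunk_size))
        .collect();
    per_chunk.into_iter().flatten().collect()
}
```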

Note that dividing the newer file into chunks also contributes to generating very slightly worse patch files: cross-chunk matches will result in two ADD operations where there would have been only one. However, in practice, the difference is negligible.

This technique results in notable speedups.

For the same inputs as before:

partitions  chunk size  sort time   scan time   total time
1           32 MiB      1.2896446s  669.5845ms  1.967119s
1           16 MiB      1.2831451s  501.6626ms  1.7926627s
1           8 MiB       1.2936768s  397.7302ms  1.6991965s
1           4 MiB       1.2878612s  327.009ms   1.622483s
1           2 MiB       1.2888983s  225.2414ms  1.5213945s
1           1 MiB       1.2881292s  176.2079ms  1.4714389s

When parallelizing scanning, sort time begins to dominate. Using a combination of both techniques allows optimizing for core utilization and overall runtime:

partitions  chunk size  sort time   scan time   total diff time
1           32 MiB      1.3027385s  665.7531ms  1.9764095s
1           4 MiB       1.2900258s  352.3892ms  1.6501501s
1           512 KiB     1.3043592s  143.3714ms  1.4548654s
2           32 MiB      706.5673ms  1.0025483s  1.7168672s
2           4 MiB       723.699ms   471.5766ms  1.2028935s
2           512 KiB     703.7808ms  201.0271ms  911.9539ms
4           32 MiB      358.5569ms  1.7222684s  2.0884749s
4           4 MiB       356.8114ms  828.6099ms  1.1930674s
4           512 KiB     356.1849ms  330.2086ms  693.5397ms
6           32 MiB      273.307ms   2.582831s   2.8640321s
6           4 MiB       264.5068ms  1.1823795s  1.4542472s
6           512 KiB     240.1568ms  444.4659ms  691.7743ms

These are one-off measurements on relatively small files (20 MiB), but they are representative of the tradeoff at play here: performance varies similarly for larger files (and various kinds of files).

In this case, the overall speedup achieved is almost 3x, on a 6-core machine. The optimal parameters for "partitions" and "chunk size" depend on the machine used to diff, but are generally along the lines of:

  • partitions = (num_cores - 1) (this leaves a core for bookkeeping and compression)
  • chunk size = newer_size / (num_cores * k), where k is between 2 and 4 (see the sketch after this list)
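
As a starting point, those guidelines translate into something like this (a hypothetical helper, not part of bidiff's API):

```rust
/// Heuristic starting parameters for diffing, per the guidelines above.
/// `k` trades scan-phase load balancing against per-chunk overhead.
fn diff_params(newer_size: u64, num_cores: u64, k: u64) -> (u64, u64) {
    // Leave one core free for bookkeeping and compression.
    let partitions = num_cores.saturating_sub(1).max(1);
    let chunk_size = (newer_size / (num_cores * k)).max(1);
    (partitions, chunk_size)
}

// On a 6-core machine with k = 3 and a 192 MiB "newer" file:
// diff_params(192 << 20, 6, 3) == (5, ~10.7 MiB)
```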

Note: efforts to parallelize divsufsort and still have a single suffix array have been attempted. However, the speedups are rather disappointing - the parallel variant needs 4 to 8 cores before it starts beating the sequential version. See for yourself.

Choosing a chunk size that's too large will result in suboptimal core utilization, whereas choosing a chunk size that's too small will result in increased memory usage for diminishing returns:

partitions  chunk size  sort time   scan time   total diff time
6           512 KiB     246.8454ms  460.2141ms  714.4462ms
6           128 KiB     240.9415ms  408.1361ms  656.3144ms
6           64 KiB      243.5352ms  426.3133ms  676.7965ms
6           16 KiB      253.8723ms  420.6165ms  681.7169ms
6           4 KiB       257.9093ms  424.8433ms  689.8088ms
6           2 KiB       254.1255ms  454.6246ms  715.69ms
6           1 KiB       250.7823ms  497.2212ms  755.1542ms
