yanked natural-xml-diff

Natural diffing between XML documents

2 unstable releases

0.2.0 Jan 3, 2023
0.1.0 Dec 19, 2022

MIT license

natural-xml-diff

The natural-xml-diff crate implements a diffing algorithm that attempts to produce correct, human-readable differences between two XML documents.

API Documentation

Algorithm

The algorithm implemented by this library is based on the paper "Bridging the gap between tracking and detecting changes on XML". It is also implemented by the Java-based jndiff library.

Work in progress

This is still a work in progress!

Credits

Paligo

Structural diffing

Let's consider the following XML document, taken from the "Bridging the Gap" paper:

<?xml version="1.0"?>
<book>
  <chapter>
    <title>Text 1</title>
    <para>Text 2</para>
  </chapter>
  <chapter>
    <title>Text 4</title>
    <para>Text 5</para>
  </chapter>
  <chapter>
    <title>Text 6</title>
    <para>Text 7<img/>Text 8</para>
  </chapter>
  <chapter>
    <title>Text 9</title>
    <para>Text 10</para>
  </chapter>
  <chapter>
    <para>Text 11</para>
    <para>Text 12</para>
  </chapter>
</book>

We'll call that "document A", the "before" of the diffing. Here's the "after", "document B":

<?xml version="1.0"?>
<book>
  <chapter>
    <para>Text 2</para>
  </chapter>
  <chapter>
    <title>Text 4</title>
    <para>Text 25</para>
    <para>Text 11</para>
  </chapter>
  <chapter>
    <title>Text 6</title>
    <para>Text 7<img/>Text 8</para>
  </chapter>
  <chapter>
    <title>Text 9</title>
    <para>Text 10</para>
  </chapter>
  <chapter>
    <para>Text 12</para>
  </chapter>
</book>

Let's present both as trees with numbered nodes (the root node, 0, is not shown). Here's document A:

graph TD;
    1[1 book]-->2
	  2[2 chapter]-->3
	  2-->5
	  3[3 title]-->4
	  4[4 Text 1]
	  5[5 para]-->6
	  6[6 Text 2]
	  1-->7
	  7[7 chapter] --> 8
	  8[8 title] --> 9
	  9[9 Text 4]
	  7 --> 10
	  10[10 para] --> 11
	  11[11 Text 5]
	  1 --> 12
	  12[12 chapter] --> 13
	  13[13 title] --> 14
	  14[14 Text 6]
	  12-->15
	  15[15 para] --> 16
	  15 --> 17
	  15 --> 18
	  16[16 Text 7]
	  17[17 img]
	  18[18 Text 8]
	  1 --> 19
	  19[19 chapter]
	  19 --> 20
	  20[20 title] --> 21
	  21[21 Text 9]
	  19 --> 22
	  22[22 para] --> 23
	  23[23 Text 10]
	  1 --> 24
	  24[24 chapter] --> 25
	  25[25 para] --> 26
	  26[26 Text 11]
	  24 --> 27
	  27[27 para] --> 28
	  28[28 Text 12]
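
The numbering in the tree above follows document order: each node gets the next number when it is first visited, before its children. Here is a minimal sketch of that scheme. The `Node` type and the `el`, `leaf`, `labels` and `document_a` helpers are hypothetical names made up for this illustration and are not the crate's API. Whitespace-only text between elements is assumed to be ignored.

```rust
// A tiny XML-like tree. Hypothetical type for illustration only,
// not the data structure used by natural-xml-diff.
enum Node {
    Element(&'static str, Vec<Node>),
    Text(&'static str),
}

// Visit nodes in document (pre-)order, handing out consecutive numbers.
fn number_nodes(node: &Node, next: &mut usize, out: &mut Vec<String>) {
    let id = *next;
    *next += 1;
    match node {
        Node::Element(name, children) => {
            out.push(format!("{id} {name}"));
            for child in children {
                number_nodes(child, next, out);
            }
        }
        Node::Text(text) => out.push(format!("{id} {text}")),
    }
}

// Produce one "number label" string per node.
fn labels(root: &Node) -> Vec<String> {
    let mut out = Vec::new();
    let mut next = 1; // 0 is the document root, which is not shown
    number_nodes(root, &mut next, &mut out);
    out
}

fn el(name: &'static str, children: Vec<Node>) -> Node {
    Node::Element(name, children)
}

// An element that contains a single text node.
fn leaf(name: &'static str, text: &'static str) -> Node {
    el(name, vec![Node::Text(text)])
}

// Document A from above, without whitespace-only text nodes.
fn document_a() -> Node {
    el("book", vec![
        el("chapter", vec![leaf("title", "Text 1"), leaf("para", "Text 2")]),
        el("chapter", vec![leaf("title", "Text 4"), leaf("para", "Text 5")]),
        el("chapter", vec![
            leaf("title", "Text 6"),
            el("para", vec![
                Node::Text("Text 7"),
                el("img", vec![]),
                Node::Text("Text 8"),
            ]),
        ]),
        el("chapter", vec![leaf("title", "Text 9"), leaf("para", "Text 10")]),
        el("chapter", vec![leaf("para", "Text 11"), leaf("para", "Text 12")]),
    ])
}

fn main() {
    for label in labels(&document_a()) {
        println!("{label}");
    }
}
```

Document A has 28 nodes under this scheme, so the last label is `28 Text 12`.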

Maintaining the tests

Some tests use test_generator to generate test cases from the files in the testdata directory. New files in that directory aren't picked up automatically, however: you have to force a recompile of the .rs files that run the tests. One way to do this is to make an insignificant whitespace edit in each .rs file that uses test_generator and save it. I hope there's a better solution.
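
One possible alternative is a Cargo build script that declares the testdata directory as a rebuild trigger. This is a sketch only: it is not taken from this repository and has not been tried against it. `cargo:rerun-if-changed` is a standard Cargo build-script directive, and Cargo watches a directory path for changes to the files inside it. Whether this is enough for test_generator to pick up new files here is an assumption. The `rerun_directive` helper is a hypothetical name.

```rust
// build.rs (at the crate root, next to Cargo.toml)

// Format the Cargo directive that asks for a rebuild when `path` changes.
// Hypothetical helper, kept separate from main so it can be checked on its own.
fn rerun_directive(path: &str) -> String {
    format!("cargo:rerun-if-changed={path}")
}

fn main() {
    // Assumption: the test data lives in `testdata` relative to the crate root.
    println!("{}", rerun_directive("testdata"));
}
```

If it works as hoped, adding a file under testdata should cause the package to be rebuilt on the next `cargo test`, without touching the .rs files by hand.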
