#version-control #control-system #mathematica #parser #cli-parser

app mathematica-notebook-filter

mathematica-notebook-filter parses Mathematica notebook files and strips them of superfluous information so that they can be committed into version control systems more easily

9 releases

0.2.2 Feb 20, 2019
0.2.0 Jun 13, 2018
0.1.6 Apr 4, 2018
0.1.5 Feb 23, 2018
0.1.2 Sep 14, 2017

#805 in Text processing

29 downloads per month

GPL-3.0 license

115KB
2K SLoC

Rust 1K SLoC // 0.1% comments Wolfram 634 SLoC // 0.2% comments Shell 110 SLoC // 0.1% comments

mathematica-notebook-filter

mathematica-notebook-filter is a program written in Rust that parses Mathematica notebook files and strips them of superfluous information so that they can be committed into version control systems more easily. Instructions to integrate this program into version control systems can be found below and can be set up so that this is all done transparently without modifying the file on disk.

Crates.io Travis Codecov

Licensed under GPLv3.

This program has not been rigorously tested. It works for me on all my Notebooks, but there may still be some situations which have not been accounted for. If you use this program, please let me know (both good and bad feedback).

Introduction

Version control systems (such as git and mercurial among many others) provide a fantastic way to keep track of changes to files in such a way that multiple people can collaborate on them without accidentally overwriting other people's changes. Version control systems primarily keep track of source code and if two people change the same file, it is possible to compare the two files side-by-side so that the changes can be merged.

Although binary files (such as compiled outputs, images, PDFs, ...) can be included in a version control system too, it is generally not possible or meaningful to compare two sets of changes to one binary file. As a result, binary files are quite opaque to version control systems and it is inadvisable to store binary files in a version control system if they will be changed frequently.

This is specifically an issue for Mathematica notebooks as they store both inputs and outputs in the same file. A quite typical example of this is the simple input:

Plot[Sin[x] / x, {x, -4 Pi, 4 Pi}]

which, when plotted, is stored in the Notebook file as:

GraphicsBox[{{{{}, {},
  TagBox[
       {RGBColor[0.368417, 0.506779, 0.709798], AbsoluteThickness[1.6],
        Opacity[1.], LineBox[CompressedData["<omitted>"]],
        LineBox[CompressedData["<omitted>"]]},
       Annotation[#,
        "Charting`Private`Tag$5185#1"]& ], {}}, {{}, {}, {}}}, {}, {}},
   AspectRatio->NCache[GoldenRatio^(-1), 0.6180339887498948],
   Axes->{True, True},
   AxesLabel->{None, None},
   AxesOrigin->{0, 0},
   DisplayFunction->Identity,
   Frame->{{False, False}, {False, False}},
   FrameLabel->{{None, None}, {None, None}},
   FrameTicks->{{Automatic,
      Charting`ScaledFrameTicks[{Identity, Identity}]}, {Automatic,
      Charting`ScaledFrameTicks[{Identity, Identity}]}},
   GridLines->{None, None},
   GridLinesStyle->Directive[
     GrayLevel[0.5, 0.4]],
   ImagePadding->All,
   Method->{
    "DefaultBoundaryStyle" -> Automatic, "DefaultMeshStyle" ->
     AbsolutePointSize[6], "ScalingFunctions" -> None,
     "CoordinatesToolOptions" -> {"DisplayFunction" -> ({
         (Identity[#]& )[
          Part[#, 1]],
         (Identity[#]& )[
          Part[#, 2]]}& ), "CopiedValueFunction" -> ({
         (Identity[#]& )[
          Part[#, 1]],
         (Identity[#]& )[
          Part[#, 2]]}& )}},
   PlotRange->
    NCache[{{(-4) Pi, 4 Pi}, {-0.21723358083481298`,
      0.9999892952885239}}, {{-12.566370614359172`,
     12.566370614359172`}, {-0.21723358083481298`, 0.9999892952885239}}],
   PlotRangeClipping->True,
   PlotRangePadding->{{
      Scaled[0.02],
      Scaled[0.02]}, {
      Scaled[0.05],
      Scaled[0.05]}},
   Ticks->{Automatic, Automatic}]

Note that the above snippet was significantly abbreviated as the compressed base-64 encoded data is an additional 300 lines or so.

For the version control system, this large output is extremely cumbersome as a small change in the input (such as replacing Sin[x] with Sin[2 x]) will produce a 300+ line diff. The purpose of mathematica-notebook-filter is specifically to avoid such large diffs and try and make them much more meaningful. It does so by parsing the Mathematica notebook file format and removing all the output cells and metadata. The program is implemented in Rust and distributed on crates.io.

Having said that, it should be noted that Mathematica unfortunately does not store the input in a very simple form as it not only stores the plain Mathematica expression, but also stores formatting information. As a concrete example, an input cell with the above plot function will be stored in the Notebook file as:

Cell[BoxData[
 RowBox[{"Plot", "[",
  RowBox[{
   FractionBox[
    RowBox[{"Sin", "[", "x", "]"}], "x"], ",",
   RowBox[{"{",
    RowBox[{"x", ",",
     RowBox[{
      RowBox[{"-", "4"}], "Pi"}], ",",
     RowBox[{"4", "Pi"}]}], "}"}]}], "]"}]], "Input"]

The change of Sin[x] to Sin[2 x] results in the cell now being stored as:

Cell[BoxData[
 RowBox[{"Plot", "[",
  RowBox[{
   FractionBox[
    RowBox[{"Sin", "[",
     RowBox[{"2", "x"}], "]"}], "x"], ",",
   RowBox[{"{",
    RowBox[{"x", ",",
     RowBox[{
      RowBox[{"-", "4"}], "Pi"}], ",",
     RowBox[{"4", "Pi"}]}], "}"}]}], "]"}]], "Input"]

This program, at least at this stage, will not strip the extra formatting information. If you wish to avoid the above, then you should save your notebooks as scripts files (with extension .wl or .m).

Usage Notes

mathematica-notebook-filter parses Mathematica notebook files (usually stored with the extension .nb) and strips all generated outputs and other metadata. By default, the program reads from standard input and outputs to standard output. Additional usage information can be obtained from

mathematica-notebook-filter --help

Although it is possible to use mathematica-notebook-filter manually, it is designed to be integrated with version control systems (see below for instructions) such that Notebooks are first piped through the filter before the diffs are generated. This is specifically designed so that original file is left untouched with all outputs and metadata remaining, and the filter effectively makes the version control system blind to the extra content.

If you wish to run it manually, a simple call would be:

mathematica-notebook-filter -i my_notebook.nb -o my_notebook_cleaned.nb

If both input and output files are identical, the program will first output to a temporary file and only after successfully parsing the whole input will the original file be replaced.

This program does not parse the Wolfram language in general and is specific to full Mathematica notebooks; thus it makes some fairly strong assumptions about the functions that will be found and their order. It only parses a single Notebook at a time and will stop after the end of the first Notebook. If an error is encountered during the parsing, mathematica-notebook-filter will exit with a non-zero code and the output will be left incomplete.

It also should be re-iterated that the best way to commit Mathematica code to a version control system is to save the code in script files (.wl or .m). When doing so, Mathematica save the file in a very simple format (essentially a plain text file), without the superfluous formatting information and without outputs. This unfortunately has the disadvantage that the Notebook interface is not available.

Also note that Mathematica notebooks allow you to copy-paste graphics (such as generated plots) and use them as inputs. If you do so, the version control system will be forced to include the full plot in the diff, thereby defeating the point of mathematica-notebook-filter. An alternative to copy-pasting outputs is to store the output into a variable, or use % (and %%, %%%, ...) to refer to the previous output (though make sure to only use % within the one cell and not across cells as % refers to the last generated output, not the previous output in the Notebook order).

Installation

This program is written in Rust. Probably the easiest way to install Rust is to use the rustup.rs script. Once set up, it should simply be a matter of running

cargo install mathematica-notebook-filter

This will download, compile, and install mathematica-notebook-filter in your Cargo home direction (~/.cargo by default on Linux). Assuming you have correctly set up your PATH variable (which rustup.rs should have done automatically), then you can execute the program by typing mathematica-notebook-filter.

Integration

Git

It is possible to set attributes based on pattern globs. In this instance, we want to make sure that all *.nb files are processed by this filter before being committed. To globally set the attribute, add to ~/.gitattributes:

*.nb    filter=dropoutput_nb

and to your ~/.gitconfig:

[filter "dropoutput_nb"]
    clean = mathematica-notebook-filter
    smudge = cat

Other

Pull requests to add instructions for other version control system are welcome.

Disclaimer

The Wolfram Research organization unfortunately does not appear to offer any specification to their language or their file formats. As a result, this filter was entirely developed by inspecting outputs generated by Mathematica. Specifically, this was developed using Mathematica 11.1 and thus there is no guarantee that this filter will work with past or future version of the Notebook file format.

If you find a bug, please feel free to open an issue though please provide enough information to reproduce the bug or a minimal example of a Notebook file that causes the issue.

Contributing

Pull requests to improve compatibility with other versions (or to fix bugs) are very welcome. If you find a bug, please feel free to open an issue and make sure to provide enough information to reproduce the bug or a minimal example of a Notebook file that causes the issue.

Dependencies

~4–13MB
~159K SLoC