nbwipers
nbwipers is a command line tool to wipe clean jupyter notebooks, written in Rust.
The interface and functionality are based on nbstripout, and the idea to implement it in Rust comes from nbstripout-fast.
Usage
nbwipers has a few subcommands that provide functionality related to cleaning Jupyter notebooks.
- clean: clean a single notebook. This is more or less equivalent to nbstripout.
- check: check notebooks in a given path for elements that would be removed by clean. This could be used in a CI context to enforce clean notebooks.
- clean-all: clean all notebooks in a given path. This one should be used carefully!
- install: register nbwipers as a git filter for ipynb files. Equivalent to nbstripout --install.
- uninstall: remove nbwipers as a git filter.
- check-install: check that nbwipers or nbstripout is installed in the local repo. This is used in the pre-commit hook.
The full options can be found in CommandLineHelp.md.
Examples
To set up nbwipers as a git filter in your repository, use
nbwipers install local
If this step is performed on a pre-existing repo, you can touch your notebooks so that git can detect the changes.
In bash:
git ls-files -z '*.ipynb' | while IFS= read -r -d '' f; do touch "$f"; done
To check the notebooks in your folder, you can run the following
nbwipers check .
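To make concrete what a check of this kind looks for, here is a minimal Python sketch that reports code cells carrying outputs or execution counts. It is an illustration only and not how nbwipers is implemented: the function name find_dirty_cells is hypothetical, and it covers just those two kinds of element.

```python
import json


def find_dirty_cells(notebook_text):
    """Return (cell_index, reason) pairs for code cells that are not clean.

    Hypothetical helper for illustration; not part of nbwipers.
    """
    notebook = json.loads(notebook_text)
    problems = []
    for index, cell in enumerate(notebook.get("cells", [])):
        if cell.get("cell_type") != "code":
            continue  # markdown and raw cells have no outputs to strip
        if cell.get("outputs"):
            problems.append((index, "outputs"))
        if cell.get("execution_count") is not None:
            problems.append((index, "execution_count"))
    return problems


# A tiny in-memory notebook: one markdown cell, one executed code cell,
# and one code cell that is already clean.
example = json.dumps({
    "cells": [
        {"cell_type": "markdown", "metadata": {}, "source": ["# Title"]},
        {"cell_type": "code", "metadata": {}, "source": ["1 + 1"],
         "execution_count": 3,
         "outputs": [{"output_type": "execute_result",
                      "data": {"text/plain": ["2"]},
                      "metadata": {}, "execution_count": 3}]},
        {"cell_type": "code", "metadata": {}, "source": ["x = 1"],
         "execution_count": None, "outputs": []},
    ],
    "metadata": {}, "nbformat": 4, "nbformat_minor": 5,
})

print(find_dirty_cells(example))
```

In a CI job, a non-empty result would translate into a failing exit status.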
pre-commit
You can add the following to your .pre-commit-config.yaml file to ensure that nbwipers or nbstripout is installed in your repo, in order to prevent Jupyter notebook outputs from being committed to version control.
- repo: https://github.com/felixgwilliams/nbwipers-pre-commit
  rev: v0.3.4
  hooks:
    - id: nbwipers-check-install
Alternatively, you can use the URL for this repo in your config, but this will compile nbwipers from source, rather than retrieving the binary from PyPI, and is therefore not recommended.
If you are using your pre-commit configuration as part of CI, you should set the environment variable NBWIPERS_CHECK_INSTALL_EXIT_ZERO, which forces this check to pass, since you do not need nbwipers configured in your CI environment.
Motivation
A working copy of a Jupyter notebook contains:
- Code written by the author.
- Notebook outputs: tables, logs, tracebacks, images, widgets and so on...
- Execution counts.
- Metadata, such as whether cells are collapsed, scrollable etc.
Of these categories of data, only the first — code written by the author — should definitely be tracked by version control, since it is the product of the author's intention and hard work. The other categories of data are subject to change outside of the explicit intentions of the author and are generally noisy from a version control perspective.
Moreover, including notebook outputs in version control
- makes diffs harder to interpret, as they will contain lots of unintended changes.
- increases the risk of a tricky merge conflict if different users run the same cell and get a slightly different result.
- increases the amount of data committed, which can degrade repository performance.
- risks leaking sensitive data.
An effective way to ensure you do not commit problematic parts of your notebooks is to use nbwipers or nbstripout as a git filter.
A git filter sits between your actual files and what git sees when you stage and commit your changes. This way, git only sees the transformed version of the file without the problematic elements. At the same time, you do not have to lose them from your local copy.
An exception is when you check out a branch or do a git pull that results in changes to the notebook. In this case, your local copy will be replaced by the clean version and you will lose your cell outputs.
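The transformation a clean filter applies can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions, not the nbwipers implementation: the function name clean_notebook_text is hypothetical, and it only clears outputs and execution counts, leaving metadata untouched.

```python
import json


def clean_notebook_text(notebook_text):
    """Return notebook JSON with code-cell outputs and execution counts cleared.

    Hypothetical helper for illustration; not part of nbwipers.
    """
    notebook = json.loads(notebook_text)
    for cell in notebook.get("cells", []):
        if cell.get("cell_type") == "code":
            cell["outputs"] = []
            cell["execution_count"] = None
    # Notebooks are conventionally stored with one-space indentation.
    return json.dumps(notebook, indent=1, ensure_ascii=False) + "\n"


# The working copy keeps its output; the text handed to git does not.
dirty = json.dumps({
    "cells": [
        {"cell_type": "code", "metadata": {}, "source": ["print('hi')"],
         "execution_count": 7,
         "outputs": [{"output_type": "stream", "name": "stdout",
                      "text": ["hi\n"]}]},
    ],
    "metadata": {}, "nbformat": 4, "nbformat_minor": 5,
})
clean = clean_notebook_text(dirty)
print(clean)
```

A real filter would read the working-copy file from standard input and write the cleaned text to standard output, which is how git invokes clean filters. Cleaning already-clean text returns it unchanged, so staging the same notebook twice produces no spurious diff.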
Configuration
Configuration is read from a TOML file in your project. Most of the command line options can be set per-project in the pyproject.toml, nbwipers.toml or .nbwipers.toml file.
If you use pyproject.toml, you need to put the configuration under [tool.nbwipers].
If you use nbwipers.toml or .nbwipers.toml, the configuration needs to be at the top level.
For example, you can use extra-keys to specify additional notebook elements you want to ignore.
If you don't need the python version or the details about the Jupyter Kernel, you can include the following in your pyproject.toml file:
[tool.nbwipers]
extra-keys = ["metadata.kernelspec", "metadata.language_info.version"]
The equivalent for nbwipers.toml or .nbwipers.toml is just
extra-keys = ["metadata.kernelspec", "metadata.language_info.version"]
This can be useful when collaborating, as the precise python version and the name assigned to the kernel are ephemeral and can change from person to person.
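As a rough illustration of what the two extra-keys entries above do to a notebook, the Python sketch below deletes dotted paths from a notebook loaded as a dictionary. It assumes each entry is a dot-separated path starting at the notebook root; the helper remove_dotted_key is hypothetical, does not handle cell-level keys, and is not taken from nbwipers.

```python
def remove_dotted_key(mapping, dotted_key):
    """Delete the entry at a dotted path such as 'metadata.kernelspec'.

    Paths that do not exist are ignored. Returns True if something was removed.
    Hypothetical helper for illustration; not part of nbwipers.
    """
    *parents, leaf = dotted_key.split(".")
    current = mapping
    for name in parents:
        current = current.get(name)
        if not isinstance(current, dict):
            return False  # the path does not exist in this notebook
    if leaf in current:
        del current[leaf]
        return True
    return False


notebook = {
    "cells": [],
    "metadata": {
        "kernelspec": {"display_name": "my-env", "language": "python",
                       "name": "my-env"},
        "language_info": {"name": "python", "version": "3.11.4"},
    },
    "nbformat": 4,
    "nbformat_minor": 5,
}

for key in ["metadata.kernelspec", "metadata.language_info.version"]:
    remove_dotted_key(notebook, key)

print(notebook["metadata"])  # {'language_info': {'name': 'python'}}
```

Only the named leaves are removed: metadata.language_info.name survives because just the version entry was listed.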
Testing Coverage
To test coverage, use the command:
cargo tarpaulin -o stdout -o html -o lcov --engine llvm
Using the llvm engine means that integration tests contribute to coverage.
Acknowledgements
nbwipers relies on inspiration and code from several projects.
For the projects whose code was used, please see LICENSE for the third-party notices.
nbstripout
strip output from Jupyter and IPython notebooks
nbstripout is an invaluable tool for working with Jupyter Notebooks in the context of version control. This project forms the basis of the interface and logic of this project and is also the source of the testing examples.
nbstripout-fast
A much faster version of nbstripout by writing it in rust (of course).
nbstripout-fast, like this project, implements the functionality of nbstripout in Rust, while also allowing repo-level configuration in a YAML file.
With nbwipers I hoped to recreate the idea of nbstripout-fast, but with the ability to install as a git filter, and configuration via pyproject.toml.
ruff
An extremely fast Python linter and code formatter, written in Rust.
Ruff is quickly becoming the linter for python code, thanks to its performance, extensive set of rules and its ease of use. It was a definite source of knowledge for the organisation of the configuration and the file discovery. The schema for Jupyter Notebooks and some of the file discovery code were adapted from Ruff.
pre-commit
A framework for managing and maintaining multi-language pre-commit hooks.
This repo contains a version of the check-large-files hook that will not flag notebook files whose clean size is less than the threshold, even if the size on-disk including outputs is greater than the threshold. The logic and interface of the hook were adapted from the pre-commit-hooks repository.
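The idea of comparing the clean size rather than the on-disk size can be sketched in Python as follows. This is an assumption-laden illustration, not the hook's actual code: the function names are hypothetical, the threshold is assumed to be given in kilobytes, and "clean" here means only that outputs and execution counts are cleared.

```python
import json


def clean_size_bytes(notebook_text):
    """Size in bytes of the notebook once outputs and execution counts are cleared.

    Hypothetical helper for illustration; not part of nbwipers.
    """
    notebook = json.loads(notebook_text)
    for cell in notebook.get("cells", []):
        if cell.get("cell_type") == "code":
            cell["outputs"] = []
            cell["execution_count"] = None
    return len(json.dumps(notebook, indent=1).encode("utf-8"))


def exceeds_threshold(notebook_text, max_kb):
    """True if the cleaned notebook is larger than max_kb kilobytes."""
    return clean_size_bytes(notebook_text) > max_kb * 1024


# A notebook whose bulk is a single large output: big on disk, small once clean.
raw = json.dumps({
    "cells": [
        {"cell_type": "code", "metadata": {}, "source": ["show_plot()"],
         "execution_count": 1,
         "outputs": [{"output_type": "stream", "name": "stdout",
                      "text": ["x" * 200_000]}]},
    ],
    "metadata": {}, "nbformat": 4, "nbformat_minor": 5,
})

print(len(raw.encode("utf-8")) > 100 * 1024)  # on-disk size is over 100 KB
print(exceeds_threshold(raw, max_kb=100))     # clean size is not
```

A notebook whose source code alone is larger than the threshold would still be flagged, since the source survives cleaning.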