app stromatekt

A parallelized OCI builder, on top of runc and btrfs

1 unstable release

0.0.0 Sep 28, 2023

AGPL-3.0-or-later

165KB
3.5K SLoC

A re-imagined OCI image builder.

  • Take advantage of the native snapshot/diff/overlay functionality of filesystems. Cheap calculation of multiple changesets/layers in a build history enables more granular layers.
  • Parallel builds based on a dataflow graph.
  • Selectively add, remove, mix-and-match arbitrary base layers below the current build task. Forget about amalgamation images to support your mixed toolchains; instead, apply tools from multiple pre-built images one after another.
  • Define custom image manifests. Unlocked via flexible build tool layers, manifest files are built by the configuration via 'just another' step in a task. Select layers with your own code logic, cross-build multi-platform images to your heart's content, and more.
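
To illustrate what "parallel builds based on a dataflow graph" can mean in practice, here is a minimal sketch that groups tasks into waves whose members have no unmet dependencies and could therefore run concurrently. The function name `parallel_waves` and the task names are hypothetical, and this is not necessarily how stromatekt schedules its work.

```rust
use std::collections::BTreeMap;

/// Group tasks into "waves": every task in a wave depends only on tasks
/// from earlier waves, so all tasks of one wave could run concurrently.
/// `deps` maps each task name to the names of the tasks it depends on.
/// Returns `None` on a cycle or on a dependency that is not defined.
/// (Hypothetical helper for illustration, not part of stromatekt.)
pub fn parallel_waves<'a>(
    deps: &BTreeMap<&'a str, Vec<&'a str>>,
) -> Option<Vec<Vec<String>>> {
    // Reject dependencies on tasks that were never declared.
    if deps.values().flatten().any(|d| !deps.contains_key(d)) {
        return None;
    }
    let mut done: Vec<&str> = Vec::new();
    let mut waves = Vec::new();
    while done.len() < deps.len() {
        // A task is ready once it is not yet done and all of its
        // dependencies are done.
        let ready: Vec<&str> = deps
            .iter()
            .filter(|(name, d)| {
                !done.contains(*name) && d.iter().all(|x| done.contains(x))
            })
            .map(|(name, _)| *name)
            .collect();
        if ready.is_empty() {
            return None; // remaining tasks form a cycle
        }
        waves.push(ready.iter().map(|s| s.to_string()).collect());
        done.extend(ready);
    }
    Some(waves)
}
```

For a graph where `proc0` and `proc1` both depend on `base`, and `manifest` depends on both, the waves are `[base]`, `[proc0, proc1]`, `[manifest]`.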

How to use

  1. You will need a fresh Btrfs subvolume, mounted and owned by your current user. Additionally, the kernel must be compiled with support for user namespaces (userns), and unprivileged_userns_clone should be enabled.

    # mount -t btrfs -o rw,space_cache,user_subvol_rm_allowed,noacl,noatime,subvol=/stromatekt /dev/sdx /home/stromatekt
    btrfs filesystem df /home/stromatekt
    grep 1 /proc/sys/kernel/unprivileged_userns_clone
    zcat /proc/config.gz | grep CONFIG_USER_NS=y
    
  2. Create ~/.config/stromatekt/config.json with the path to the subvolume mount adjusted accordingly. It should look similar to:

    {
    	"btrfs_root": "/home/stromatekt"
    }
    
  3. Prepare the example binary:

    pushd examples/prime && cargo build --release && popd
    
  4. Execute the example build:

    cargo run -- ./examples/parallel-dependency.json --no-dry-run
    

Motivation

docker build is slow. The structure of a Dockerfile only permits a linear sequence of instructions. docker compose is slower still: it repeatedly sends, unpacks, and repacks image layers and the local file system, which can take a significant amount of time. The author has observed builds taking more than 4 minutes for a Dockerfile containing a single line that adds one link to the file system. This is unacceptable as development latency. Further, layer caching is inherently poor due to the linear sequence logic. Let's address both.

Structure of an OCI file

The main data within an OCI image is an ordered collection of layers. Each layer is essentially a diff against the previous one, usually in the form of a tar archive. (For slightly surprising reasons, a deletion is encoded as a file with special naming rules.)
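
As a small illustration of those naming rules: the OCI image layer format records a deletion as an empty "whiteout" file whose name is the deleted entry's name prefixed with `.wh.`. A minimal sketch of that mapping follows; the helper name `whiteout_for` is made up for this example and is not taken from stromatekt.

```rust
use std::path::{Path, PathBuf};

/// Path of the whiteout marker that records the deletion of `deleted`
/// inside a layer archive. The marker sits in the same directory and
/// carries the original file name prefixed with `.wh.`.
/// Returns `None` if the path has no (UTF-8) file name, e.g. `/`.
/// (Hypothetical helper for illustration only.)
pub fn whiteout_for(deleted: &Path) -> Option<PathBuf> {
    let name = deleted.file_name()?.to_str()?;
    Some(deleted.with_file_name(format!(".wh.{}", name)))
}
```

Deleting `etc/old.conf` in a layer would thus be represented by an empty file at `etc/.wh.old.conf`.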

When running a build, the builder will check out the layers of the underlying container, run its commands, and finally compute the diff to encode as a new layer. The two highly expensive filesystem tasks, checkout and diff, can be implemented much more efficiently if we utilize the checkpoint and incremental diff logic of the filesystem itself.
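
On Btrfs, one way to get both operations cheaply is `btrfs subvolume snapshot` for the checkout and `btrfs send -p` for an incremental diff between two read-only snapshots. The sketch below only constructs those command lines and does not run them. The helper names and example paths are assumptions, and whether stromatekt invokes the `btrfs` CLI this way (or at all) is not stated here.

```rust
use std::path::Path;
use std::process::Command;

/// Build `btrfs subvolume snapshot [-r] <source> <dest>`.
/// A writable snapshot serves as a cheap "checkout" of a base layer;
/// a read-only one (`-r`) is what `btrfs send` requires as input.
/// (Hypothetical helper; the command is constructed, not executed.)
pub fn snapshot_cmd(source: &Path, dest: &Path, read_only: bool) -> Command {
    let mut cmd = Command::new("btrfs");
    cmd.args(["subvolume", "snapshot"]);
    if read_only {
        cmd.arg("-r");
    }
    cmd.arg(source).arg(dest);
    cmd
}

/// Build `btrfs send -p <parent> <snapshot>`, which writes a stream
/// describing the changes from `parent` to `snapshot`. Both must be
/// read-only snapshots.
/// (Hypothetical helper; the command is constructed, not executed.)
pub fn incremental_send_cmd(parent: &Path, snapshot: &Path) -> Command {
    let mut cmd = Command::new("btrfs");
    cmd.arg("send").arg("-p").arg(parent).arg(snapshot);
    cmd
}
```

A build step would then take a writable snapshot of the base, run its commands inside it, freeze the result with a read-only snapshot, and derive the layer contents from the incremental stream. Note that the send stream is not itself a tar archive, so a conversion step would still be needed to produce an OCI layer.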

Furthermore, this task is probably IO-bound, so we should seek to perform as much of it in parallel as possible. Note that the layer sequence of an OCI image is not commutative. However, as long as the task definition itself opts in by providing a canonical recombination order, there shouldn't be any reproducibility problem from creating layers in a different order.

Example:

  • A --(proc0)-> B0 yielding diff C0
  • A --(proc1)-> B1 yielding diff C1
  • => export layers as: [A, C0, C1]

Actually, we could even allow swapping A for a totally unrelated A* as long as the build manifest makes this explicit, for instance to provide a security patch of an underlying layer. Also, proc0 and proc1 can be executed with entirely different underlying technologies (e.g. one as an x86 process, one as a WASI executable).
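
The recombination idea from the example can be sketched as follows: each step runs on its own thread against the same base, and the resulting diffs are collected in the order the steps were declared rather than the order in which they finished. The function `build_layers` and the use of strings as stand-ins for layers are assumptions made for this illustration, not stromatekt's actual types.

```rust
use std::thread;

/// Run every step concurrently against `base`, then return the layer
/// list in canonical order: the base first, followed by each step's
/// diff in declaration order, regardless of completion order.
/// (Hypothetical sketch; layers are modelled as plain strings.)
pub fn build_layers<F>(base: &str, steps: Vec<F>) -> Vec<String>
where
    F: FnOnce(&str) -> String + Send,
{
    let diffs: Vec<String> = thread::scope(|s| {
        // Spawn one scoped thread per step; all of them start from `base`.
        let handles: Vec<_> = steps
            .into_iter()
            .map(|step| s.spawn(move || step(base)))
            .collect();
        // Joining in spawn order is what fixes the layer order.
        handles
            .into_iter()
            .map(|h| h.join().expect("build step panicked"))
            .collect()
    });
    let mut layers = vec![base.to_string()];
    layers.extend(diffs);
    layers
}
```

Even if proc1 finishes long before proc0, the exported sequence stays [A, C0, C1], which is the property the task definition opts into.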

Planned extensions

  1. Library files for build dependencies and maintainability. Define additional tasks in a separate file, then import specific changesets they define into another specification and let the dataflow resolver figure out a solution.
  2. Reproducibility assertions via hashes, used for incremental builds.

Dependencies

~15–29MB
~441K SLoC