14 releases

0.6.6	Feb 28, 2024
0.6.4	Dec 5, 2023
0.6.0	Aug 28, 2023
0.5.2	Feb 13, 2023
0.4.0	Nov 26, 2022

GPL-3.0 license

xvc

Manage your data next to code in Git repositories and run commands when they change.

⌛ Why Xvc?

You have image, audio, media, document or asset files to track/version/backup along with the code, but don't want to copy that huge data to all Git clones.
You want to manage files in multiple locations with different subsets, some (e.g. training data) being read-only and some (e.g. models, executables) changing frequently, all versioned along with the code.
You want to store this data in local, Rsync, or S3-compatible cloud storages to share alongside the repository.
You want to specify commands that run only when input data changes, and define pipelines with steps that run only when their dependencies change.
You want to define these dependencies with files, globs spanning multiple files, text file lines defined by ranges or regexes, URLs, parameters in the YAML or JSON files, SQLite queries or any command that produces output.

🔽 Installation

You can get the binary files for Linux, macOS, and Windows from the releases page. Extract the archive and copy the file to a directory in your $PATH.

Alternatively, if you have Rust installed, you can build xvc:

$ cargo install xvc

If you want to use Xvc with Python console and Jupyter notebooks, you can also install it with pip:

$ pip install xvc

Note that pip installation doesn't make xvc available as a shell command. Please see xvc.py for details.

Completions

Xvc supports dynamic completions for bash, zsh, elvish, fish and powershell. For example, run the following to add completions for bash:

echo "source <(COMPLETE=bash xvc)" >> ~/.bashrc

See the completions section in the docs for other shells.

🚀 Initialize a directory for Xvc

$ xvc init

This command initializes the .xvc/ directory and adds a .xvcignore file for specifying paths you wish to hide from Xvc.

💡 Git is not required to run Xvc. However, running Xvc with Git is usually a good idea. Xvc can stage and commit the metadata files (under .xvc/) used to track binary files, and you can use branches for versioning as well. By default, you won't have to run Git commands to commit these metadata files. Xvc can manage the files it updates and hide your binary files from Git by default.

If you don't want to use Xvc with Git, use the --no-git option when initializing.

👣 Track binary files

Add your data files and directories for tracking:

$ xvc file track my-data/

This command calculates content hashes for the data (using BLAKE3 by default) and records them. Files are moved to content-addressed directories under .xvc/b3 and then copied back to the workspace.
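
The idea of a content-addressed cache can be sketched in a few lines of Python. This is only an illustration, not Xvc's implementation: the cache layout (`<first 3 hex chars>/<rest>`) and the function names are hypothetical, and `hashlib.blake2b` stands in for BLAKE3 because BLAKE3 is not in the Python standard library.

```python
import hashlib
import shutil
import tempfile
from pathlib import Path


def content_digest(path: Path) -> str:
    """Hash file contents in chunks (BLAKE2b stands in for BLAKE3 here)."""
    hasher = hashlib.blake2b(digest_size=32)
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(65536), b""):
            hasher.update(chunk)
    return hasher.hexdigest()


def track(path: Path, cache_root: Path) -> Path:
    """Move a file into the cache under its digest, then copy it back."""
    digest = content_digest(path)
    # Hypothetical layout: cache_root/<3 hex chars>/<remaining hex chars>
    cached = cache_root / digest[:3] / digest[3:]
    cached.parent.mkdir(parents=True, exist_ok=True)
    if cached.exists():
        path.unlink()  # identical content is already in the cache
    else:
        shutil.move(str(path), str(cached))
    shutil.copy2(cached, path)  # restore a working copy in the workspace
    return cached


with tempfile.TemporaryDirectory() as tmp:
    workspace = Path(tmp)
    a = workspace / "a.bin"
    b = workspace / "b.bin"
    a.write_bytes(b"same content")
    b.write_bytes(b"same content")
    cache = workspace / "cache"
    # Two files with identical content map to a single cache entry.
    print(track(a, cache) == track(b, cache))  # prints True
```

Because the cache path is derived from the content, identical files are stored once no matter how many workspace paths refer to them.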

💡Tip: You can specify different recheck (checkout) methods for files and directories depending on your use case. Symlinks and hardlinks to the files under the Xvc cache don't consume additional space, but they are read-only. You can also use (copy-on-write) reflinks if your file system supports them and Xvc is built with the reflink feature.

🫧 Checkout a subset of files as symlinks

You can copy and recheck (checkout) subsets of files from the Xvc cache as symlinks to create multiple views. This is useful when you need read-only access that doesn't consume additional space.

$ xvc file copy my-data/ another-view-to-my-data/
$ xvc file recheck another-view-to-my-data/ --as symlink
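
Why symlinked views cost no extra space can be shown with a small Python sketch. It is an illustration under assumptions, not Xvc's code: the cache entry name `abc123` and the function `recheck_as_symlink` are hypothetical, and it needs a platform where creating symlinks is permitted.

```python
import os
import tempfile
from pathlib import Path


def recheck_as_symlink(cached: Path, target: Path) -> None:
    """Point a workspace path at a cached file without copying its bytes."""
    target.parent.mkdir(parents=True, exist_ok=True)
    if target.is_symlink() or target.exists():
        target.unlink()
    os.symlink(cached, target)


with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    cached = root / "cache" / "abc123"  # hypothetical cache entry
    cached.parent.mkdir(parents=True)
    cached.write_bytes(b"large training file")
    # Two views of the same data, neither of which stores a second copy.
    for view in ("my-data", "another-view-to-my-data"):
        recheck_as_symlink(cached, root / view / "file.bin")
    link = root / "another-view-to-my-data" / "file.bin"
    print(link.is_symlink(), link.read_bytes() == cached.read_bytes())
```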

💡 xvc file copy and xvc file move don't require file contents to be available. Xvc works only with their metadata, so you can organize files without their content being copied to the workspace or cache.

💡 If you installed completions to your shell, Xvc completes file names even if they are not available in your local paths.

🌁 Send files to the cloud services

Configure a cloud storage to share the files you track with Xvc.

$ xvc storage new s3 --name my-storage --region us-east-1 --bucket-name xvc

You can send the files to this storage.

$ xvc file send --to my-storage

You can also send a subset of the files.

$ xvc file send 'my-data/training/*' --to my-storage

Xvc supports external directories, Rsync, AWS S3, Google Cloud Storage, MinIO, Cloudflare R2, Wasabi, and DigitalOcean Spaces. Please create an issue if you want Xvc to support another cloud storage service.

💡 Xvc also supports any command to upload/download files. If your favorite service is not listed or you want to use another tool (s5cmd, rclone, etc.), you can specify a generic storage by supplying shell commands to upload and download.

📌 Important: Xvc never stores credentials to your connections and expects them to be available in the environment. It never makes network requests (for tracking, statistics, etc.) without your knowledge. You can compile without cloud connection support in case you want to make sure that it makes no connections to outside services.

🪣 Get files from cloud services

When you (or someone else) want to access these files later, you can clone the Git repository and get the files from the storage.

$ git clone https://example.com/my-machine-learning-project
Cloning into 'my-machine-learning-project'...

$ cd my-machine-learning-project
$ xvc file bring my-data/ --from my-storage

This approach ensures convenient access to files from the shared storage when needed.

💡Tip: You don't have to reconfigure the storage after cloning, but you need to have valid credentials as environment variables to access the storage. Xvc never stores any credentials.

🫖 Share files from cloud storages for a limited time

You can share Xvc-tracked files from S3-compatible storages for a specified period.

$ xvc file share --storage my-storage dir-0001/file-0001.bin --duration 1h
https://my-storage.s3.eu-central-1.amazonaws.com/xvc....

You can share the link with others, and they will be able to access the file for an hour. The default period is 24 hours.

🥤Create a data pipeline

Suppose you have a script to preprocess files in a directory and you want to run it when the files in the my-data/train directory change. We first define a step in the pipeline that will run the script.

$ xvc pipeline step new --step-name preprocess --command 'python3 src/preprocess.py'

Each command is associated with a step and each step has a command.

🔗 Add a dependency to a pipeline step

When we want to create a dependency for a command, we use the xvc pipeline step dependency command with various parameters.

We want to define two dependencies for the preprocess step we created previously. We'll make the preprocess step depend on:

The src/preprocess.py source file itself, so when we change the script, we'll run the step again

$ xvc pipeline step dependency --step-name preprocess --file src/preprocess.py

data/raw/*.jpg files that the script works on.

$ xvc pipeline step dependency -s preprocess --glob 'data/raw/*.jpg'

⚠️ Most shells expand globs before running the command, so you need to quote globs to pass them as strings without expansion. Xvc expands these globs itself.
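
To see why the pattern has to arrive as a literal string, here is a minimal Python sketch of a program that expands a glob itself against a list of paths it already knows. The function name and the path list are hypothetical, and `fnmatch` is only a stand-in for whatever glob matcher a tool uses.

```python
import fnmatch


def expand_glob(pattern: str, known_paths: list[str]) -> list[str]:
    """Match a glob pattern against known paths instead of the local disk."""
    # Note: fnmatch's `*` also matches `/`, unlike most shell globbing.
    return sorted(p for p in known_paths if fnmatch.fnmatchcase(p, pattern))


# Hypothetical list of paths recorded by a tool; none need to exist locally.
known = [
    "data/raw/a.jpg",
    "data/raw/b.jpg",
    "data/raw/notes.txt",
    "data/processed/a.jpg",
]
print(expand_glob("data/raw/*.jpg", known))
# ['data/raw/a.jpg', 'data/raw/b.jpg']
```

If the shell had expanded the pattern first, the program would receive only the files that happen to exist in the current directory tree, and the pattern itself would be lost.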

🛝 Run pipeline

After you define the pipeline, you can run it by:

$ xvc pipeline run
[DONE] preprocess (python3 src/preprocess.py)
[OUT] [preprocess] 
...

[DONE] preprocess (python3 src/preprocess.py)

💡 Xvc runs pipeline steps in parallel if they are not interdependent. You can specify the maximum number of parallel processes.
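
Which steps can run side by side follows from the dependency graph. The sketch below groups steps into batches with Python's `graphlib`; it is not Xvc's scheduler, and the step names other than preprocess are hypothetical.

```python
from graphlib import TopologicalSorter


def parallel_batches(dependencies: dict[str, set[str]]) -> list[list[str]]:
    """Group steps into batches; steps in one batch don't depend on each other."""
    sorter = TopologicalSorter(dependencies)
    sorter.prepare()  # raises graphlib.CycleError on circular dependencies
    batches = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())
        batches.append(ready)
        sorter.done(*ready)
    return batches


# Hypothetical pipeline: step name -> steps it depends on.
steps = {
    "preprocess": set(),
    "download-labels": set(),
    "train": {"preprocess", "download-labels"},
    "visualize": {"train"},
}
print(parallel_batches(steps))
# [['download-labels', 'preprocess'], ['train'], ['visualize']]
```

Each inner list could be handed to a worker pool whose size is capped at the maximum number of parallel processes.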

🪡 Add fine grained dependencies to steps

Xvc allows many kinds of dependencies:

Steps can explicitly depend on other steps when they are required to run serially.
Steps can depend on single files or groups of files defined by globs. For globs, you can also get which files are added, deleted or updated with glob-items.

💡 Similar to Git, Xvc doesn't track directories per se. You can define glob dependencies that describe the files in a directory, like dir/*, when you want to track all files in it.
You can specify steps to depend only on a subset of lines in a file with line ranges or regular expressions. You can also get which lines are added, deleted or updated with the more granular line-items or regex-items dependencies.
If you track (hyper)parameters for a build or model training process in JSON or YAML files, you can specify steps to depend on these parameters.
If you want your steps to run when the content of an HTTP(S) URL changes, you can specify this with URL dependencies.
If you want your step to run when the output of an SQLite query changes, you can specify it with SQLite dependencies.
If none of the dependency types fits your needs, you can also specify a command that will be run to check whether a step is invalidated.
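
As an illustration of the parameter dependency idea, the following Python sketch reports a change only when one selected value in a JSON document differs. The dotted key syntax and the function names are assumptions made for this example and are not Xvc's interface.

```python
import json


def get_param(document: dict, dotted_key: str):
    """Follow a dotted key such as 'train.lr' into nested dictionaries."""
    value = document
    for part in dotted_key.split("."):
        value = value[part]
    return value


def param_changed(old_text: str, new_text: str, dotted_key: str) -> bool:
    """Report a change only when the selected parameter differs."""
    old_value = get_param(json.loads(old_text), dotted_key)
    new_value = get_param(json.loads(new_text), dotted_key)
    return old_value != new_value


before = '{"train": {"epochs": 10, "lr": 0.01}, "seed": 1}'
after = '{"train": {"epochs": 10, "lr": 0.02}, "seed": 1}'
print(param_changed(before, after, "train.lr"))      # True
print(param_changed(before, after, "train.epochs"))  # False
```

A step that depends on `train.epochs` would not need to rerun here, even though the file as a whole has changed.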

🖇️ Example to add a dependency when only certain lines in a file change

Suppose you have a list of IQ scores in a file.

Ada Harris,128
Alan Thompson,125
Brian Shaffer,122
Brian Wilson,94
Dr. Brittany Chang,103
Brittany Smith,104
David Brown,113
Emily Davis,97
Grace White,130
James Taylor,101
Dr. Jane Doe,105
Jessica Lee,102
John Smith,110
Laura Martinez,110
Dr. Linus Martin,118
Mallory Johnson,105
Mallory Payne MD,99
Margaret Clark,122
Michael Johnson,92
Robert Anderson,105
Sarah Wilson,104
Sherry Brown,115
Sherry Leonard,117
Susan Davis,107
Dr. Susan Swanson,132

We're only interested in the IQ scores of those with Dr. in front of their names. Let's create a regex search dependency to run a command only when a line with a Dr. title is added to the file.

Our command will be collecting all lines with an initial Dr. to another file.

$ xvc pipeline step new --step-name dr-iq --command 'echo "${XVC_ADDED_REGEX_ITEMS}" >> dr-iq-scores.csv '
$ xvc pipeline step dependency --step-name dr-iq --regex-items 'iq-scores.csv:/^Dr\..*'

The first line specifies a command that, when run, appends the value of the ${XVC_ADDED_REGEX_ITEMS} environment variable to the dr-iq-scores.csv file.

The second line specifies the dependency which will also populate the ${XVC_ADDED_REGEX_ITEMS} environment variable in the command.

Some dependency types like regex items, line items and glob items inject environment variables to the shells running the step commands. If you have thousands of files specified by a glob, but want to run a script only on the added files after the last run, you can use these environment variables.

When you run the pipeline, a file named dr-iq-scores.csv will be created.

$ xvc pipeline run
[DONE] dr-iq (echo "${XVC_ADDED_REGEX_ITEMS}" >> dr-iq-scores.csv )

$ cat dr-iq-scores.csv
Dr. Brittany Chang,103
Dr. Jane Doe,105
Dr. Linus Martin,118
Dr. Susan Swanson,132

When the file changes, e.g. when another line matching the dependency regex is added to the iq-scores.csv file, the command will append the new line to the dr-iq-scores.csv file.

$ zsh -cl 'echo "Dr. John Doe,123" >> iq-scores.csv'

$ xvc pipeline run
[DONE] dr-iq (echo "${XVC_ADDED_REGEX_ITEMS}" >> dr-iq-scores.csv )

$ cat dr-iq-scores.csv
Dr. Brittany Chang,103
Dr. Jane Doe,105
Dr. Linus Martin,118
Dr. Susan Swanson,132
Dr. John Doe,123

Note that ${XVC_ADDED_REGEX_ITEMS} contains only the added lines, not all of the lines the regex matches. So, we can work on just the added elements without rerunning the command for all matching elements.
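
The notion of added regex items can be sketched in Python as the matching lines of the new file that were not among the matching lines of the old one. This is an illustration of the concept with hypothetical function names, not a description of how Xvc computes the variable.

```python
import re


def matching_lines(text: str, pattern: str) -> list[str]:
    """Return the lines of text that the pattern matches."""
    regex = re.compile(pattern)
    return [line for line in text.splitlines() if regex.search(line)]


def added_regex_items(old_text: str, new_text: str, pattern: str) -> list[str]:
    """Lines matching the pattern in the new text that did not match before."""
    seen = set(matching_lines(old_text, pattern))
    return [line for line in matching_lines(new_text, pattern) if line not in seen]


old = "Ada Harris,128\nDr. Jane Doe,105\nJohn Smith,110\n"
new = old + "Dr. John Doe,123\n"
print(added_regex_items(old, new, r"^Dr\..*"))  # ['Dr. John Doe,123']
```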

🛃 Export, edit and import a pipeline with YAML or JSON files

Unlike some other tools, Xvc doesn't require (or allow) you to specify pipelines in YAML files. Nevertheless, you can export the pipeline to JSON or YAML, edit it in your editor, and import it back. This way you can fix typos in commands, remove steps completely, or duplicate the pipeline under a new name.

$ xvc pipeline export --file my-pipeline.json

$ cat my-pipeline.json
{
  "name": "default",
  "steps": [
    {
      "command": "python3 -m pip install --quiet --user -r requirements.txt",
      "dependencies": [
        {
          "File": {
            "content_digest": {
              "algorithm": "Blake3",
              "digest": [
                43,
                86,
                244,
                111,
                13,
                243,
                28,
                110,
                140,
                213,
                105,
                20,
                239,
                62,
                73,
                75,
                13,
                146,
                82,
                17,
                148,
                152,
                66,
                86,
                154,
                230,
                154,
                246,
                213,
                214,
                40,
                119
              ]
            },
            "path": "requirements.txt",
            "xvc_metadata": {
              "file_type": "File",
              "modified": {
                "nanos_since_epoch": [..],
                "secs_since_epoch": [..]
              },
              "size": 14
            }
          }
        }
      ],
      "invalidate": "ByDependencies",
      "name": "install-deps",
      "outputs": []
    },
    {
      "command": "python3 generate_data.py",
      "dependencies": [
        {
          "Step": {
            "name": "install-deps"
          }
        }
      ],
      "invalidate": "ByDependencies",
      "name": "generate-data",
      "outputs": []
    },
    {
      "command": "echo \"${XVC_ADDED_REGEX_ITEMS}\" >> dr-iq-scores.csv ",
      "dependencies": [
        {
          "RegexItems": {
            "lines": [
              "Dr. Brian Shaffer,122",
              "Dr. Susan Swanson,81",
              "Dr. Brittany Chang,82",
              "Dr. Mallory Payne MD,70",
              "Dr. Sherry Leonard,93",
              "Dr. Albert Einstein,144"
            ],
            "path": "iq-scores.csv",
            "regex": "^Dr\\..*",
            "xvc_metadata": {
              "file_type": "File",
              "modified": {
                "nanos_since_epoch": [..],
                "secs_since_epoch": [..]
              },
              "size": 19021
            }
          }
        }
      ],
      "invalidate": "ByDependencies",
      "name": "dr-iq",
      "outputs": [
        {
          "File": {
            "path": "dr-iq-scores.csv"
          }
        }
      ]
    },
    {
      "command": "python3 visualize.py",
      "dependencies": [
        {
          "File": {
            "content_digest": null,
            "path": "dr-iq-scores.csv",
            "xvc_metadata": null
          }
        }
      ],
      "invalidate": "ByDependencies",
      "name": "visualize",
      "outputs": []
    }
  ],
  "version": 1,
  "workdir": ""
}

After you edit the file with changes, you can import the file to check its consistency and update the pipeline definition.

$ xvc pipeline import --file my-pipeline.json --overwrite
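
Instead of editing the exported file by hand, you could script the change. The Python sketch below relies only on the `steps`, `name`, `command` and `digest` keys visible in the export above; the helper names are hypothetical, and the sample pipeline is a cut-down stand-in for a real export.

```python
import json


def digest_to_hex(byte_values: list[int]) -> str:
    """Render a digest stored as a list of byte values as a hex string."""
    return bytes(byte_values).hex()


def replace_command(pipeline: dict, step_name: str, new_command: str) -> dict:
    """Return a copy of the pipeline with one step's command replaced."""
    edited = json.loads(json.dumps(pipeline))  # deep copy via JSON round trip
    for step in edited["steps"]:
        if step["name"] == step_name:
            step["command"] = new_command
            return edited
    raise KeyError(f"no step named {step_name!r}")


# Cut-down stand-in for the contents of my-pipeline.json.
pipeline = {
    "name": "default",
    "steps": [
        {"name": "generate-data", "command": "python3 generate_data.py"},
        {"name": "visualize", "command": "python3 visualize.py"},
    ],
    "version": 1,
    "workdir": "",
}
edited = replace_command(pipeline, "visualize", "python3 visualize.py --dark")
print(json.dumps(edited["steps"][1]))
print(digest_to_hex([43, 86, 244]))  # 2b56f4
```

In practice you would load the dictionary with `json.load` from the exported file and write the edited copy back with `json.dump` before importing it.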

🎋 Visualize a pipeline in Graphviz or Mermaid

You can get the pipeline in Graphviz DOT format to convert to an image.

$ zsh -cl 'xvc pipeline dag --format graphviz | dot -Tpng -opipeline.png'

You can also ask for a Mermaid diagram:

$ xvc pipeline dag --format mermaid
flowchart TD
    n0["preprocess"]
    n1["data/*"] --> n0
    n2["train"]
    n0["preprocess"] --> n2

You can embed this output in Markdown files, GitHub PRs or Jupyter notebooks.

Please check docs.xvc.dev for documentation.

🤟 Big Thanks

xvc stands on the following crates:

Xvc has a deep CLI that has subcommands of subcommands (e.g. xvc storage new s3), and all these work with minimum bugs thanks to clap. With its dynamic completion support through clap_complete, Xvc can complete almost anything in your shell.
serde allows all data structures to be stored in text files. Special thanks from xvc-ecs for serializing components in an ECS with a single line of code.
Xvc processes files in parallel with pipelines and parallel iterators thanks to crossbeam and rayon.
Thanks to strum, Xvc uses enums extensively and converts almost everything to typed values from strings.
Xvc uses rust-s3 to connect to S3 and compatible storage services. It employs excellent tokio for fast async Rust. These cloud storage features can be turned off thanks to Rust conditional compilation.
Without the implementations of BLAKE3, BLAKE2, SHA-2 and SHA-3 from the Rust crypto crates, Xvc couldn't detect file changes so fast.
Xvc handles Git operations through calling the Git binary and (more and more) with gix.
trycmd is used to run all example commands in this file, reference, and how-to documentation at every PR. It makes sure that the documentation is always up-to-date and shown commands work as described. We start development by writing documentation and implementing them thanks to trycmd.
Many thanks to small and well built crates, reflink, relative-path, path-absolutize, fast-glob for file system and glob handling.
Thanks to sad_machine for providing a State Machine implementation that I used in xvc pipeline run. A DAG composed of State Machines made it possible to run pipeline steps in parallel with a clean separation of process states.
Thanks to thiserror and anyhow for making error handling a breeze. These two crates make me feel I'm doing something good for humanity when handling errors.
Xvc is split into many crates and owes this organization to cargo workspaces.

And, biggest thanks to Rust designers, developers and contributors. It's a fabulous language and environment to work with.

🚁 Support

If you found a bug, please create an issue.
You can use discussions to ask questions. I'll answer as much as possible. Thank you.
I don't follow any other sites regularly. You can also reach me at emre@xvc.dev

👐 Contributing

Star this repo. I feel very happy for every star and send my best wishes to you. That's a certain win to spend your two seconds for me. Thanks.
Use xvc. Tell me how it works for you, read the documentation, report bugs, discuss features.
Please note that I don't accept large code PRs. Please open an issue to discuss your idea and write/modify documentation before sending a PR. I'm happy to discuss and help you to implement your idea.

📜 License

Xvc is licensed under the GNU GPL 3.0 License.

In the future, some crates may be licensed under other licenses for easier integration. If you want to use some of the crates in your project under another license, please contact me at emre@xvc.dev

Any contributor to Xvc is assumed to be aware that licenses can be changed.

🌦️ Future and Maintenance

I'm using Xvc daily for repositories up to 2TB and I'm happy with it. Tracking all my files with Git via arbitrary servers and cloud providers is something I always need. I'm happy to improve and maintain it as long as I use it.

Given that I've been working on this for the last three years for pure technical bliss, you can expect me to work on it more.

⚠️ Disclaimer

This software is fresh and ambitious. Although I use it and test it under close to real-world conditions, it hasn't stood the test of time yet. Please back up your data.
