3 releases

Version | Released |
---|---|
0.1.2 | Aug 8, 2024 |
0.1.1 | Aug 8, 2024 |
0.1.0 | Aug 8, 2024 |
WALL•A
The `wall-a` CLI tool is intended to support writing data into a binary format
in the context of a git repository.
Motivation
My initial idea was to write benchmark/profiling data to the git repository and store it in a format that would (mostly) not cause issues for git.
I wanted to write one piece of data per commit, and then have some way to get the aggregate data out so that I could visualize or otherwise use the benchmark data.
Design
The tool has two commands:
- `append`: reads JSON data from STDIN and appends it to a staging file in a specified "data" directory. If the staging file grows too large, its contents are read, merged together, and written in a binary format (CBOR) to a new "archive" file. The archive file has a timestamp as part of its filename, so it is ordered with respect to all previous archive files.
- `read`: reads all the archive files in order by filename, merges the values each contains, then reads and merges the staging file values as well. It then writes the final value to standard output.
It is important to note that the JSON data written by `append` is merged with all previous
data when it is `read`. The merge function works as follows:
- For a pair of JSON objects, it recursively merges common keys and otherwise takes
  the values for non-common keys. For example, merging
  `{"key": "value1", "some": "other"}` and `{"key": "value2", "un": "related"}`
  gives `{"key": "value2", "some": "other", "un": "related"}`.
- For a pair of JSON arrays, it concatenates the new value after the old one. For
  example, merging `[1, 2, 3]` and `[4, 5, 6]` gives `[1, 2, 3, 4, 5, 6]`.
- For all other combinations, it always takes the newer JSON value.
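
The three merge rules can be sketched in a few lines of Python. This is only an illustration of the rules as described here, not the tool's actual code; the function name `merge` and the use of decoded Python values (dicts, lists, scalars) are assumptions made for the example.

```python
from typing import Any


def merge(old: Any, new: Any) -> Any:
    """Merge two decoded JSON values; `new` is the more recent one."""
    if isinstance(old, dict) and isinstance(new, dict):
        merged = dict(old)  # start with every key from the older object
        for key, value in new.items():
            # Common keys merge recursively; keys only in `new` are copied.
            merged[key] = merge(old[key], value) if key in old else value
        return merged
    if isinstance(old, list) and isinstance(new, list):
        # Arrays concatenate, with the newer values after the older ones.
        return old + new
    # Any other combination (scalars, mismatched types): newer value wins.
    return new


# Nested objects and arrays combine according to the same rules.
combined = merge(
    {"bench": {"runs": [1]}},
    {"bench": {"runs": [2], "unit": "ms"}},
)
```

Here `combined` is `{"bench": {"runs": [1, 2], "unit": "ms"}}`. Note that a mismatched pair, such as an object merged with an array, falls through to the last rule and keeps only the newer value.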
The design is somewhat inspired by https://simonwillison.net/2020/Oct/9/git-scraping/:
I wanted `git diff` to work for the most recent data. However, I didn't want
a huge JSONL file that grew without bound, so as a compromise I added the
idea of the "archive" file.
The "archive" file is just a snapshot of the staging file data, converted to a binary format. This binary file can be much smaller and faster to read than the staging file. The downside is that the file is binary and doesn't interact well with git. Each archive file is written only once, to reduce the number of copies of the file git needs to store in the history.
The staging file is just a newline-delimited JSON file (JSONL). This format is great
for `git diff`, since you can easily see the newly added data and the data which was
transferred to the archive file.
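
The staging/archive flow described above can be sketched roughly as follows. This is a simplified illustration under several assumptions, not the tool's implementation: the file names, the size threshold, and the choice to empty the staging file after a rollover are all hypothetical, and the archive is written as JSON here purely to keep the sketch stdlib-only, where the real tool writes CBOR.

```python
import json
import time
from functools import reduce
from pathlib import Path

STAGING_NAME = "staging.jsonl"  # hypothetical file name
MAX_STAGING_BYTES = 64 * 1024   # hypothetical rollover threshold


def merge(old, new):
    """Objects merge recursively, arrays concatenate, otherwise newer wins."""
    if isinstance(old, dict) and isinstance(new, dict):
        return {**old, **{k: merge(old[k], v) if k in old else v
                          for k, v in new.items()}}
    if isinstance(old, list) and isinstance(new, list):
        return old + new
    return new


def read_jsonl(path: Path) -> list:
    """Decode every non-empty line of a JSONL file; a missing file yields []."""
    if not path.exists():
        return []
    return [json.loads(line)
            for line in path.read_text().splitlines() if line.strip()]


def append(data_dir: Path, value, max_bytes: int = MAX_STAGING_BYTES) -> None:
    """Append one value to the staging file; roll over if it grows too large."""
    data_dir.mkdir(parents=True, exist_ok=True)
    staging = data_dir / STAGING_NAME
    with staging.open("a") as f:
        f.write(json.dumps(value) + "\n")
    if staging.stat().st_size > max_bytes:
        snapshot = reduce(merge, read_jsonl(staging))
        # Zero-padded timestamp so filename order matches creation order.
        archive = data_dir / f"archive-{time.time_ns():020d}.json"
        # Stand-in for CBOR: JSON keeps this sketch free of third-party packages.
        archive.write_text(json.dumps(snapshot))
        staging.write_text("")  # assumption: staged data now lives in the archive


def read(data_dir: Path):
    """Merge archives in filename order, then staging; None if there is no data."""
    values = [json.loads(p.read_text())
              for p in sorted(data_dir.glob("archive-*.json"))]
    values += read_jsonl(data_dir / STAGING_NAME)
    return reduce(merge, values) if values else None
```

In this sketch each archive is created once and never rewritten, so git stores a single copy of it, while the small staging file is the only file that changes from commit to commit.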