0.1.0 Feb 28, 2019

MIT/Apache

taro

Taro extract is a delicious flavoring commonly used to make purple bubble tea. It's also a command-line tool you can use to extract a tar archive in-place.

Normal tar extraction requires approximately twice the disk space of the original archive. This tool can be used with only 512 extra bytes of disk space, no matter how large the original archive is.

usage

To extract the file recipe.tar, simply run taro recipe.tar at the command line. recipe.tar will be deleted, and all of its contents will be created in the current working directory. Only an extra 512 bytes of disk space will be used in the process.

obligatory warning

There are many different tar implementations, each with slightly different ways of handling data. This tool is modeled to be compatible with GNU tar, but it has not been used as extensively as the original GNU tar implementation. It is possible that some tar archives may not be extracted correctly. Removal of the original archive is a deliberate feature, so be warned that YOU MAY LOSE YOUR DATA. Be responsible and keep backups. If you can find an example of a tar archive that cannot be properly extracted using this tool, please file an issue or submit a PR.

Note that in particular, character-special files, block-special files, and FIFOs currently will not be recreated using taro.
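For readers unfamiliar with how these entry kinds appear in an archive, here is a minimal sketch based on the POSIX ustar header layout, where a single "typeflag" byte at offset 156 of each 512-byte header identifies the entry. The names EntryKind, entry_kind, and is_device_or_fifo are hypothetical and chosen for illustration; this is not taken from taro's source and may differ from how taro inspects headers.

```rust
/// Kinds of tar entries, keyed by the POSIX ustar typeflag byte.
/// (Hypothetical type for illustration; not part of taro's API.)
#[derive(Debug, PartialEq)]
enum EntryKind {
    Regular,
    HardLink,
    Symlink,
    CharSpecial,
    BlockSpecial,
    Directory,
    Fifo,
    Other(u8),
}

/// Offset of the typeflag byte inside a 512-byte ustar header.
const TYPEFLAG_OFFSET: usize = 156;

/// Classify a header block by its typeflag byte.
fn entry_kind(header: &[u8; 512]) -> EntryKind {
    match header[TYPEFLAG_OFFSET] {
        b'0' | 0 => EntryKind::Regular, // NUL is the pre-POSIX spelling of a regular file
        b'1' => EntryKind::HardLink,
        b'2' => EntryKind::Symlink,
        b'3' => EntryKind::CharSpecial,
        b'4' => EntryKind::BlockSpecial,
        b'5' => EntryKind::Directory,
        b'6' => EntryKind::Fifo,
        other => EntryKind::Other(other),
    }
}

/// True for the three kinds the note above says are not recreated.
fn is_device_or_fifo(kind: &EntryKind) -> bool {
    matches!(
        kind,
        EntryKind::CharSpecial | EntryKind::BlockSpecial | EntryKind::Fifo
    )
}

fn main() {
    let mut header = [0u8; 512];
    header[TYPEFLAG_OFFSET] = b'6';
    let kind = entry_kind(&header);
    println!("{:?} is a device or FIFO: {}", kind, is_device_or_fifo(&kind));
}
```

A check like this could be run over an archive's headers before extraction to see whether any entries fall into the unsupported kinds.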

how it works

Normal tar archives are formatted to make single-pass extractions simple. For example, directories are located towards the beginning of the archive, whereas files within those directories are located towards the end. Unfortunately, this makes it very challenging to delete the archive in place during extraction. On Linux, files are read forwards, and a write must either overwrite existing bytes or append to the end of the file. Bytes cannot be removed from the beginning of a file; they can only be truncated from the end.
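The truncation constraint can be seen directly in the Rust standard library: a file can be shrunk from its end with File::set_len, while the bytes in front stay where they are. The sketch below is illustrative only; the helper drop_tail and the temporary file name are hypothetical and not part of taro.

```rust
use std::fs::OpenOptions;
use std::io;
use std::path::Path;

/// Shrink the file at `path` by `n` bytes, discarding them from the END.
/// Returns the new length. (Hypothetical helper for illustration.)
fn drop_tail(path: &Path, n: u64) -> io::Result<u64> {
    let file = OpenOptions::new().write(true).open(path)?;
    let len = file.metadata()?.len();
    let new_len = len.saturating_sub(n);
    // set_len truncates in place; the leading bytes are untouched.
    file.set_len(new_len)?;
    Ok(new_len)
}

fn main() -> io::Result<()> {
    let path = std::env::temp_dir().join("taro_sketch_tail.bin");
    std::fs::write(&path, b"headbodytail")?;
    let new_len = drop_tail(&path, 4)?;
    let rest = std::fs::read(&path)?;
    println!("{} bytes left: {}", new_len, String::from_utf8_lossy(&rest));
    std::fs::remove_file(&path)
}
```

The standard file API offers no matching call for discarding bytes from the front, which is why removing the start of an archive normally means rewriting everything after it.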

Luckily, these properties are enough to use the tar archive as a stack with simple push/pop operations: appending a block is a push, and reading the last block and then truncating it is a pop. The standard 512-byte block size of the tar archive format works in our favor here, since every push and pop moves exactly one block. The first step taro takes is to reverse the order of the blocks in the original archive file by pushing them onto two reversed .rat files, one for the headers and one for the file contents, and deleting them from the .tar file one block at a time. Once the .rat files are fully constructed, blocks are popped from their ends and appended to the proper extracted locations in the filesystem.
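The reversal idea can be sketched in a few lines of Rust. This is a simplified illustration, not taro's actual code: it uses a single reversed file instead of separate header and content .rat files, it does not parse tar headers at all, and the names open_rw, pop, push, and drain are hypothetical.

```rust
use std::fs::{File, OpenOptions};
use std::io::{self, Read, Seek, SeekFrom, Write};
use std::path::Path;

const BLOCK: usize = 512;

/// Open (or create) a file for both reading and writing.
fn open_rw(path: &Path) -> io::Result<File> {
    OpenOptions::new()
        .read(true)
        .write(true)
        .create(true)
        .truncate(false)
        .open(path)
}

/// Pop: read the last block, then truncate it off the file.
/// Returns None once fewer than 512 bytes remain.
fn pop(file: &mut File) -> io::Result<Option<[u8; BLOCK]>> {
    let len = file.metadata()?.len();
    if len < BLOCK as u64 {
        return Ok(None);
    }
    let mut buf = [0u8; BLOCK];
    file.seek(SeekFrom::Start(len - BLOCK as u64))?;
    file.read_exact(&mut buf)?;
    file.set_len(len - BLOCK as u64)?;
    Ok(Some(buf))
}

/// Push: append one block to the end of the file.
fn push(file: &mut File, block: &[u8; BLOCK]) -> io::Result<()> {
    file.seek(SeekFrom::End(0))?;
    file.write_all(block)
}

/// Move every block from `src` to `dst`, one at a time.
/// Because blocks leave `src` last-first, `dst` receives them in reverse order.
fn drain(src: &mut File, dst: &mut File) -> io::Result<u64> {
    let mut moved = 0;
    while let Some(block) = pop(src)? {
        push(dst, &block)?;
        moved += 1;
    }
    Ok(moved)
}

fn main() -> io::Result<()> {
    let dir = std::env::temp_dir();
    let (tar, rat, out) = (
        dir.join("taro_sketch.tar"),
        dir.join("taro_sketch.rat"),
        dir.join("taro_sketch.out"),
    );
    // Three dummy blocks stand in for an archive.
    std::fs::write(&tar, [[1u8; BLOCK], [2u8; BLOCK], [3u8; BLOCK]].concat())?;
    std::fs::write(&rat, b"")?;
    std::fs::write(&out, b"")?;

    let (mut t, mut r, mut o) = (open_rw(&tar)?, open_rw(&rat)?, open_rw(&out)?);
    let moved = drain(&mut t, &mut r)?; // .rat now holds blocks 3, 2, 1
    drain(&mut r, &mut o)?; // popping again restores 1, 2, 3
    println!("moved {} blocks; first byte of output = {}", moved, std::fs::read(&out)?[0]);

    for p in [&tar, &rat, &out] {
        std::fs::remove_file(p)?;
    }
    Ok(())
}
```

In this sketch each block is held in memory between the truncate and the append, so the combined length of the source and destination files never exceeds the original length. A real tool would also need to decide what happens if the process is interrupted between those two steps.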

can't you just add more disk space??

I was the system administrator for a team at the Student Cluster Competition at SC18, where we had to build a 3 kW computing cluster and benchmark its performance on unknown datasets. At the competition, the datasets were made available in a 190GB tar archive. For whatever reason, we had a four-node cluster with 256GB drives on each node. With intense care, I managed to clear up enough space on a single node to store the entire archive, and downloaded it from the competition FTP server through a Raspberry Pi router. It was here that I discovered tar has no way to perform an in-place extraction. I could selectively extract files, but the original archive would still take up the majority of the drive, and we needed the entire dataset before we could run anything with it. Due to competition rules, we were not allowed to add external drives or change our network configuration. We attempted to transfer the archive to another machine, extract it there, and then transfer the contents back. Unfortunately, the network speed was a limiting factor, and we eventually ran out of time.

After this ordeal, I was inspired to build an in-place tar archive extractor. It may not be necessary in most cases, but it may save someone in a time of need.
