#web-archive #warc #wacz #cdxj #save-the-internet

wacksy

Experimental library for writing WACZ archives

9 releases

Uses new Rust 2024

0.2.0 Nov 20, 2025
0.1.3 Oct 31, 2025
0.1.1 Sep 30, 2025
0.0.2 Aug 6, 2025
0.0.1-alpha Apr 5, 2025

#848 in Encoding

MIT license

42KB
650 lines

Wacksy


An experimental Rust library for reading and writing ᴡᴀᴄᴢ files.

Install

With cargo installed, run the following command in your project directory:

cargo add wacksy

Example

This library provides two main ᴀᴘɪ functions. from_file() takes a ᴡᴀʀᴄ file and returns a structured representation of a ᴡᴀᴄᴢ object. as_zip_archive() takes a ᴡᴀᴄᴢ object and zips it up to a byte array using rawzip.

use std::{error::Error, fs, path::Path};

use wacksy::WACZ; // assumes WACZ is exported at the crate root

fn main() -> Result<(), Box<dyn Error>> {
    let warc_file_path = Path::new("example.warc.gz"); // set path to your ᴡᴀʀᴄ file
    let wacz_object = WACZ::from_file(warc_file_path)?; // index the ᴡᴀʀᴄ and create a ᴡᴀᴄᴢ object
    let zipped_wacz: Vec<u8> = wacz_object.as_zip_archive()?; // zip up the ᴡᴀᴄᴢ
    fs::write("example.wacz", zipped_wacz)?; // write out to file
    Ok(())
}

See the documentation for more details.
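
Because as_zip_archive() returns the archive as plain bytes, the result can be sanity-checked before it is written to disk. The sketch below assumes only the general ᴢɪᴘ convention that a non-empty archive begins with the local file header signature "PK\x03\x04". The helper looks_like_zip is hypothetical and not part of wacksy, and the sample bytes stand in for real output.

```rust
/// Signature that opens the first local file header of a non-empty ZIP archive.
const ZIP_LOCAL_HEADER: [u8; 4] = [0x50, 0x4B, 0x03, 0x04]; // "PK\x03\x04"

/// Hypothetical helper (not part of wacksy): true if `bytes` starts with the
/// ZIP local file header signature.
fn looks_like_zip(bytes: &[u8]) -> bool {
    bytes.starts_with(&ZIP_LOCAL_HEADER)
}

fn main() {
    // Stand-in bytes; in practice, pass the Vec<u8> returned by as_zip_archive().
    let sample: Vec<u8> = vec![0x50, 0x4B, 0x03, 0x04, 0x14, 0x00];
    println!("looks like a zip: {}", looks_like_zip(&sample));
}
```

A check like this only confirms the container format. It says nothing about whether the ᴡᴀᴄᴢ contents are complete or valid.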

Background

According to Ed Summers, a ᴡᴀᴄᴢ file is "really just a ᴢɪᴘ file that contains ᴡᴀʀᴄ data and metadata at predictable file locations."[^code4lib_talk]

The example in the spec outlines what a ᴡᴀᴄᴢ file should contain:

archive
└── data.warc.gz
datapackage.json
datapackage-digest.json
indexes
└── index.cdx.gz
pages
└── pages.jsonl
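
The layout above can be written out as a list of paths and compared against the entries of an archive. This is a minimal sketch that takes the spec example literally: real archives may name their ᴡᴀʀᴄ files differently, and missing_paths is a hypothetical helper, not a wacksy function.

```rust
/// Paths that the spec example places inside a WACZ archive.
const EXPECTED_PATHS: [&str; 5] = [
    "archive/data.warc.gz",
    "datapackage.json",
    "datapackage-digest.json",
    "indexes/index.cdx.gz",
    "pages/pages.jsonl",
];

/// Hypothetical helper: return every expected path that is absent from
/// `entries`, in the order the paths are listed above.
fn missing_paths(entries: &[&str]) -> Vec<&'static str> {
    EXPECTED_PATHS
        .iter()
        .copied()
        .filter(|expected| !entries.contains(expected))
        .collect()
}

fn main() {
    // Entry names as they might be listed from a ZIP central directory.
    let entries = ["archive/data.warc.gz", "datapackage.json", "pages/pages.jsonl"];
    for path in missing_paths(&entries) {
        println!("missing: {path}");
    }
}
```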

[^code4lib_talk]: For more discussion of the concept, see the talk "Web Archives in Digital Repositories" by Ilya Kreymer and Ed Summers at Code4Lib 2022.

Similar libraries

License

MIT © Bodleian Libraries and contributors

Dependencies

~2.5MB
~47K SLoC