#data-management #object-storage #file-format #data-transformation #file-encryption #metadata #pithos

pithos_lib

Library and components for encrypting / compressing pithos (.pto) files, including specification

4 releases

0.6.1 Nov 19, 2024
0.6.0 Jul 17, 2024
0.5.1 Apr 11, 2024
0.5.0 Mar 15, 2024

#1207 in Cryptography

Download history 3/week @ 2024-08-26 19/week @ 2024-09-16 24/week @ 2024-09-23 6/week @ 2024-09-30 1/week @ 2024-10-21 2/week @ 2024-11-04 157/week @ 2024-11-18 49/week @ 2024-12-02 75/week @ 2024-12-09

281 downloads per month
Used in pithos

MIT/Apache

1.5MB
6.5K SLoC

Rust License CI Codecov Dependency status


Pithos library

A library for creating handling and transforming Pithos files, an object storage optimised file format for Research Data Management (RDM).

Description

The library contains a custom Read/Write component that allows for dynamic transformation data in multiple input and output formats. While initially focused on usage as backend format in the ArunaObjectStorage the scope has significantly widened, this now support transformation from / to Pithos files for a wide variety of options including:

  • Encryption (ChaCha20-Poly1305)
  • Compression (Zstandard)
  • Indexing
  • Metadata handling
  • Bundling of multiple .pto into .tar.gz archives

Guidance

Short guidance for usage of the GenericReadWriter and similarly for the GenericStreamReadWriter custom component. For the formal file specification click here.

An overhauled generic version of customisable data transformer component for the Pithos file format and associated transformation logic. The idea is simple, you implement these simple base trait with your custom data transformation logic:

#[async_trait::async_trait]
pub trait Transformer {
    async fn process_bytes(&mut self, buf: &mut bytes::BytesMut, finished: bool) -> Result<bool>;
}

And afterwards the structs implementing Transformer can be registered in the GenericReadWriter to be plugged between the Read and Write parts of a ReadWriter.

Example:

        let file = b"This is a very very important test".to_vec();
        let mut file2 = Vec::new();

        // Create a new GenericReadWriter
        GenericReadWriter::new_with_writer(file.as_ref(), &mut file2)
            .add_transformer(ZstdEnc::new(1, false))
            .add_transformer(ZstdEnc::new(2, false)) // Double compression because we can
            .add_transformer(
                ChaCha20Enc::new(false, b"wvwj3485nxgyq5ub9zd3e7jsrq7a92ea".to_vec()).unwrap(),
            )
            .add_transformer(
                ChaCha20Enc::new(false, b"99wj3485nxgyq5ub9zd3e7jsrq7a92ea".to_vec()).unwrap(),
            )
            .add_transformer(
                ChaCha20Dec::new(Some(b"99wj3485nxgyq5ub9zd3e7jsrq7a92ea".to_vec())).unwrap(),
            )
            .add_transformer(
                ChaCha20Dec::new(Some(b"wvwj3485nxgyq5ub9zd3e7jsrq7a92ea".to_vec())).unwrap(),
            )
            .add_transformer(ZstdDec::new())
            .add_transformer(ZstdDec::new())
            .add_transformer(Filter::new(Range { from: 0, to: 3 }))
            .process()
            .await
            .unwrap();
        assert_eq!(file2, b"Thi".to_vec());

This example creates a Vec<u8> from a bytes array (implements AsyncRead) and sinks it in another Vec<u8> (impl AsynWrite). In between, custom data transformations can take place.

The example compresses the vector two times with a custom padded Zstandard compression component and afterwards encrypts the result also two times with ChaCha20-Poly1305. Afterwards all steps are reversed resulting in the original data.

Notes for own implementations

The main logic is build around, the process_bytes function.

async fn process_bytes(&mut self, buf: &mut bytes::Bytes, finished: bool, should_flush: bool) -> Result<bool>;

The idea is that your Transformer receives a mutable buffer with bytes that you MUST transform. If you have transformed (either all or via an internal buffer) the data is put back into the buffer for the next transformers process_bytes method. If should_flush is true all internal buffers should be flushed and cleared immediately.

Dependencies

~16–28MB
~447K SLoC