
bin+lib msgpack-numpy

A Rust implementation of msgpack-numpy for de-/serializing NumPy scalars and arrays that matches the Python implementation

4 releases

0.1.3 Aug 22, 2024
0.1.2 Aug 21, 2024
0.1.1 Aug 13, 2024
0.1.0 Jul 18, 2024


MIT license


msgpack-numpy-rs


This crate does in Rust what Python's msgpack-numpy does, and a lot faster. It serializes and deserializes NumPy scalars and arrays to and from the MessagePack format, using the same serialized formats as the Python counterpart, so the two can interoperate. It enables processing NumPy arrays in a separate Rust service through IPC, or saving Machine Learning results to disk (preferably paired with compression).

Overview

  • It supports bool, u8, i8, u16, i16, f16 (through the half crate), u32, i32, f32, u64, i64, f64.
  • No support for arrays with complex numbers ('c'), byte strings ('S'), unicode strings ('U'), or other non-primitive types as elements. No support for structured/tuple data types ('V'), or object-type data that needs pickling ('O').
  • However, during deserialization, we allow unsupported types to be deserialized as the Unsupported variant. This ensures deserialization can continue and the supported portions of data can be used.
  • Scalars and arrays are represented as separate types, each of which is an enum of different element type variants. They come with convenient conversion methods (backed by the num-traits crate) to the desired target primitive types. For example, f16, f32, and f64 can all be converted to f64, or to f16 with loss of precision. This allows flexibility during deserialization, without explicit pattern matching and conditional conversion, similar to NumPy's .astype(np.float64) / .astype(np.float16). Notably, bool is convertible to numeric types as (0, 1), but these methods do not convert numeric types to bool. Of course, you can do your own conversion after matching on the Bool variant.
  • Arrays use the ndarray crate, and have dynamic shapes. This enables users to leverage Rust's numeric ecosystem for the deserialized arrays.
  • Array handling using CowNDArray can be zero-copy when the array buffers in the serialized slice happen to be well aligned, although MessagePack doesn't guarantee this.
  • It depends on serde. In addition, it makes sense to use a correct MessagePack implementation, such as rmp-serde, which is used in the examples below, although it doesn't need to be a dependency, due to serde's design.
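
To make the enum-plus-conversion design above more concrete, here is a minimal sketch using only the standard library. MiniArray and its methods are hypothetical stand-ins invented for illustration; they are not this crate's types, which wrap ndarray arrays and use num-traits for the casts. The sketch only shows the shape of the idea: one variant per element type, conversion methods that return an Option, bool widening to 0/1, and an Unsupported variant that converts to nothing.

```rust
/// Hypothetical stand-in for an array enum with one variant per element type.
/// (Not the crate's NDArray; plain Vecs are used to stay dependency-free.)
#[derive(Debug, Clone, PartialEq)]
enum MiniArray {
    Bool(Vec<bool>),
    U8(Vec<u8>),
    F32(Vec<f32>),
    F64(Vec<f64>),
    /// Placeholder for element types that cannot be represented.
    Unsupported,
}

impl MiniArray {
    /// Widen any bool or numeric variant to f64; `None` for `Unsupported`.
    fn into_f64_vec(self) -> Option<Vec<f64>> {
        match self {
            // bool maps to (0, 1)
            MiniArray::Bool(v) => {
                Some(v.into_iter().map(|b| if b { 1.0 } else { 0.0 }).collect())
            }
            MiniArray::U8(v) => Some(v.into_iter().map(f64::from).collect()),
            MiniArray::F32(v) => Some(v.into_iter().map(f64::from).collect()),
            MiniArray::F64(v) => Some(v),
            MiniArray::Unsupported => None,
        }
    }

    /// Only the `Bool` variant yields bools; numeric variants are
    /// deliberately not converted.
    fn into_bool_vec(self) -> Option<Vec<bool>> {
        match self {
            MiniArray::Bool(v) => Some(v),
            _ => None,
        }
    }
}
```

With this shape, a caller can ask for f64 data without first matching on the element type, and simply handles None when the conversion is not available.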

Motivation

There hasn't been consensus on a good format that is both flexible and efficient for serializing NumPy arrays. They are unique in that they are blocks of bytes in nature, but also have numeric types and shapes. Programmers working on Machine Learning problems found MessagePack to have interesting properties. It is compact with a type system, and has a wide range of language support. The package msgpack-numpy provides de-/serialization for NumPy arrays, standalone or enclosed in arbitrary organizational depths, to be sent over the network, or saved to disk, in a compact format.

If one looks for a more production-oriented, performant format, they might consider Apache Arrow, Parquet, or Protocol Buffers. However, these formats are not as flexible as MessagePack when you need to store intermediate Machine Learning results. In practice, MessagePack with NumPy array support can be quite a good choice for many of these use cases.

This Rust version aims to provide a faster alternative to the Python version, with the same serialized formats as the Python counterpart so that the two can interoperate. You could use this as a building block for your own Machine Learning pipeline in Rust, or as a way to communicate between Python and Rust.

Examples

use std::fs::File;
use std::io::Read;
use msgpack_numpy::NDArray;

fn main() {
    let filepath = "tests/data/ndarray_bool.msgpack";
    let mut file = File::open(filepath).unwrap();
    let mut buf = Vec::new();
    file.read_to_end(&mut buf).unwrap();
    let deserialized: NDArray = rmp_serde::from_slice(&buf).unwrap();

    match &deserialized {
        NDArray::Bool(array) => {
            println!("{:?}", array);
        }
        _ => panic!("Expected NDArray::Bool"),
    }

    // returns an Option, None if conversion is not possible
    let arr = deserialized.into_u8_array().unwrap();
    println!("{:?}", arr);
}

Please see more in examples/.

Benchmarks

All benchmarks were done with 1 CPU core on an Ubuntu 22.04 instance. CPUs: Intel(R) Xeon(R) Platinum 8375C CPU @ 2.90GHz. The Rust version was compiled in release mode. We are only benchmarking the serialization and deserialization of arrays, in memory. See benches/ for the benchmark code.

This applies to the owned NDArray.

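| Array Type | Array Size | Arrays | Operation   | Python (ms) | Rust (ms) | Speedup |
|------------|------------|--------|-------------|-------------|-----------|---------|
| f32        | 1000       | 10000  | Serialize   | 56.4        | 17.1      | 3.3x    |
| f32        | 1000       | 10000  | Deserialize | 26.1        | 18.9      | 1.4x    |
| f32        | 100        | 100000 | Serialize   | 226.1       | 27.1      | 8.3x    |
| f32        | 100        | 100000 | Deserialize | 199.3       | 50.5      | 3.9x    |
| f16        | 1000       | 10000  | Serialize   | 33.5        | 4.0       | 8.5x    |
| f16        | 1000       | 10000  | Deserialize | 21.2        | 5.2       | 4.1x    |
| f16        | 100        | 100000 | Serialize   | 198.9       | 12.1      | 16.5x   |
| f16        | 100        | 100000 | Deserialize | 195.2       | 29.5      | 6.6x    |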

The Rust implementation is faster than Python in all cases, with particularly dramatic speedups for small array serialization. The Python version's de-/serialization logic is written in C through NumPy, but small arrays reduce this benefit because each array is a Python object. Notably, the Python version deserializes faster than it serializes, while the Rust version serializes faster than it deserializes. This range of array sizes is typical for Machine Learning use cases, such as feature embeddings, so Rust can help when performance is needed.

Zero-Copy Deserialization (with Good Alignment)

For the above arrays, the array buffers always seem to be misaligned during deserialization, so we can't just borrow the data from the serialized slice as the targeted typed array, but instead pay for extra allocation. This is because the MessagePack format doesn't guarantee alignment.
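
The borrow-or-copy decision described here can be sketched with the standard library alone. This is an illustration of the general technique, not this crate's implementation: f32_view is a hypothetical helper, it works on a plain byte slice rather than an ndarray, and it assumes the bytes are in the host's native byte order.

```rust
use std::borrow::Cow;

/// Interpret `bytes` as native-endian f32 values.
/// Borrows the input when it is suitably aligned, otherwise copies.
fn f32_view(bytes: &[u8]) -> Cow<'_, [f32]> {
    // SAFETY: every 4-byte pattern is a valid f32, and `align_to` only
    // returns a middle slice that is correctly aligned for f32.
    let (prefix, middle, suffix) = unsafe { bytes.align_to::<f32>() };
    if prefix.is_empty() && suffix.is_empty() {
        // The whole input is aligned and a multiple of 4 bytes: zero-copy.
        Cow::Borrowed(middle)
    } else {
        // Misaligned (or ragged) input: pay for an allocation and a copy.
        // Trailing bytes that do not form a full f32 are ignored.
        Cow::Owned(
            bytes
                .chunks_exact(4)
                .map(|c| f32::from_ne_bytes([c[0], c[1], c[2], c[3]]))
                .collect(),
        )
    }
}
```

Either branch yields the same values; only the cost differs, which is the effect the benchmarks in this section measure.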

In most cases, however, there is a good chance of alignment, and we can borrow the array buffer data directly when that happens. This is demonstrated in the following benchmarks. We choose CowNDArray, shape (1024, 2048), 10 arrays each time for demonstration.

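| Data Type | Operation                | Python (ms) | Rust (ms) | Speedup |
|-----------|--------------------------|-------------|-----------|---------|
| f16       | Serialize                | 42.8        | 23.4      | 1.8x    |
| f16       | Deserialize (NDArray)    | 21.6        | 20.4      | 1.1x    |
| f16       | Deserialize (CowNDArray) | -           | 10.5      | 2.1x    |
| f32       | Serialize                | 87.8        | 43.5      | 2.0x    |
| f32       | Deserialize (NDArray)    | 44.2        | 41.4      | 1.1x    |
| f32       | Deserialize (CowNDArray) | -           | 34.5      | 1.3x    |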

Deserialization time went down! For f16, the chance of good alignment is about 1/2, and for f32 it is about 1/4. The amortized cost of allocation is therefore lower, and the benefit of zero-copy deserialization becomes visible. The drawback is that CowNDArray only supports rmp_serde::from_slice (consuming from a slice that is fully in memory), not rmp_serde::from_read (consuming from a reader in a streaming way), so you need to keep the serialized bytes alive while the array is in use (the compiler will check this).
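
The 1/2 and 1/4 figures follow from a simple counting argument, sketched below under the assumption that an array buffer is equally likely to start at any byte offset within the serialized slice. Both helpers are hypothetical names introduced for illustration. Since stable Rust's standard library has no f16 type, u16 stands in for it here because it shares the 2-byte alignment.

```rust
use std::mem::align_of;

/// Fraction of start offsets in `0..window` that are multiples of `align`.
/// Assumes every offset is equally likely.
fn aligned_fraction(align: usize, window: usize) -> f64 {
    let aligned = (0..window).filter(|offset| offset % align == 0).count();
    aligned as f64 / window as f64
}

/// Whether a byte slice starts at an address suitable for values of type `T`.
fn is_aligned_for<T>(bytes: &[u8]) -> bool {
    (bytes.as_ptr() as usize) % align_of::<T>() == 0
}
```

For a 2-byte-aligned type, every second offset qualifies, and for a 4-byte-aligned type every fourth, so aligned_fraction(2, 64) is 0.5 and aligned_fraction(4, 64) is 0.25. A check like is_aligned_for can be used to see which case a particular buffer falls into.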

If you really want complete zero-copy deserialization, you should try some other format, like Apache Arrow.

Notes

Scalar Type

There is no good reason to serialize using Scalar, because you end up representing primitive types with a lot of metadata. This type exists for compatibility reasons: it helps deserialize scalars that were already serialized this way.

Dependency on ndarray

This crate uses types from ndarray in its public API. ndarray is re-exported in the crate root so that you do not need to specify it as a direct dependency.

Furthermore, this crate is compatible with multiple versions of ndarray and therefore depends on a range of semver-incompatible versions, currently >=0.15, <0.17. Cargo does not automatically select a single version of ndarray if you depend, directly or indirectly, on anything other than that exact range. In other words, this crate will get 0.16.1 as its own, separate dependency, even if you pin ndarray to 0.15.6 in your own project. This may come as a surprise, and it leads to compilation errors like:

     = note: `ArrayBase<CowRepr<'_, f32>, Dim<IxDynImpl>>` and `ArrayBase<CowRepr<'_, f32>, Dim<IxDynImpl>>` have similar names, but are actually distinct types
note: `ArrayBase<CowRepr<'_, f32>, Dim<IxDynImpl>>` is defined in crate `ndarray`
    --> /home/ubuntu/.cargo/registry/src/index.crates.io-6f17d22bba15001f/ndarray-0.15.6/src/lib.rs:1268:1
     |
1268 | pub struct ArrayBase<S, D>
     | ^^^^^^^^^^^^^^^^^^^^^^^^^^
note: `ArrayBase<CowRepr<'_, f32>, Dim<IxDynImpl>>` is defined in crate `ndarray`
    --> /home/ubuntu/.cargo/registry/src/index.crates.io-6f17d22bba15001f/ndarray-0.16.1/src/lib.rs:1280:1
     |
1280 | pub struct ArrayBase<S, D>
     | ^^^^^^^^^^^^^^^^^^^^^^^^^^
     = note: perhaps two different versions of crate `ndarray` are being used?

It can therefore be necessary to manually unify these dependencies. For example, if you specify the following dependencies

msgpack-numpy = "0.1.3"
ndarray = "0.15.6"

this will currently depend on both version 0.15.6 and 0.16.1 of ndarray by default even though 0.15.6 is within the range >=0.15, <0.17. To fix this, you can run

cargo update --package ndarray:0.16.1 --precise 0.15.6

to achieve a single dependency on version 0.15.6 of ndarray. Check your lock file to verify that this worked.

License

This project is licensed under the MIT license.
