27 major breaking releases
51.0.0 | Mar 18, 2024 |
---|---|
50.0.0 | Jan 12, 2024 |
49.0.0 | Nov 13, 2023 |
48.0.1 | Nov 13, 2023 |
24.0.0 | Oct 3, 2022 |
#170 in Rust patterns
654,653 downloads per month
Used in 289 crates
(67 directly)
1.5MB
32K
SLoC
The central type in Apache Arrow are arrays, which are a known-length sequence of values
all having the same type. This crate provides concrete implementations of each type, as
well as an Array
trait that can be used for type-erasure.
Building an Array
Most Array
implementations can be constructed directly from iterators or [Vec
]
#
Int32Array::from(vec![1, 2]);
Int32Array::from(vec![Some(1), None]);
Int32Array::from_iter([1, 2, 3, 4]);
Int32Array::from_iter([Some(1), Some(2), None, Some(4)]);
StringArray::from(vec!["foo", "bar"]);
StringArray::from(vec![Some("foo"), None]);
StringArray::from_iter([Some("foo"), None]);
StringArray::from_iter_values(["foo", "bar"]);
ListArray::from_iter_primitive::<Int32Type, _, _>([
Some(vec![Some(1), None, Some(3)]),
None,
Some(vec![])
]);
Additionally ArrayBuilder
implementations can be
used to construct arrays with a push-based interface
#
// Create a new builder with a capacity of 100
let mut builder = Int16Array::builder(100);
// Append a single primitive value
builder.append_value(1);
// Append a null value
builder.append_null();
// Append a slice of primitive values
builder.append_slice(&[2, 3, 4]);
// Build the array
let array = builder.finish();
assert_eq!(5, array.len());
assert_eq!(2, array.value(2));
assert_eq!(&array.values()[3..5], &[3, 4])
Low-level API
Internally, arrays consist of one or more shared memory regions backed by a Buffer
,
the number and meaning of which depend on the array’s data type, as documented in
the Arrow specification.
For example, the type Int16Array
represents an array of 16-bit integers and consists of:
- An optional
NullBuffer
identifying any null values - A contiguous
ScalarBuffer<i16>
of values
Similarly, the type StringArray
represents an array of UTF-8 strings and consists of:
- An optional
NullBuffer
identifying any null values - An offsets
OffsetBuffer<i32>
identifying valid UTF-8 sequences within the values buffer - A values
Buffer
of UTF-8 encoded string data
Array constructors such as PrimitiveArray::try_new
provide the ability to cheaply
construct an array from these parts, with functions such as PrimitiveArray::into_parts
providing the reverse operation.
#
// Create a Int32Array from Vec without copying
let array = Int32Array::new(vec![1, 2, 3].into(), None);
assert_eq!(array.values(), &[1, 2, 3]);
assert_eq!(array.null_count(), 0);
// Create a StringArray from parts
let offsets = OffsetBuffer::new(vec![0, 5, 10].into());
let array = StringArray::new(offsets, b"helloworld".into(), None);
let values: Vec<_> = array.iter().map(|x| x.unwrap()).collect();
assert_eq!(values, &["hello", "world"]);
As Buffer
, and its derivatives, can be created from [Vec
] without copying, this provides
an efficient way to not only interoperate with other Rust code, but also implement kernels
optimised for the arrow data layout - e.g. by handling buffers instead of values.
Zero-Copy Slicing
Given an Array
of arbitrary length, it is possible to create an owned slice of this
data. Internally this just increments some ref-counts, and so is incredibly cheap
let array = Int32Array::from_iter([1, 2, 3]);
// Slice with offset 1 and length 2
let sliced = array.slice(1, 2);
assert_eq!(sliced.values(), &[2, 3]);
Downcasting an Array
Arrays are often passed around as a dynamically typed &dyn Array
or ArrayRef
.
For example, RecordBatch
stores columns as ArrayRef
.
Whilst these arrays can be passed directly to the compute
, csv
, json
, etc... APIs,
it is often the case that you wish to interact with the concrete arrays directly.
This requires downcasting to the concrete type of the array:
// Safely downcast an `Array` to an `Int32Array` and compute the sum
// using native i32 values
fn sum_int32(array: &dyn Array) -> i32 {
let integers: &Int32Array = array.as_any().downcast_ref().unwrap();
integers.iter().map(|val| val.unwrap_or_default()).sum()
}
// Safely downcasts the array to a `Float32Array` and returns a &[f32] view of the data
// Note: the values for positions corresponding to nulls will be arbitrary (but still valid f32)
fn as_f32_slice(array: &dyn Array) -> &[f32] {
array.as_any().downcast_ref::<Float32Array>().unwrap().values()
}
The cast::AsArray
extension trait can make this more ergonomic
fn as_f32_slice(array: &dyn Array) -> &[f32] {
array.as_primitive::<Float32Type>().values()
}
Dependencies
~2.9–9MB
~71K SLoC