20 major breaking releases

50.0.0 Jan 12, 2024
49.0.0 Nov 13, 2023
48.0.1 Nov 13, 2023
47.0.0 Sep 22, 2023
30.0.1 Jan 8, 2023

#123 in Data structures

Download history 37792/week @ 2023-11-06 45652/week @ 2023-11-13 34775/week @ 2023-11-20 45615/week @ 2023-11-27 40319/week @ 2023-12-04 41343/week @ 2023-12-11 42856/week @ 2023-12-18 34261/week @ 2023-12-25 46073/week @ 2024-01-01 57083/week @ 2024-01-08 57114/week @ 2024-01-15 69797/week @ 2024-01-22 84384/week @ 2024-01-29 89036/week @ 2024-02-05 96363/week @ 2024-02-12 79170/week @ 2024-02-19

353,018 downloads per month
Used in 212 crates (5 directly)

Apache-2.0

1.5MB
33K SLoC

A comparable row-oriented representation of a collection of Array.

[Row]s are normalized for sorting, and can therefore be very efficiently compared, using memcmp under the hood, or used in non-comparison sorts such as radix sort. This makes the row format ideal for implementing efficient multi-column sorting, grouping, aggregation, windowing and more, as described in more detail in this blog post.

For example, given three input Array, RowConverter creates byte sequences that compare the same as when using lexsort.

   ┌─────┐   ┌─────┐   ┌─────┐
   │     │   │     │   │     │
   ├─────┤ ┌ ┼─────┼ ─ ┼─────┼ ┐              ┏━━━━━━━━━━━━━┓
   │     │   │     │   │     │  ─────────────▶┃             ┃
   ├─────┤ └ ┼─────┼ ─ ┼─────┼ ┘              ┗━━━━━━━━━━━━━┛
   │     │   │     │   │     │
   └─────┘   └─────┘   └─────┘
               ...
   ┌─────┐ ┌ ┬─────┬ ─ ┬─────┬ ┐              ┏━━━━━━━━┓
   │     │   │     │   │     │  ─────────────▶┃        ┃
   └─────┘ └ ┴─────┴ ─ ┴─────┴ ┘              ┗━━━━━━━━┛
    UInt64      Utf8     F64

          Input Arrays                          Row Format
    (Columns)

Rows must be generated by the same RowConverter for the comparison to be meaningful.

Basic Example


let a1 = Arc::new(Int32Array::from_iter_values([-1, -1, 0, 3, 3])) as ArrayRef;
let a2 = Arc::new(StringArray::from_iter_values(["a", "b", "c", "d", "d"])) as ArrayRef;
let arrays = vec![a1, a2];

// Convert arrays to rows
let converter = RowConverter::new(vec![
    SortField::new(DataType::Int32),
    SortField::new(DataType::Utf8),
]).unwrap();
let rows = converter.convert_columns(&arrays).unwrap();

// Compare rows
for i in 0..4 {
    assert!(rows.row(i) <= rows.row(i + 1));
}
assert_eq!(rows.row(3), rows.row(4));

// Convert rows back to arrays
let converted = converter.convert_rows(&rows).unwrap();
assert_eq!(arrays, converted);

// Compare rows from different arrays
let a1 = Arc::new(Int32Array::from_iter_values([3, 4])) as ArrayRef;
let a2 = Arc::new(StringArray::from_iter_values(["e", "f"])) as ArrayRef;
let arrays = vec![a1, a2];
let rows2 = converter.convert_columns(&arrays).unwrap();

assert!(rows.row(4) < rows2.row(0));
assert!(rows.row(4) < rows2.row(1));

// Convert selection of rows back to arrays
let selection = [rows.row(0), rows2.row(1), rows.row(2), rows2.row(0)];
let converted = converter.convert_rows(selection).unwrap();
let c1 = converted[0].as_primitive::<Int32Type>();
assert_eq!(c1.values(), &[-1, 4, 0, 3]);

let c2 = converted[1].as_string::<i32>();
let c2_values: Vec<_> = c2.iter().flatten().collect();
assert_eq!(&c2_values, &["a", "f", "c", "e"]);

Lexsort

The row format can also be used to implement a fast multi-column / lexicographic sort

fn lexsort_to_indices(arrays: &[ArrayRef]) -> UInt32Array {
    let fields = arrays
        .iter()
        .map(|a| SortField::new(a.data_type().clone()))
        .collect();
    let converter = RowConverter::new(fields).unwrap();
    let rows = converter.convert_columns(arrays).unwrap();
    let mut sort: Vec<_> = rows.iter().enumerate().collect();
    sort.sort_unstable_by(|(_, a), (_, b)| a.cmp(b));
    UInt32Array::from_iter_values(sort.iter().map(|(i, _)| *i as u32))
}

Dependencies

~2.9–9.5MB
~70K SLoC