10 releases (4 breaking)

0.5.0 Oct 10, 2024
0.4.0 Sep 19, 2024
0.3.0 Apr 25, 2024
0.2.2 Apr 2, 2024
0.1.2 Jan 31, 2024

Apache-2.0

Rust UDF for Apache Arrow

Generate RecordBatch functions from scalar functions painlessly with the #[function] macro.

Usage

Add the following lines to your Cargo.toml:

[dependencies]
arrow-udf = "0.3"

Define your functions with the #[function] macro:

use arrow_udf::function;

#[function("gcd(int32, int32) -> int32", output = "eval_gcd")]
fn gcd(mut a: i32, mut b: i32) -> i32 {
    while b != 0 {
        (a, b) = (b, a % b);
    }
    a
}

The optional output parameter sets the name of the generated function. If it is omitted, the macro picks a name on its own, such as gcd_int32_int32_int32_eval.

You can then call the generated function on a RecordBatch:

let input: RecordBatch = ...;
let output: RecordBatch = eval_gcd(&input).unwrap();

Printed side by side, the input and output batches look like this:

 input     output
+----+----+-----+
| a  | b  | gcd |
+----+----+-----+
| 15 | 25 | 5   |
|    | 1  |     |
+----+----+-----+
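
The second row of the table shows that a NULL input yields a NULL output. A minimal sketch of that row-wise behavior, using only the standard library: columns are modeled as slices of Option<i32> with None standing in for NULL, and eval_gcd_sketch is a hypothetical helper, not the code that #[function] actually generates.

```rust
// Conceptual model only: NOT the code generated by #[function].
// A column is modeled as a slice of Option<i32>; None stands for NULL.

fn gcd(mut a: i32, mut b: i32) -> i32 {
    while b != 0 {
        (a, b) = (b, a % b);
    }
    a
}

/// Hypothetical helper: applies `gcd` row by row.
/// If either input in a row is NULL, the output for that row is NULL.
fn eval_gcd_sketch(a: &[Option<i32>], b: &[Option<i32>]) -> Vec<Option<i32>> {
    a.iter()
        .zip(b.iter())
        .map(|(x, y)| match (x, y) {
            (Some(x), Some(y)) => Some(gcd(*x, *y)),
            _ => None,
        })
        .collect()
}
```

Calling eval_gcd_sketch(&[Some(15), None], &[Some(25), Some(1)]) reproduces the gcd column of the table above.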

Fallible Functions

If your function returns a Result:

use arrow_udf::function;

#[function("div(int32, int32) -> int32", output = "eval_div")]
fn div(x: i32, y: i32) -> Result<i32, &'static str> {
    x.checked_div(y).ok_or("division by zero")
}

The output batch will contain an additional error column. For each row where the function returns Err, the output column is NULL and the error column holds the error message.

 input     output
+----+----+-----+------------------+
| a  | b  | div | error            |
+----+----+-----+------------------+
| 15 | 25 | 0   |                  |
| 5  | 0  |     | division by zero |
+----+----+-----+------------------+
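
The split into a value column and an error column can be sketched with the standard library alone. This is a conceptual model under the assumption that each row is evaluated independently; eval_div_sketch and its tuple return type are hypothetical and do not reflect how the generated eval_div is implemented.

```rust
// Conceptual model only: NOT the code generated by #[function].

fn div(x: i32, y: i32) -> Result<i32, &'static str> {
    x.checked_div(y).ok_or("division by zero")
}

/// Hypothetical helper: returns (value column, error column).
/// Ok rows fill the value column and leave the error column NULL;
/// Err rows do the opposite.
fn eval_div_sketch(xs: &[i32], ys: &[i32]) -> (Vec<Option<i32>>, Vec<Option<String>>) {
    let mut values = Vec::with_capacity(xs.len());
    let mut errors = Vec::with_capacity(xs.len());
    for (&x, &y) in xs.iter().zip(ys) {
        match div(x, y) {
            Ok(v) => {
                values.push(Some(v));
                errors.push(None);
            }
            Err(e) => {
                values.push(None);
                errors.push(Some(e.to_string()));
            }
        }
    }
    (values, errors)
}
```

With the inputs from the table, eval_div_sketch(&[15, 5], &[25, 0]) gives a value column of 0 and NULL and an error column of NULL and "division by zero".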

Struct Types

You can define a struct type by deriving the StructType trait:

use arrow_udf::types::StructType;

#[derive(StructType)]
struct Point {
    x: f64,
    y: f64,
}

Then you can use the struct type in function signatures:

use arrow_udf::function;

#[function("point(float64, float64) -> struct Point", output = "eval_point")]
fn point(x: f64, y: f64) -> Point {
    Point { x, y }
}

Currently, struct types are supported only as return types.
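
For intuition about what a struct return type means in a columnar batch: Arrow stores a struct column as one child column per field. The sketch below models that layout with plain vectors. It assumes the derived type maps to a struct column with fields x and y; PointColumns and to_columns are hypothetical names used only for illustration.

```rust
// Conceptual model only: plain Rust, no Arrow types involved.

#[derive(Debug, Clone, Copy, PartialEq)]
struct Point {
    x: f64,
    y: f64,
}

/// Hypothetical columnar form of a list of points:
/// one child column per struct field.
#[derive(Debug, Default, PartialEq)]
struct PointColumns {
    x: Vec<f64>,
    y: Vec<f64>,
}

/// Converts row-oriented points into the field-per-column layout.
fn to_columns(points: &[Point]) -> PointColumns {
    let mut cols = PointColumns::default();
    for p in points {
        cols.x.push(p.x);
        cols.y.push(p.y);
    }
    cols
}
```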

Function Registry

If you want to look up functions by signature, enable the global_registry feature:

[dependencies]
arrow-udf = { version = "0.3", features = ["global_registry"] }

Each function is registered in a global registry when it is defined. You can then look up functions in the REGISTRY:

use arrow_schema::{DataType, Field};
use arrow_udf::sig::REGISTRY;

// Look up gcd by name, argument types, and return type.
let int32 = Field::new("int32", DataType::Int32, true);
let sig = REGISTRY.get("gcd", &[int32.clone(), int32.clone()], &int32).expect("gcd function");
// `input` is a RecordBatch, as in the gcd example above.
let output = sig.function.as_scalar().unwrap()(&input).unwrap();
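
The idea of looking up a function by name plus argument and return types can be modeled as a map keyed by the full signature. The sketch below is a standard-library illustration only: Ty, BinaryInt32Fn, and RegistrySketch are hypothetical and say nothing about how the crate's REGISTRY is actually implemented.

```rust
use std::collections::HashMap;

// Conceptual model only: NOT the crate's REGISTRY.

/// Hypothetical stand-in for a data type.
#[derive(Clone, Debug, PartialEq, Eq, Hash)]
enum Ty {
    Int32,
    Float64,
}

/// Hypothetical function type; the sketch stores only (i32, i32) -> i32 functions.
type BinaryInt32Fn = fn(i32, i32) -> i32;

/// Maps (name, argument types, return type) to a function pointer.
#[derive(Default)]
struct RegistrySketch {
    functions: HashMap<(String, Vec<Ty>, Ty), BinaryInt32Fn>,
}

impl RegistrySketch {
    fn register(&mut self, name: &str, args: &[Ty], ret: Ty, f: BinaryInt32Fn) {
        self.functions.insert((name.to_string(), args.to_vec(), ret), f);
    }

    /// Returns the function only if name, argument types, and return type all match.
    fn get(&self, name: &str, args: &[Ty], ret: &Ty) -> Option<BinaryInt32Fn> {
        self.functions
            .get(&(name.to_string(), args.to_vec(), ret.clone()))
            .copied()
    }
}

fn gcd(mut a: i32, mut b: i32) -> i32 {
    while b != 0 {
        (a, b) = (b, a % b);
    }
    a
}
```

After registering gcd under (Int32, Int32) -> Int32, a lookup with the same signature returns the function, while a lookup with Float64 types returns None.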

See the example and the documentation for the #[function] macro for more details.

See also the blog post: Simplifying SQL Function Implementation with Rust Procedural Macro.
