6 releases (3 breaking)

0.4.0 Oct 10, 2024
0.3.0 Sep 19, 2024
0.2.2 Jun 24, 2024
0.2.1 May 23, 2024
0.1.0 Apr 25, 2024

#371 in Development tools

Download history 923/week @ 2024-09-23 255/week @ 2024-09-30 771/week @ 2024-10-07 876/week @ 2024-10-14 565/week @ 2024-10-21 1106/week @ 2024-10-28 873/week @ 2024-11-04 368/week @ 2024-11-11 759/week @ 2024-11-18 523/week @ 2024-11-25 427/week @ 2024-12-02 382/week @ 2024-12-09 448/week @ 2024-12-16 320/week @ 2024-12-23 417/week @ 2024-12-30 474/week @ 2025-01-06

1,670 downloads per month

Apache-2.0

61KB
983 lines

Python UDF for Apache Arrow

Crate Docs

Notice: Python 3.12 is required to run this library. If python3 is not 3.12, please set the environment variable PYO3_PYTHON=python3.12.

Add the following lines to your Cargo.toml:

[dependencies]
arrow-udf-python = "0.2"

Create a Runtime and define your Python functions in string form. Note that the function name must match the one you pass to add_function.

use arrow_udf_python::{CallMode, Runtime};

let mut runtime = Runtime::new().unwrap();
let python_code = r#"
def gcd(a: int, b: int) -> int:
    while b:
        a, b = b, a % b
    return a
"#;
let return_type = arrow_schema::DataType::Int32;
let mode = CallMode::ReturnNullOnNullInput;
runtime.add_function("gcd", return_type, mode, python_code).unwrap();

You can then call the python function on a RecordBatch:

let input: RecordBatch = ...;
let output: RecordBatch = runtime.call("gcd", &input).unwrap();

The python code will be run in an embedded CPython 3.12 interpreter, powered by PyO3.

See the example for more details.

Struct Type

If the function returns a struct type, you can return a class instance or a dictionary.

use arrow_schema::{DataType, Field};
use arrow_udf_python::{CallMode, Runtime};

let mut runtime = Runtime::new().unwrap();
let python_code = r#"
class KeyValue:
    def __init__(self, key, value):
        self.key = key
        self.value = value

def key_value(s: str):
    key, value = s.split('=')

    ## return a class instance
    return KeyValue(key, value)

    ## or return a dict
    return {"key": key, "value": value}
"#;
let return_type = DataType::Struct(
    vec![
        Field::new("key", DataType::Utf8, true),
        Field::new("value", DataType::Utf8, true),
    ]
    .into(),
);
let mode = CallMode::ReturnNullOnNullInput;
runtime.add_function("key_value", return_type, mode, python_code).unwrap();

Extension Type

This crate also supports the following Arrow extension types:

Extension Type Physical Type ARROW:extension:name Python Type
JSON String arrowudf.json any (parsed by json.loads)
Decimal String arrowudf.decimal decimal.Decimal
Pickle Binary arrowudf.pickle any (parsed by pickle.loads)

Pickle Type

When a field is pickle type, the data is stored in a binary array in serialized form.

# use arrow_schema::{Field, DataType};
# use arrow_array::BinaryArray;
let pickle_field = Field::new("pickle", DataType::Binary, true)
    .with_metadata([("ARROW:extension:name".into(), "arrowudf.pickle".into())].into());
let pickle_array = BinaryArray::from(vec![&b"xxxxx"[..]]);

Pickle type is useful for the state of aggregation functions when the state is complex.

Dependencies

~13–19MB
~264K SLoC