11 unstable releases (3 breaking)
0.4.0 | May 2, 2024 |
---|---|
0.3.0 | Apr 23, 2024 |
0.2.0 | Sep 27, 2022 |
0.1.7 | Jun 10, 2022 |
0.1.0 | May 30, 2022 |
#1877 in Parser implementations
63KB
1K
SLoC
DataFusion-CatalogProvider-Glue
Glue as a CatalogProvider for Datafusion.
Output from demo example:
mbpro16(timvw)➜ datafusion-catalogprovider-glue git:(main) ✗ cargo run --example=demo
Compiling datafusion-catalogprovider-glue v0.1.0 (/Users/timvw/src/github/datafusion-catalogprovider-glue)
Finished dev [unoptimized + debuginfo] target(s) in 7.43s
Running `target/debug/examples/demo`
registering tpc-h-parquet-1.customer
+---------------+--------------------+------------+------------+
| table_catalog | table_schema | table_name | table_type |
+---------------+--------------------+------------+------------+
| glue | tpc-h-parquet-1 | customer | BASE TABLE |
| glue | information_schema | tables | VIEW |
| glue | information_schema | columns | VIEW |
| datafusion | information_schema | tables | VIEW |
| datafusion | information_schema | columns | VIEW |
+---------------+--------------------+------------+------------+
+---------------+-----------------+------------+--------------+------------------+----------------+-------------+-----------+--------------------------+------------------------+-------------------+-------------------------+---------------+--------------------+---------------+
| table_catalog | table_schema | table_name | column_name | ordinal_position | column_default | is_nullable | data_type | character_maximum_length | character_octet_length | numeric_precision | numeric_precision_radix | numeric_scale | datetime_precision | interval_type |
+---------------+-----------------+------------+--------------+------------------+----------------+-------------+-----------+--------------------------+------------------------+-------------------+-------------------------+---------------+--------------------+---------------+
| glue | tpc-h-parquet-1 | customer | c_custkey | 0 | | NO | Int64 | | | | | | | |
| glue | tpc-h-parquet-1 | customer | c_name | 1 | | YES | Utf8 | | 2147483647 | | | | | |
| glue | tpc-h-parquet-1 | customer | c_address | 2 | | YES | Utf8 | | 2147483647 | | | | | |
| glue | tpc-h-parquet-1 | customer | c_nationkey | 3 | | NO | Int64 | | | | | | | |
| glue | tpc-h-parquet-1 | customer | c_phone | 4 | | YES | Utf8 | | 2147483647 | | | | | |
| glue | tpc-h-parquet-1 | customer | c_acctbal | 5 | | NO | Float64 | | | 24 | 2 | | | |
| glue | tpc-h-parquet-1 | customer | c_mktsegment | 6 | | YES | Utf8 | | 2147483647 | | | | | |
| glue | tpc-h-parquet-1 | customer | c_comment | 7 | | YES | Utf8 | | 2147483647 | | | | | |
+---------------+-----------------+------------+--------------+------------------+----------------+-------------+-----------+--------------------------+------------------------+-------------------+-------------------------+---------------+--------------------+---------------+
+-----------+--------------------+---------------------------------------+-------------+-----------------+-----------+--------------+-------------------------------------------------------------------------------------------------------------------+
| c_custkey | c_name | c_address | c_nationkey | c_phone | c_acctbal | c_mktsegment | c_comment |
+-----------+--------------------+---------------------------------------+-------------+-----------------+-----------+--------------+-------------------------------------------------------------------------------------------------------------------+
| 1 | Customer#000000001 | IVhzIApeRb ot,c,E | 15 | 25-989-741-2988 | 711.56 | BUILDING | to the even, regular platelets. regular, ironic epitaphs nag e |
| 2 | Customer#000000002 | XSTf4,NCwDVaWNe6tEgvwfmRchLXak | 13 | 23-768-687-3665 | 121.65 | AUTOMOBILE | l accounts. blithely ironic theodolites integrate boldly: caref |
| 3 | Customer#000000003 | MG9kdTD2WBHm | 1 | 11-719-748-3364 | 7498.12 | AUTOMOBILE | deposits eat slyly ironic, even instructions. express foxes detect slyly. blithely even accounts abov |
| 4 | Customer#000000004 | XxVSJsLAGtn | 4 | 14-128-190-5944 | 2866.83 | MACHINERY | requests. final, regular ideas sleep final accou |
| 5 | Customer#000000005 | KvpyuHCplrB84WgAiGV6sYpZq7Tj | 3 | 13-750-942-6364 | 794.47 | HOUSEHOLD | n accounts will have to unwind. foxes cajole accor |
| 6 | Customer#000000006 | sKZz0CsnMD7mp4Xd0YrBvx,LREYKUWAh yVn | 20 | 30-114-968-4951 | 7638.57 | AUTOMOBILE | tions. even deposits boost according to the slyly bold packages. final accounts cajole requests. furious |
| 7 | Customer#000000007 | TcGe5gaZNgVePxU5kRrvXBfkasDTea | 18 | 28-190-982-9759 | 9561.95 | AUTOMOBILE | ainst the ironic, express theodolites. express, even pinto beans among the exp |
| 8 | Customer#000000008 | I0B10bB0AymmC, 0PrRYBCP1yGJ8xcBPmWhl5 | 17 | 27-147-574-9335 | 6819.74 | BUILDING | among the slyly regular theodolites kindle blithely courts. carefully even theodolites haggle slyly along the ide |
| 9 | Customer#000000009 | xKiAFTjUsCuxfeleNqefumTrjS | 8 | 18-338-906-3675 | 8324.07 | FURNITURE | r theodolites according to the requests wake thinly excuses: pending requests haggle furiousl |
| 10 | Customer#000000010 | 6LrEaV6KR6PLVcgl2ArL Q3rqzLzcT1 v2 | 5 | 15-741-346-9870 | 2753.54 | HOUSEHOLD | es regular deposits haggle. fur |
+-----------+--------------------+---------------------------------------+-------------+-----------------+-----------+--------------+-------------------------------------------------------------------------------------------------------------------+
Known issues
- Not possible to infer whether a column is nullable or not. Currently we default to true.
failed to sample datafusion.parquet_testing_datapage_v2_snappy_parquet due to Arrow error: External error: Execution error: Failed to map column projection for field e. Incompatible data types List(Field { name: "element", data_type: Int32, nullable: false, dict_id: 0, dict_is_ordered: false, metadata: None }) and List(Field { name: "element", data_type: Int32, nullable: true, dict_id: 0, dict_is_ordered: false, metadata: None })
- Sometimes datafusion infers a column to be of type Time32(Millisecond) instead of Timestamp(Nanosecond, None)
failed to sample datafusion.parquet_testing_encrypt_columns_plaintext_footer_parquet_encrypted due to Arrow error: External error: Execution error: Failed to map column projection for field int32_field. Incompatible data types Time32(Millisecond) and Timestamp(Nanosecond, None)
- Not possible to infer that a column is empty (and then gets data_type: Null)
failed to sample datafusion.parquet_testing_null_list_parquet due to Arrow error: External error: Execution error: Failed to map column projection for field emptylist. Incompatible data types List(Field { name: "item", data_type: Null, nullable: true, dict_id: 0, dict_is_ordered: false, metadata: None }) and List(Field { name: "element", data_type: Int32, nullable: true, dict_id: 0, dict_is_ordered: false, metadata: None })
- Not possible to infer that a column is unsigned or signed
failed to sample datafusion.parquet_testing_nested_structs_rust_parquet due to Arrow error: External error: Execution error: Failed to map column projection for field roll_num. Incompatible data types
Development
Standard rust toolchain, eg:
cargo build
cargo test
Run all linting:
./dev/rust_lint.sh
Testing
First clone the test data repository:
git submodule update --init --recursive .
When this does not work:
git submodule add -f https://github.com/apache/parquet-testing.git parquet-testing
git submodule add -f https://github.com/apache/arrow-testing testing
Upload the testdata:
aws s3api create-bucket \
--bucket datafusion-testing \
--region eu-central-1 \
--create-bucket-configuration LocationConstraint=eu-central-1
find testing -type f -exec aws s3 cp ./{} s3://datafusion-{} \;
aws s3api create-bucket \
--bucket datafusion-parquet-testing \
--region eu-central-1 \
--create-bucket-configuration LocationConstraint=eu-central-1
find parquet-testing -type f -exec aws s3 cp ./{} s3://datafusion-{} \;
Create Glue databases:
aws glue create-database \
--database-input "{\"Name\":\"datafusion\"}"
Dependencies
~120MB
~2.5M SLoC