3 stable releases

new 44.0.0 Jan 22, 2025
43.0.0 Jan 22, 2025
42.2.0 Jan 22, 2025

#1974 in Database interfaces

Download history 317/week @ 2025-01-20

317 downloads per month

Apache-2.0

4MB
69K SLoC

DataFusion sqllogictest

DataFusion is an extensible query execution framework, written in Rust, that uses Apache Arrow as its in-memory format.

This crate is a submodule of DataFusion that contains an implementation of sqllogictest.

Overview

This crate uses sqllogictest-rs to parse and run .slt files in the test_files directory of this crate.

Testing setup

  1. rustup update stable DataFusion uses the latest stable release of rust
  2. git submodule init
  3. git submodule update

Running tests: TLDR Examples

# Run all tests
cargo test --test sqllogictests
# Run all tests, with debug logging enabled
RUST_LOG=debug cargo test --test sqllogictests
# Run only the tests in `information_schema.slt`
cargo test --test sqllogictests -- information_schema
# Automatically update ddl.slt with expected output
cargo test --test sqllogictests -- ddl --complete
# Run ddl.slt, printing debug logging to stdout
RUST_LOG=debug cargo test --test sqllogictests -- ddl

Cookbook: Adding Tests

  1. Add queries

Add the setup and queries you want to run to a .slt file (my_awesome_test.slt in this example) using the following format:

query
CREATE TABLE foo AS VALUES (1);

query
SELECT * from foo;
  1. Fill in expected answers with --complete mode

Running the following command will update my_awesome_test.slt with the expected output:

cargo test --test sqllogictests -- my_awesome_test --complete
  1. Verify the content

In the case above, my_awesome_test.slt will look something like

statement ok
CREATE TABLE foo AS VALUES (1);

query I
SELECT * from foo;
----
1

Assuming it looks good, check it in!

Cookbook: Testing for whitespace

The sqllogictest runner will automatically strip trailing whitespace, meaning it requires an additional effort to verify that trailing whitespace is correctly produced

For example, the following test can't distinguish between Andrew and Andrew (with trailing space):

query T
select substr('Andrew Lamb', 1, 7)
----
Andrew

To test trailing whitespace, project additional non-whitespace column on the right. For example, by selecting '|' after the column of interest, the test can distinguish between Andrew and Andrew :

# Note two spaces between `Andrew` and `|`
query TT
select substr('Andrew Lamb', 1, 7), '|'
----
Andrew  |

# Note only one space between `Andrew` and `|`
query TT
select substr('Andrew Lamb', 1, 6), '|'
----
Andrew |

Reference

Running tests: Validation Mode

In this mode, sqllogictests runs the statements and queries in a .slt file, comparing the expected output in the file to the output produced by that run.

For example, to run all tests suites in validation mode

cargo test --test sqllogictests

sqllogictests also supports cargo test style substring matches on file names to restrict which tests to run

# information_schema.slt matches due to substring matching `information`
cargo test --test sqllogictests -- information

Running tests: Postgres compatibility

Test files that start with prefix pg_compat_ verify compatibility with Postgres by running the same script files both with DataFusion and with Postgres

In order to run the sqllogictests running against a previously running Postgres instance, do:

PG_COMPAT=true PG_URI="postgresql://postgres@127.0.0.1/postgres" cargo test --features=postgres --test sqllogictests

The environment variables:

  1. PG_COMPAT instructs sqllogictest to run against Postgres (not DataFusion)
  2. PG_URI contains a libpq style connection string, whose format is described in the docs

One way to create a suitable a posgres container in docker is to use the Official Image with a command such as the following. Note the collation must be set to C otherwise ORDER BY will not match DataFusion and the tests will diff.

docker run \
  -p5432:5432 \
  -e POSTGRES_INITDB_ARGS="--encoding=UTF-8 --lc-collate=C --lc-ctype=C" \
  -e POSTGRES_HOST_AUTH_METHOD=trust \
  postgres

Running Tests: tpch

Test files in tpch directory runs against the TPCH data set (SF = 0.1), which must be generated before running. You can use following command to generate tpch data, assuming you are in the repository root:

mkdir -p datafusion/sqllogictest/test_files/tpch/data
docker run -it \
  -v "$(realpath datafusion/sqllogictest/test_files/tpch/data)":/data \
  ghcr.io/scalytics/tpch-docker:main -vf -s 0.1

Then you need to add INCLUDE_TPCH=true to run tpch tests:

INCLUDE_TPCH=true cargo test --test sqllogictests

Updating tests: Completion Mode

In test script completion mode, sqllogictests reads a prototype script and runs the statements and queries against the database engine. The output is a full script that is a copy of the prototype script with result inserted.

You can update the tests / generate expected output by passing the --complete argument.

# Update ddl.slt with output from running
cargo test --test sqllogictests -- ddl --complete

Running tests: scratchdir

The DataFusion sqllogictest runner automatically creates a directory named test_files/scratch/<filename>, creating it if needed and clearing any file contents if it exists.

For example, the test_files/copy.slt file should use scratch directory test_files/scratch/copy.

Tests that need to write temporary files should write (only) to this directory to ensure they do not interfere with others concurrently running tests.

.slt file format

sqllogictest was originally written for SQLite to verify the correctness of SQL queries against the SQLite engine. The format is designed engine-agnostic and can parse sqllogictest files (.slt), runs queries against an SQL engine and compares the output to the expected output.

Tests in the .slt file are a sequence of query records generally starting with CREATE statements to populate tables and then further queries to test the populated data.

Each .slt file runs in its own, isolated SessionContext, to make the test setup explicit and so they can run in parallel. Thus it important to keep the tests from having externally visible side effects (like writing to a global location such as /tmp/)

Query records follow the format:

# <test_name>
query <type_string> <sort_mode>
<sql_query>
----
<expected_result>
  • test_name: Uniquely identify the test name (DataFusion only)
  • type_string: A short string that specifies the number of result columns and the expected datatype of each result column. There is one character in the <type_string> for each result column. The characters codes are:
    • 'B' - Boolean,
    • 'D' - Datetime,
    • 'I' - Integer,
    • 'P' - timestamP,
    • 'R' - floating-point results,
    • 'T' - Text,
    • "?" - any other types
  • expected_result: In the results section, some values are converted according to some rules:
    • floating point values are rounded to the scale of "12",
    • NULL values are rendered as NULL,
    • empty strings are rendered as (empty),
    • boolean values are rendered as true/false,
    • this list can be not exhaustive, check the datafusion/sqllogictest/src/engines/conversion.rs for details.
  • sort_mode: If included, it must be one of nosort (default), rowsort, or valuesort. In nosort mode, the results appear in exactly the order in which they were received from the database engine. The nosort mode should only be used on queries that have an ORDER BY clause or which only have a single row of result, since otherwise the order of results is undefined and might vary from one database engine to another. The rowsort mode gathers all output from the database engine then sorts it by rows on the client side. Sort comparisons use sort_unstable on the rendered text representation of the values. Hence, "9" sorts after "10", not before. The valuesort mode works like rowsort except that it does not honor row groupings. Each individual result value is sorted on its own.

⚠️ It is encouraged to either apply order by, or use rowsort for queries without explicit order by clauses.

Example

# group_by_distinct
query TTI
SELECT a, b, COUNT(DISTINCT c) FROM my_table GROUP BY a, b ORDER BY a, b
----
foo bar 10
foo baz 5
foo     4
        3

Dependencies

~71MB
~1.5M SLoC