3 releases (1 stable)
1.0.0 | Apr 12, 2025 |
---|---|
0.1.1 | Apr 5, 2025 |
0.1.0 | Mar 30, 2025 |
tpchgen-rs
Blazing fast TPCH benchmark data generator, in pure Rust with zero dependencies.
Features
- Blazing Speed 🚀
- Obsessively Tested 📋
- Fully parallel, streaming, constant memory usage 🧠
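To make the "fully parallel, streaming, constant memory" bullet concrete, here is a minimal std-only sketch of the general technique. It is not the tpchgen implementation: `Row`, `partition_rows`, `generate_partition`, and `generate_parallel` are hypothetical names, and the "random" column is a deterministic stand-in. The idea is that each row is derived from its index alone, so the row range can be split into independent partitions, and each partition is streamed through one reusable buffer.

```rust
use std::fmt::Write as _;
use std::thread;

/// Hypothetical row type for illustration; not a real tpchgen struct.
struct Row {
    key: u64,
    quantity: u64,
}

/// Lazily yields the rows of one partition. Each row depends only on its
/// index, so partitions can be produced independently and in any order.
fn partition_rows(total_rows: u64, part: u64, num_parts: u64) -> impl Iterator<Item = Row> {
    let start = total_rows * part / num_parts;
    let end = total_rows * (part + 1) / num_parts;
    (start..end).map(|i| Row {
        key: i + 1,
        // Cheap deterministic stand-in for a seeded random value in 1..=50.
        quantity: (i.wrapping_mul(2_654_435_761) >> 7) % 50 + 1,
    })
}

/// Streams one partition through a single reusable buffer and returns
/// (rows, bytes) produced, so memory use does not grow with the row count.
fn generate_partition(total_rows: u64, part: u64, num_parts: u64) -> (u64, u64) {
    let mut line = String::new();
    let (mut rows, mut bytes) = (0u64, 0u64);
    for row in partition_rows(total_rows, part, num_parts) {
        line.clear();
        writeln!(line, "{}|{}|", row.key, row.quantity).unwrap();
        // A real tool would write `line` to a file or socket here.
        rows += 1;
        bytes += line.len() as u64;
    }
    (rows, bytes)
}

/// Runs every partition on its own thread and sums the per-partition totals.
fn generate_parallel(total_rows: u64, num_parts: u64) -> (u64, u64) {
    thread::scope(|s| {
        let handles: Vec<_> = (0..num_parts)
            .map(|p| s.spawn(move || generate_partition(total_rows, p, num_parts)))
            .collect();
        handles
            .into_iter()
            .map(|h| h.join().unwrap())
            .fold((0, 0), |acc, x| (acc.0 + x.0, acc.1 + x.1))
    })
}

fn main() {
    let (rows, bytes) = generate_parallel(1_000_000, 8);
    println!("generated {rows} rows, {bytes} bytes");
}
```

Because rows are a pure function of their index, the totals are the same whether the work is split into 1 partition or 8; only the wall-clock time changes.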
Try now!
First install Rust and this tool:
curl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs | sh
cargo install tpchgen-cli
# create Scale Factor 10 (3.6GB, 8 files, 60M rows in lineitem) in 5 seconds on a modern laptop
tpchgen-cli -s 10 --format=parquet
Or watch this awesome demo recorded by @alamb and read the companion blog post on the DataFusion blog.
Performance
Scale Factor | tpchgen-cli | DuckDB | DuckDB (proprietary) |
---|---|---|---|
1 | 0:02.24 | 0:12.29 | 0:10.68 |
10 | 0:09.97 | 1:46.80 | 1:41.14 |
100 | 1:14.22 | 17:48.27 | 16:40.88 |
1000 | 10:26.26 | N/A (OOM) | N/A (OOM) |
- DuckDB (proprietary) is the time required to create TPCH data using the proprietary DuckDB format
- Creating Scale Factor 1000 data in DuckDB requires 647 GB of memory, which is why no DuckDB time is listed for it in the table above.
Times to create TPCH tables in Parquet format using tpchgen-cli and duckdb for various scale factors.
tpchgen-cli is more than 10x faster than the next fastest TPCH generator we know of. On a 2023 Mac M3 Max laptop, it easily generates data faster than can be written to SSD. See BENCHMARKS.md for more details on performance and benchmarking.
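As a worked check of the speedup figures, the sketch below converts the `m:ss.ss` timings from the table into seconds and divides the DuckDB Parquet time by the tpchgen-cli time. It assumes DuckDB is the comparison point; `parse_elapsed` and `speedup` are hypothetical helper names and are not part of any tpchgen crate.

```rust
/// Parses an `m:ss.ss` elapsed time (the format used in the table) into seconds.
fn parse_elapsed(s: &str) -> Option<f64> {
    let (minutes, seconds) = s.split_once(':')?;
    let minutes: f64 = minutes.trim().parse().ok()?;
    let seconds: f64 = seconds.trim().parse().ok()?;
    Some(minutes * 60.0 + seconds)
}

/// Ratio of the baseline time to the tpchgen-cli time, or `None` when either
/// entry is not a time (for example "N/A (OOM)").
fn speedup(tpchgen: &str, baseline: &str) -> Option<f64> {
    Some(parse_elapsed(baseline)? / parse_elapsed(tpchgen)?)
}

fn main() {
    // (scale factor, tpchgen-cli, DuckDB) copied from the Parquet columns of the table.
    let rows = [
        (1, "0:02.24", "0:12.29"),
        (10, "0:09.97", "1:46.80"),
        (100, "1:14.22", "17:48.27"),
        (1000, "10:26.26", "N/A (OOM)"),
    ];
    for (sf, ours, duckdb) in rows {
        match speedup(ours, duckdb) {
            Some(x) => println!("SF {sf}: {x:.1}x"),
            None => println!("SF {sf}: no baseline"),
        }
    }
}
```

With the table's numbers this gives roughly 5.5x at scale factor 1, 10.7x at scale factor 10, and 14.4x at scale factor 100, so the ratio grows with the scale factor.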
Testing
This crate has extensive tests to ensure correctness. We compare the output of this crate with the original dbgen implementation as part of every check-in.
See TESTING.md for more details.
Crates
- tpchgen is the library that implements the data generation logic for TPCH. It can be used to embed data generation natively in Rust.
- tpchgen-arrow is a library for generating in-memory Apache Arrow record batches for each of the TPCH tables.
- tpchgen-cli is a dbgen-compatible CLI tool that generates tables from the TPCH benchmark dataset.
Contributing
Pull requests are welcome. For major changes, please open an issue first for discussion. See our contributors guide for more details.
Architecture
Please see the architecture guide for details on how the code is structured.
License
The project is licensed under the Apache 2.0 license.