#parquet #cli #tpchgen #tpchgen-cli

app tpchgen-cli

Blazing fast pure Rust TPC-H data generator command line tool

4 releases (2 stable)

new 1.1.0 Apr 29, 2025
1.0.0 Apr 12, 2025
0.1.1 Apr 5, 2025
0.1.0 Mar 30, 2025

#61 in #parquet

382 downloads per month

Apache-2.0

3.5MB
5.5K SLoC

TPC-H Data Generator CLI

See the main README.md for full documentation.

Installation

Install Using Python

Install this tool with Python:

pip install tpchgen-cli

Install Using Rust

Install Rust and this tool:

curl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs | sh
cargo install tpchgen-cli

CLI Usage

We tried to keep the tpchgen-cli experience as close to dbgen as possible, so that it can serve as a drop-in replacement.

$ tpchgen-cli -h
TPC-H Data Generator

Usage: tpchgen-cli [OPTIONS]

Options:
  -s, --scale-factor <SCALE_FACTOR>
          Scale factor to address (default: 1) [default: 1]
  -o, --output-dir <OUTPUT_DIR>
          Output directory for generated files (default: current directory) [default: .]
  -T, --tables <TABLES>
          Which tables to generate (default: all) [possible values: region, nation, supplier, customer, part, partsupp, orders, lineitem]
  -p, --parts <PARTS>
          Number of parts to generate (manual parallel generation) [default: 1]
      --part <PART>
          Which part to generate (1-based, only relevant if parts > 1) [default: 1]
  -f, --format <FORMAT>
          Output format: tbl, csv, parquet (default: tbl) [default: tbl] [possible values: tbl, csv, parquet]
  -n, --num-threads <NUM_THREADS>
          The number of threads for parallel generation, defaults to the number of CPUs [default: 8]
  -c, --parquet-compression <PARQUET_COMPRESSION>
          Parquet block compression format. Default is SNAPPY [default: SNAPPY]
  -v, --verbose
          Verbose output (default: false)
      --stdout
          Write the output to stdout instead of a file
  -h, --help
          Print help (see more with '--help')

For example, to generate a dataset with a scale factor of 1 (about 1 GB):

$ tpchgen-cli -s 1 --output-dir=/tmp/tpch
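
The `--parts` and `--part` options listed in the help output allow a dataset to be generated in independent pieces, for example on several machines. The sketch below only prints one command line per part, using flags taken from the help text. The values of `SCALE`, `PARTS`, and `OUT_DIR`, the helper names `part_cmd` and `gen_all_parts`, and the choice of one output directory per part are illustrative assumptions, not behavior documented by the tool.

```shell
# Illustrative values (assumptions); adjust to your needs.
SCALE=10
PARTS=4
OUT_DIR=/tmp/tpch-sf10

# Print the tpchgen-cli command line for one part (1-based index).
# Each part gets its own output directory so parts cannot overwrite each other.
part_cmd() {
  printf 'tpchgen-cli -s %s --parts %s --part %s -f parquet -o %s/part-%s\n' \
    "$SCALE" "$PARTS" "$1" "$OUT_DIR" "$1"
}

# Emit one command per part, from 1 to PARTS.
gen_all_parts() {
  i=1
  while [ "$i" -le "$PARTS" ]; do
    part_cmd "$i"
    i=$((i + 1))
  done
}

gen_all_parts
```

Piping the printed commands to `sh` would run them one after another. Whether tpchgen-cli creates missing output directories is not stated in the help text, so it may be safest to create them beforehand with `mkdir -p`.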

Dependencies

~43MB
~822K SLoC