TPC-DS Extension from duckdb via archive.is

TPC-H and TPC-DS are the most widely used big data benchmarks maintained by the TPC. However, the TPC data generators are not open-source, require an email submission to download, and are not actively maintained. The code targets older compilers (GCC 9.x) and often fails to compile on newer GCC versions. Consequently, generating data for TPC-H and TPC-DS tests is often a frustrating challenge.

But hey, DuckDB to the rescue! No need to build or search for documentation—just use DuckDB instead! In this post, I will demonstrate how to use DuckDB to generate TPC test data and export it as Parquet files for loading.


Install

DuckDB is packaged for many Linux distributions. You can also install it using pip install duckdb, though note that this gives you the Python package rather than the duckdb command-line shell used below. On my Arch Linux system, I use paru -S duckdb to install it.

Generate

Start DuckDB by running duckdb. No setup, no configuration; it just works, like SQLite!

If you want to store data on disk instead of just in memory, run duckdb /path/to/duckdb.data

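The tpch and tpcds extensions are either bundled with the DuckDB build or autoloaded on first use, depending on the build. If yours has neither, running INSTALL tpcds; LOAD tpcds; (and likewise for tpch) should make them available. After that we can directly use the functions they provide, such as: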

for TPC-H:

CALL dbgen(sf = 1);

for TPC-DS:

CALL dsdgen(sf = 1);

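Use sf (the scale factor) to control the size: sf = 1 corresponds to roughly 1 GB of raw data, and larger values scale up from there.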
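Before committing to a large scale factor, it can help to try the whole thing at a tiny scale first. The following is a minimal sketch, assuming a fresh (empty) database; sf = 0.01 is just an arbitrary small value I picked for illustration, store_sales is one of the standard TPC-DS tables, and the INSTALL and LOAD lines should be harmless if the extension is already available in your build.

```sql
-- Explicitly fetch and load the extension (assumed to be a no-op if already present).
INSTALL tpcds;
LOAD tpcds;

-- Generate a small data set first; sf = 0.01 is an arbitrary small value.
CALL dsdgen(sf = 0.01);

-- List the generated tables and check that one of them actually has rows.
SHOW TABLES;
SELECT count(*) AS row_count FROM store_sales;

-- Run TPC-DS query 1 against the generated data.
PRAGMA tpcds(1);
```

If this looks right, start over in a new database with the sf you actually want, since the tables from the small run will already exist in this one.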

Export

After the test data has been generated, we can use DuckDB's native EXPORT DATABASE statement to write all generated tables out as Parquet:

EXPORT DATABASE 'tpcds_parquet' (FORMAT PARQUET);
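To check that the export worked, the files can be read back. This is a sketch under a few assumptions: that it runs in a fresh database started from the directory containing tpcds_parquet, and that the export wrote one Parquet file per table, named after the table (for example store_sales.parquet), next to schema.sql and load.sql scripts. If your DuckDB version lays things out differently, look at the directory listing and adjust the paths.

```sql
-- Query one exported file directly, without loading it into a table.
SELECT count(*) AS row_count
FROM read_parquet('tpcds_parquet/store_sales.parquet');

-- Recreate all tables in this (empty) database from the export directory.
IMPORT DATABASE 'tpcds_parquet';

-- To write out a single table instead of the whole database, COPY also works.
COPY store_sales TO 'store_sales.parquet' (FORMAT PARQUET);
```

The per-table COPY form is handy when the target system only needs a few of the tables, or when you want to control the output file names yourself.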