diff --git a/benchmarks/README.md b/benchmarks/README.md
index b7a3879b22..af644746b1 100644
--- a/benchmarks/README.md
+++ b/benchmarks/README.md
@@ -29,11 +29,30 @@ These benchmarks are derived from the [TPC-H][1] benchmark.
 
 ## Generating Test Data
 
-TPC-H data can be generated using the `tpch-gen.sh` script, which creates a Docker image containing the TPC-DS data
-generator.
+TPC-H data can be generated using [tpchgen-rs](https://github.com/clflushopt/tpchgen-rs), a fast TPC-H data generator written in Rust.
 
+### Installation
+
+Install via pip:
+```bash
+pip install tpchgen-cli
+```
+
+Or via cargo:
+```bash
+cargo install tpchgen-cli
+```
+
+### Generating Data
+
+Generate SF=1 data in Parquet format:
+```bash
+tpchgen-cli -s 1 --format parquet --output-dir data
+```
+
+For larger scale factors (e.g., SF=10):
 ```bash
-./tpch-gen.sh
+tpchgen-cli -s 10 --format parquet --output-dir data
 ```
 
 Data will be generated into the `data` subdirectory and will not be checked in because this directory has been added
diff --git a/benchmarks/tpch-gen.sh b/benchmarks/tpch-gen.sh
deleted file mode 100755
index ee3d143f6e..0000000000
--- a/benchmarks/tpch-gen.sh
+++ /dev/null
@@ -1,41 +0,0 @@
-#!/bin/bash
-# Licensed to the Apache Software Foundation (ASF) under one
-# or more contributor license agreements.  See the NOTICE file
-# distributed with this work for additional information
-# regarding copyright ownership.  The ASF licenses this file
-# to you under the Apache License, Version 2.0 (the
-# "License"); you may not use this file except in compliance
-# with the License.  You may obtain a copy of the License at
-#
-#   http://www.apache.org/licenses/LICENSE-2.0
-#
-# Unless required by applicable law or agreed to in writing,
-# software distributed under the License is distributed on an
-# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
-# KIND, either express or implied.  See the License for the
-# specific language governing permissions and limitations
-# under the License.
-
-mkdir -p data/answers 2>/dev/null
-
-set -e
-
-pushd ..
-. ./dev/build-set-env.sh
-popd
-
-# Generate data into the ./data directory if it does not already exist
-FILE=./data/supplier.tbl
-if test -f "$FILE"; then
-    echo "$FILE exists."
-else
-    docker run -v `pwd`/data:/data -it --rm ghcr.io/scalytics/tpch-docker:main -vf -s 1
-fi
-
-# Copy expected answers into the ./data/answers directory if it does not already exist
-FILE=./data/answers/q1.out
-if test -f "$FILE"; then
-    echo "$FILE exists."
-else
-    docker run -v `pwd`/data:/data -it --entrypoint /bin/bash --rm ghcr.io/scalytics/tpch-docker:main -c "cp /opt/tpch/2.18.0_rc2/dbgen/answers/* /data/answers/"
-fi
\ No newline at end of file