Automatically download tpcds benchmark data to the right place #19244
Changes from 2 commits
448d2df
8adbf6f
0462ee2
714d721
d1751e5
b74f18a
| Original file line number | Diff line number | Diff line change |
|---|---|---|
|
|
@@ -631,20 +631,22 @@ data_tpch() { | |
|
|
||
| # Points to TPCDS data generation instructions | ||
| data_tpcds() { | ||
| TPCDS_DIR="${DATA_DIR}" | ||
|
|
||
| # Check if TPCDS data directory exists | ||
| if [ ! -d "${TPCDS_DIR}" ]; then | ||
| echo "" | ||
| echo "For TPC-DS data generation, please clone the datafusion-benchmarks repository:" | ||
| echo " git clone https://github.com/apache/datafusion-benchmarks" | ||
| echo "" | ||
| return 1 | ||
| TPCDS_DIR="${DATA_DIR}/tpcds_sf1" | ||
|
Member
Move this to line 43?
Contributor
Author
This is a good idea and it would avoid duplication. However, none of the other datasets follow this pattern (they all duplicate the paths), so in this case I would prefer to keep the code consistent. We can refactor the common locations into variables in a follow-on PR if we want. |
||
|
|
||
| # Check if `web_site.parquet` exists in the TPCDS data directory to verify data presence | ||
| echo "Checking TPC-DS data directory: ${TPCDS_DIR}" | ||
| if [ ! -f "${TPCDS_DIR}/web_site.parquet" ]; then | ||
|
Member
Extract a variable for
Contributor
Author
Same as above. |
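The follow-on refactor mentioned in these two threads could look roughly like the sketch below. It is only an illustration: the helper names `tpcds_dir`, `tpcds_marker_file`, and `tpcds_data_present` are hypothetical and do not appear in the PR; only the `tpcds_sf1` directory name and the `web_site.parquet` marker file come from the diff.

```shell
# Hypothetical refactor sketch; the helper names are illustrative, not from the PR.

# Prints the TPC-DS SF1 data directory for a given base data directory.
tpcds_dir() {
    printf '%s\n' "${1}/tpcds_sf1"
}

# Prints the path of the representative file used to detect existing data.
tpcds_marker_file() {
    printf '%s\n' "$(tpcds_dir "${1}")/web_site.parquet"
}

# Succeeds only if the representative file exists under the given base directory.
tpcds_data_present() {
    [ -f "$(tpcds_marker_file "${1}")" ]
}
```

With helpers like these, both `data_tpcds` and `run_tpcds` could call `tpcds_data_present "${DATA_DIR}"` instead of repeating the path, at the cost of diverging from how the other datasets in the script are written.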
||
| mkdir -p "${TPCDS_DIR}" | ||
| # Download the DataFusion benchmarks repository zip if it is not already downloaded | ||
| if [ ! -f "${DATA_DIR}/datafusion-benchmarks.zip" ]; then | ||
| echo "Downloading DataFusion benchmarks repository zip to: ${DATA_DIR}/datafusion-benchmarks.zip" | ||
| wget -O "${DATA_DIR}/datafusion-benchmarks.zip" https://github.com/apache/datafusion-benchmarks/archive/refs/heads/main.zip | ||
|
Member
nit: Using
Contributor
Author
This is true. However, the downside is that the script would then require git to be configured and would check out the whole repository (I couldn't find any way to get it to extract just a subdirectory). But then again, maybe that is no worse than a zip file 🤔 On the other hand, the tpcds dataset should never change 🤔 Since this method seems to work, I will go with the wget approach for now, and we could change it in a follow-on PR if desired.
alamb marked this conversation as resolved.
Outdated
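For reference, one possible way to fetch only a subdirectory with git is a sparse, partial clone. The sketch below is not part of the PR and has not been checked against this repository; it assumes git 2.25 or newer (for `git sparse-checkout`) and a server that supports partial clone. The function name `fetch_tpcds_sparse` and its arguments are hypothetical; the `tpcds/data/sf1` path is taken from the `unzip` command in the diff.

```shell
# Hypothetical alternative to the zip download; not part of the PR.
# Usage: fetch_tpcds_sparse DEST_DIR WORK_DIR [REPO_URL]
fetch_tpcds_sparse() {
    local dest_dir="$1"
    local work_dir="$2"
    local repo_url="${3:-https://github.com/apache/datafusion-benchmarks}"

    # Shallow clone without file contents, checking out only top-level files at first.
    git clone --depth 1 --filter=blob:none --sparse "${repo_url}" "${work_dir}" || return 1

    # Restrict the working tree to the SF1 data directory; git fetches those blobs on demand.
    git -C "${work_dir}" sparse-checkout set tpcds/data/sf1 || return 1

    # Copy the data files into the flat destination directory.
    mkdir -p "${dest_dir}" || return 1
    cp "${work_dir}"/tpcds/data/sf1/* "${dest_dir}/"
}
```

This still requires git to be installed, which is the trade-off raised in the reply above; the wget and unzip approach needs neither git nor a working tree.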
|
||
| fi | ||
| echo "Extracting TPC-DS parquet data to ${TPCDS_DIR}..." | ||
| unzip -o -j -d "${TPCDS_DIR}" "${DATA_DIR}/datafusion-benchmarks.zip" datafusion-benchmarks-main/tpcds/data/sf1/* | ||
| echo "TPC-DS data extracted." | ||
| fi | ||
|
|
||
| echo "" | ||
| echo "TPC-DS data already exists in ${TPCDS_DIR}" | ||
| echo "" | ||
| echo "Done." | ||
| } | ||
|
|
||
| # Runs the tpch benchmark | ||
|
|
@@ -682,21 +684,10 @@ run_tpch_mem() { | |
|
|
||
| # Runs the tpcds benchmark | ||
| run_tpcds() { | ||
| TPCDS_DIR="${DATA_DIR}" | ||
|
|
||
| # Check if TPCDS data directory exists | ||
| if [ ! -d "${TPCDS_DIR}" ]; then | ||
| echo "Error: TPC-DS data directory does not exist: ${TPCDS_DIR}" >&2 | ||
| echo "" >&2 | ||
| echo "Please prepare TPC-DS data first by following instructions:" >&2 | ||
| echo " ./bench.sh data tpcds" >&2 | ||
| echo "" >&2 | ||
| exit 1 | ||
| fi | ||
| TPCDS_DIR="${DATA_DIR}/tpcds_sf1" | ||
|
|
||
| # Check if directory contains parquet files | ||
| if ! find "${TPCDS_DIR}" -name "*.parquet" -print -quit | grep -q .; then | ||
| echo "Error: TPC-DS data directory exists but contains no parquet files: ${TPCDS_DIR}" >&2 | ||
| # Check if TPCDS data directory and representative file exists | ||
| if [ ! -f "${TPCDS_DIR}/web_site.parquet" ]; then | ||
| echo "" >&2 | ||
| echo "Please prepare TPC-DS data first by following instructions:" >&2 | ||
| echo " ./bench.sh data tpcds" >&2 | ||
|
|
||
The comment is obsolete.
The function no longer just points to instructions; it actually downloads the repo.