
Conversation


@andygrove (Member) commented Nov 24, 2020

This adds tests to the TPC-H benchmark suite that only run if the TPCH_DATA environment variable exists and points to a directory containing the generated .tbl files.

This is useful when running tests locally, and it adds a mechanism we could leverage in CI: a Docker image that includes the data generator could generate the SF=1 data set before running the cargo tests.
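As a rough sketch of that gating (the `TPCH_DATA` check is from this PR; the test body and error type here are simplified stand-ins, assuming the crate's existing tokio test setup):

```rust
/// Stand-in sketch: run the verification for TPC-H query `n` only when
/// TPCH_DATA points at a directory containing the generated .tbl files;
/// otherwise the test passes without doing anything.
async fn verify_query(n: usize) -> Result<(), Box<dyn std::error::Error>> {
    if let Ok(path) = std::env::var("TPCH_DATA") {
        // A real test would build the benchmark options from `path`
        // and execute query `n` against the data set here.
        println!("verifying TPC-H query {} against data in {}", n, path);
    }
    Ok(())
}

#[tokio::test]
async fn run_q1() -> Result<(), Box<dyn std::error::Error>> {
    verify_query(1).await
}
```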

@andygrove andygrove changed the title ARROW-10712: Add tests to TPC-H benchmarks ARROW-10712: [Rust] Add tests to TPC-H benchmarks Nov 24, 2020

verify_query(12).await
}

async fn verify_query(n: usize) -> Result<()> {
Contributor commented on the snippet above:

Maybe also test the mem_table / some other options by adding it as a parameter in verify_query?
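A hypothetical sketch of that suggestion (the `mem_table` flag appears in the diff below; this parameterized signature is an assumption):

```rust
/// Hypothetical: thread the option through verify_query so each query
/// is exercised both with and without loading into a MemTable first.
async fn verify_query(n: usize, mem_table: bool) -> Result<(), Box<dyn std::error::Error>> {
    // A real implementation would set `mem_table` on the benchmark
    // options before running query `n`.
    println!("verifying query {} with mem_table = {}", n, mem_table);
    Ok(())
}

#[tokio::test]
async fn q12_both_modes() -> Result<(), Box<dyn std::error::Error>> {
    verify_query(12, false).await?;
    verify_query(12, true).await
}
```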

@alamb (Contributor) left a comment:

looks like a nice improvement to me

verify_query(12).await
}

async fn verify_query(n: usize) -> Result<()> {
Contributor commented on the snippet above:

I think it would help to document the expectations here (e.g. copy the description from the PR into comments).
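For example, something along these lines (wording lifted from the PR description; the function body is elided, and the error type is a stand-in):

```rust
/// Verify the result of TPC-H query `n`.
///
/// These tests only run if the `TPCH_DATA` environment variable exists
/// and points to a directory containing the generated .tbl files, so
/// they are skipped in environments that lack the data set.
async fn verify_query(n: usize) -> Result<(), Box<dyn std::error::Error>> {
    // body as in the diff below
    let _ = n;
    Ok(())
}
```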

mem_table: false,
};
benchmark(opt).await?
}
Contributor commented on the snippet above:

You have a duplicate `}`.

@andygrove (Member, Author) commented:

I will be getting back to this PR in the next day or two.

@seddonm1 (Contributor) commented Dec 8, 2020

I can help with this if you can describe your plans.

@andygrove (Member, Author) commented:

Thanks @seddonm1, that would be great if you have the time. I was really just planning on addressing feedback. Feel free to push to this PR or create a new one to replace it.

@seddonm1 (Contributor) commented Dec 8, 2020

@andygrove No worries. Hopefully I can help on some of these easier tasks to free you up for the harder ones.

@seddonm1 (Contributor) commented:

Hi @andygrove
So I looked at this over the weekend. I was thinking that we could just embed the expected TPC-H answers (given the deterministic inputs we have chosen) and store them as Parquet. This is similar to how the Databricks TPC-H answers work: https://github.com/databricks/tpch-dbgen/tree/master/answers.

I have produced results with Spark as single-partition Parquet files (Snappy), which would require somewhere around 3.5 MB. To reduce data requirements, we could also apply a LIMIT n (which is what it looks like Databricks have done) and just check that the answers are contained in the result set.
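A rough sketch of how that comparison might look (assuming a recent DataFusion API; `tpch_query_sql`, the `answers/` path, and prior table registration are assumptions, not part of this PR):

```rust
use datafusion::arrow::util::pretty::pretty_format_batches;
use datafusion::prelude::*;

/// Sketch: compare the engine's answer for query `n` against the
/// embedded expected answer stored as Parquet. Assumes the TPC-H
/// tables have already been registered with `ctx`.
async fn verify_answer(ctx: &SessionContext, n: usize) -> datafusion::error::Result<()> {
    // Load the expected answer for query n from the checked-in Parquet file.
    let expected = ctx
        .read_parquet(format!("answers/q{}.parquet", n), ParquetReadOptions::default())
        .await?
        .collect()
        .await?;

    // Run the query itself.
    let actual = ctx.sql(&tpch_query_sql(n)).await?.collect().await?;

    // Compare pretty-printed batches; a real test would normalize row
    // order and floating-point precision before comparing.
    assert_eq!(
        pretty_format_batches(&expected)?.to_string(),
        pretty_format_batches(&actual)?.to_string(),
    );
    Ok(())
}

// Placeholder for loading the SQL text of TPC-H query `n`.
fn tpch_query_sql(n: usize) -> String {
    unimplemented!("return the SQL for TPC-H query {}", n)
}
```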

I have also had trouble generating the Parquet files with your program. Here is my alternative way to generate the test dataset, which needs to be run from within the tpch-dbgen directory (or you can change the volume mount):

docker run \
--rm \
--volume $(pwd):/tpch:Z \
--env "ETL_CONF_ENV=production" \
--env "CONF_NUM_PARITIONS=10" \
--env "INPUT_PATH=/tpch/tbl" \
--env "OUTPUT_PATH=/tpch/parquet" \
--env "SCHEMA_PATH=https://raw.githubusercontent.com/tripl-ai/arc-starter/master/examples/tpch/schema" \
--entrypoint="" \
--publish 4040:4040 \
ghcr.io/tripl-ai/arc:arc_3.6.2_spark_3.0.1_scala_2.12_hadoop_3.2.0_1.10.0 \
bin/spark-submit \
--master local\[\*\] \
--driver-memory 4G \
--driver-java-options "-XX:+UseG1GC" \
--class ai.tripl.arc.ARC \
/opt/spark/jars/arc.jar \
--etl.config.uri=https://raw.githubusercontent.com/tripl-ai/arc-starter/master/examples/tpch/tpch.ipynb

@seddonm1 (Contributor) commented Dec 16, 2020

@andygrove

Here is my suggested approach, which has already caught an issue:
https://github.com/apache/arrow/compare/master...seddonm1:test-tpch?expand=1

@andygrove (Member, Author) commented:

Thanks @seddonm1, I like this approach. We have a separate repo (arrow-testing), a git submodule of the main arrow repo, where we check in files like this. There is sometimes pushback (quite rightly) on adding large files there, so I think it might be a good idea to raise this proposal on the mailing list first.

There might be value here for other implementations (like C++) that are building query engines.

@seddonm1 (Contributor) commented:

Thanks @andygrove, I will raise it there. I am hoping Decimal support can land before we need to commit Parquet files to the repo.

@andygrove (Member, Author) commented:

Closing this since it is superseded by other work.
