ARROW-10712: [Rust] Add tests to TPC-H benchmarks #8760
Conversation
```rust
    verify_query(12).await
}

async fn verify_query(n: usize) -> Result<()> {
```
Maybe also test the mem_table / some other options by adding it as a parameter in verify_query?
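A minimal sketch of that suggestion; the `BenchmarkOpt` fields and the `benchmark` runner here are assumptions based on the diff hunks quoted in this thread, not the PR's actual code:

```rust
type Result<T> = std::result::Result<T, Box<dyn std::error::Error>>;

/// Assumed options struct, modeled on the fields visible in the quoted diff.
struct BenchmarkOpt {
    query: usize,
    mem_table: bool,
}

/// Stand-in for the benchmark runner referenced in the diff.
async fn benchmark(_opt: BenchmarkOpt) -> Result<()> {
    Ok(())
}

/// The suggestion above: thread a `mem_table` flag through verify_query so
/// each TPC-H query can be verified against both table providers.
async fn verify_query(n: usize, mem_table: bool) -> Result<()> {
    benchmark(BenchmarkOpt { query: n, mem_table }).await
}
```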
alamb left a comment:
looks like a nice improvement to me
```rust
    verify_query(12).await
}

async fn verify_query(n: usize) -> Result<()> {
```
I think it would help to document the expectations here (e.g. copy the description from the PR as comments).
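One way to act on that suggestion (a sketch, with the doc text lifted from the PR description rather than taken from the PR's actual code):

```rust
type Result<T> = std::result::Result<T, Box<dyn std::error::Error>>;

/// Verify TPC-H query `n` against the expected results.
///
/// Expectations (copied from the PR description): these tests only run
/// when the TPCH_DATA environment variable exists and points to a
/// directory containing `tbl` files; otherwise they are skipped.
async fn verify_query(n: usize) -> Result<()> {
    let _ = n; // query execution elided in this sketch
    Ok(())
}
```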
Co-authored-by: Andrew Lamb <[email protected]>
```rust
        mem_table: false,
    };
    benchmark(opt).await?
}
```
You have a duplicate `}`.
I will be getting back to this PR in the next day or two.

I can help with this if you can describe your plans.

Thanks @seddonm1, that would be great if you have the time. I was really just planning on addressing feedback. Feel free to push to this PR or create a new one to replace this.

@andygrove No worries. Hopefully I can help on some of these easier tasks to free you up for the harder ones.
Hi @andygrove, I have produced results with Spark using single-partition Parquet files (Snappy), and that would require somewhere around 3.5MB. We could also do a limit. I have also had trouble generating the parquet files with your program. Here is my alternative way to generate the test dataset, which needs to be run from within the directory containing the generated `tbl` files:

```bash
docker run \
  --rm \
  --volume $(pwd):/tpch:Z \
  --env "ETL_CONF_ENV=production" \
  --env "CONF_NUM_PARITIONS=10" \
  --env "INPUT_PATH=/tpch/tbl" \
  --env "OUTPUT_PATH=/tpch/parquet" \
  --env "SCHEMA_PATH=https://raw.githubusercontent.com/tripl-ai/arc-starter/master/examples/tpch/schema" \
  --entrypoint="" \
  --publish 4040:4040 \
  ghcr.io/tripl-ai/arc:arc_3.6.2_spark_3.0.1_scala_2.12_hadoop_3.2.0_1.10.0 \
  bin/spark-submit \
  --master local\[\*\] \
  --driver-memory 4G \
  --driver-java-options "-XX:+UseG1GC" \
  --class ai.tripl.arc.ARC \
  /opt/spark/jars/arc.jar \
  --etl.config.uri=https://raw.githubusercontent.com/tripl-ai/arc-starter/master/examples/tpch/tpch.ipynb
```
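For reference, per the flags above: the command mounts the working directory into the container as /tpch, reads the generated files from /tpch/tbl (INPUT_PATH), and writes the Parquet output to /tpch/parquet (OUTPUT_PATH).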
Here is my suggested approach, which has already caught an issue:

Thanks @seddonm1, I like this approach. We have a separate repo (arrow-testing) that is a git submodule of the main arrow repo, where we check in files like this. There is sometimes some pushback on adding large files here (quite rightly), so I think it might be a good idea to raise this proposal on the mailing list first. There might be value here for other implementations (like C++) that are building query engines.

Thanks @andygrove, I will raise it there. I am hoping Decimal support can land before we need to commit parquet files to the repo.
Closing this since it is superseded by other work.
This adds tests to the TPC-H benchmark suite that only run if the `TPCH_DATA` environment variable exists and points to a directory containing `tbl` files. This is useful when running tests locally, and it adds a mechanism that we could leverage in CI: we could have a Docker image that includes the data generator and generates the SF=1 data set before running the cargo tests.
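A minimal sketch of that gating mechanism, with illustrative names (the actual data loading and result comparison in the PR are elided here):

```rust
use std::env;

type Result<T> = std::result::Result<T, Box<dyn std::error::Error>>;

// Hedged sketch of the TPCH_DATA gating described above.
async fn verify_query(n: usize) -> Result<()> {
    let tpch_data = match env::var("TPCH_DATA") {
        Ok(dir) => dir,
        // Data set not available: skip rather than fail the test suite.
        Err(_) => return Ok(()),
    };
    // ... register the `tbl` files under `tpch_data` and run query `n` ...
    let _ = (tpch_data, n);
    Ok(())
}
```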