
Conversation


@andygrove (Member) commented Nov 24, 2020

This adds tests to the TPC-H benchmark suite that only run if the TPCH_DATA environment variable exists and points to a directory containing the generated .tbl files.

This is useful when running tests locally, and it adds a mechanism we could leverage in CI: a Docker image that includes the data generator could generate the SF=1 data set before running the cargo tests.
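As a rough sketch of that gating (the `TPCH_DATA` check is from this PR; the test body and error type here are simplified stand-ins, assuming the crate's existing tokio test setup):

```rust
/// Stand-in sketch: run the verification for TPC-H query `n` only when
/// TPCH_DATA points at a directory containing the generated .tbl files;
/// otherwise the test passes without doing anything.
async fn verify_query(n: usize) -> Result<(), Box<dyn std::error::Error>> {
    if let Ok(path) = std::env::var("TPCH_DATA") {
        // A real test would build the benchmark options from `path`
        // and execute query `n` against the data set here.
        println!("verifying TPC-H query {} against data in {}", n, path);
    }
    Ok(())
}

#[tokio::test]
async fn run_q1() -> Result<(), Box<dyn std::error::Error>> {
    verify_query(1).await
}
```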

@andygrove andygrove changed the title ARROW-10712: Add tests to TPC-H benchmarks ARROW-10712: [Rust] Add tests to TPC-H benchmarks Nov 24, 2020

verify_query(12).await
}

async fn verify_query(n: usize) -> Result<()> {
Contributor commented on the snippet above:

Maybe also test the mem_table / some other options by adding it as a parameter in verify_query?
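A hypothetical sketch of that suggestion (the `mem_table` flag appears in the diff below; this parameterized signature is an assumption):

```rust
/// Hypothetical: thread the option through verify_query so each query
/// is exercised both with and without loading into a MemTable first.
async fn verify_query(n: usize, mem_table: bool) -> Result<(), Box<dyn std::error::Error>> {
    // A real implementation would set `mem_table` on the benchmark
    // options before running query `n`.
    println!("verifying query {} with mem_table = {}", n, mem_table);
    Ok(())
}

#[tokio::test]
async fn q12_both_modes() -> Result<(), Box<dyn std::error::Error>> {
    verify_query(12, false).await?;
    verify_query(12, true).await
}
```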

@alamb (Contributor) left a comment:

looks like a nice improvement to me

verify_query(12).await
}

async fn verify_query(n: usize) -> Result<()> {
Contributor commented on the snippet above:

I think it would help to document the expectations here (e.g. copy the description from the PR into comments).
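For example, something along these lines (wording lifted from the PR description; the function body is elided, and the error type is a stand-in):

```rust
/// Verify the result of TPC-H query `n`.
///
/// These tests only run if the `TPCH_DATA` environment variable exists
/// and points to a directory containing the generated .tbl files, so
/// they are skipped in environments that lack the data set.
async fn verify_query(n: usize) -> Result<(), Box<dyn std::error::Error>> {
    // body as in the diff below
    let _ = n;
    Ok(())
}
```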

mem_table: false,
};
benchmark(opt).await?
}
Contributor commented on the snippet above:

You have a duplicate `}`.

@andygrove (Member, Author) commented:

I will be getting back to this PR in the next day or two.

@seddonm1 (Contributor) commented Dec 8, 2020

I can help with this if you can describe your plans.

@andygrove (Member, Author) commented:

Thanks @seddonm1, that would be great if you have the time. I was really just planning on addressing feedback. Feel free to push to this PR or create a new one to replace it.

@seddonm1 (Contributor) commented Dec 8, 2020

@andygrove No worries. Hopefully I can help on some of these easier tasks to free you up for the harder ones.

@seddonm1 (Contributor) commented:

Hi @andygrove
So I looked at this over the weekend. I was thinking that we could just embed the expected TPC-H answers (given the deterministic inputs we have chosen) and store them as Parquet. This is similar to how the Databricks TPC-H answers work: https://github.com/databricks/tpch-dbgen/tree/master/answers.

I have produced results with Spark as single-partition Parquet files (Snappy), which would require somewhere around 3.5 MB. To reduce data requirements, we could also apply a LIMIT n (which is what it looks like Databricks have done) and just check that the answers are contained in the result set.
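A rough sketch of how that comparison might look (assuming a recent DataFusion API; `tpch_query_sql`, the `answers/` path, and prior table registration are assumptions, not part of this PR):

```rust
use datafusion::arrow::util::pretty::pretty_format_batches;
use datafusion::prelude::*;

/// Sketch: compare the engine's answer for query `n` against the
/// embedded expected answer stored as Parquet. Assumes the TPC-H
/// tables have already been registered with `ctx`.
async fn verify_answer(ctx: &SessionContext, n: usize) -> datafusion::error::Result<()> {
    // Load the expected answer for query n from the checked-in Parquet file.
    let expected = ctx
        .read_parquet(format!("answers/q{}.parquet", n), ParquetReadOptions::default())
        .await?
        .collect()
        .await?;

    // Run the query itself.
    let actual = ctx.sql(&tpch_query_sql(n)).await?.collect().await?;

    // Compare pretty-printed batches; a real test would normalize row
    // order and floating-point precision before comparing.
    assert_eq!(
        pretty_format_batches(&expected)?.to_string(),
        pretty_format_batches(&actual)?.to_string(),
    );
    Ok(())
}

// Placeholder for loading the SQL text of TPC-H query `n`.
fn tpch_query_sql(n: usize) -> String {
    unimplemented!("return the SQL for TPC-H query {}", n)
}
```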

I have also had trouble generating the Parquet files with your program. Here is my alternative way to generate the test dataset, which needs to be run from within the tpch-dbgen directory (or you can change the volume mount):

docker run \
--rm \
--volume $(pwd):/tpch:Z \
--env "ETL_CONF_ENV=production" \
--env "CONF_NUM_PARITIONS=10" \
--env "INPUT_PATH=/tpch/tbl" \
--env "OUTPUT_PATH=/tpch/parquet" \
--env "SCHEMA_PATH=https://raw.githubusercontent.com/tripl-ai/arc-starter/master/examples/tpch/schema" \
--entrypoint="" \
--publish 4040:4040 \
ghcr.io/tripl-ai/arc:arc_3.6.2_spark_3.0.1_scala_2.12_hadoop_3.2.0_1.10.0 \
bin/spark-submit \
--master local\[\*\] \
--driver-memory 4G \
--driver-java-options "-XX:+UseG1GC" \
--class ai.tripl.arc.ARC \
/opt/spark/jars/arc.jar \
--etl.config.uri=https://raw.githubusercontent.com/tripl-ai/arc-starter/master/examples/tpch/tpch.ipynb

@seddonm1 (Contributor) commented Dec 16, 2020

@andygrove

Here is my suggested approach, which has already caught an issue:
https://github.com/apache/arrow/compare/master...seddonm1:test-tpch?expand=1

@andygrove (Member, Author) commented:

Thanks @seddonm1, I like this approach. We have a separate repo (arrow-testing), a git submodule of the main arrow repo, where we check in files like this. There is sometimes pushback (quite rightly) on adding large files there, so I think it might be a good idea to raise this proposal on the mailing list first.

There might be value here for other implementations (like C++) that are building query engines.

@seddonm1 (Contributor) commented:

Thanks @andygrove, I will raise it there. I am hoping Decimal support can land before we need to commit Parquet files to the repo.

@andygrove (Member, Author) commented:

Closing this since it is superseded by other work.
