[SPARK-35345][SQL] Add Parquet tests to BloomFilterBenchmark #32473
Conversation
dongjoon-hyun left a comment:
Thank you, @huaxingao .
dongjoon-hyun left a comment:
Also @huaxingao, it's recommended to generate benchmark results through GitHub Actions: https://spark.apache.org/developer-tools.html#github-workflow-benchmarks

Kubernetes integration test unable to build dist. Exiting with code: 1

Test build #138272 has finished for PR 32473 at commit

The benchmark test result for the Parquet bloom filter is quite depressing. Did I do anything wrong?

Test build #138296 has finished for PR 32473 at commit

Kubernetes integration test starting

Kubernetes integration test status failure
```
Read a row from 100M rows:   Best Time(ms)   Avg Time(ms)   Stdev(ms)   Rate(M/s)   Per Row(ns)   Relative
------------------------------------------------------------------------------------------------------------------------
Without bloom filter                  1201           1230          41       83.3          12.0       1.0X
With bloom filter                     1262           1301          54       79.2          12.6       1.0X
```
> The benchmark test result for the Parquet bloom filter is quite depressing. Did I do anything wrong?

It looks strange to me, too.
```scala
}
benchmark.addCase("With bloom filter") { _ =>
  df.write.mode("overwrite")
    .option(ParquetOutputFormat.BLOOM_FILTER_ENABLED + "#value", true)
```
Please debug with this code path. This might be insufficient to enable bloom filters in the Parquet library for some reason. The usual suspects: 1) the parameter is not handed over correctly to Parquet, 2) the parameter requires other parameters, 3) this specific data type is not yet well supported in Parquet.
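As a first step for suspect 1), here is a minimal sketch (assuming parquet-hadoop 1.12+ on the classpath; the part-file name is a placeholder, and `path` is the benchmark's output directory) of how one might check whether the written files actually carry a bloom filter for the column:

```scala
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.Path
import org.apache.parquet.hadoop.ParquetFileReader
import org.apache.parquet.hadoop.util.HadoopInputFile
import scala.collection.JavaConverters._

// Open one written file (placeholder name) and inspect its footer metadata.
val inputFile = HadoopInputFile.fromPath(
  new Path(path + "/withBF/part-00000.parquet"), new Configuration())
val reader = ParquetFileReader.open(inputFile)
try {
  reader.getFooter.getBlocks.asScala.foreach { block =>
    block.getColumns.asScala.foreach { col =>
      // getBloomFilterOffset returns -1 when no bloom filter was written.
      println(s"${col.getPath}: bloomFilterOffset=${col.getBloomFilterOffset}")
    }
  }
} finally {
  reader.close()
}
```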
cc @ggershinsky since he is a Parquet committer.
BTW, thank you for leading this effort in the Apache Spark community, @huaxingao. Since this is the first try, you will remove the roadblocks in advance for the users. If there is a drawback, you can file a JIRA with the Apache Parquet community and document it in the Apache Spark 3.2.0 documentation until it's resolved in the Apache Parquet community.
cc @wangyum too
```scala
df.write.parquet(path + "/withoutBF")
df.write.option(ParquetOutputFormat.BLOOM_FILTER_ENABLED + "#value", true)
  .parquet(path + "/withBF")
```
You need to set the row group size, e.g.

```scala
df.write.option(ParquetOutputFormat.BLOOM_FILTER_ENABLED + "#value", true)
  .option("parquet.block.size", 1024 * 1024)
  .parquet(path + "/withBF")
```

Then you will see the benchmark difference. (Bloom filters are stored per row group, so a smaller `parquet.block.size` produces more row groups and gives the filter more chances to skip data.)
```
[info] Running benchmark: Read a row from 100M rows
[info]   Running case: Without bloom filter
[info]   Stopped after 3 iterations, 2674 ms
[info]   Running case: With bloom filter
[info]   Stopped after 5 iterations, 2383 ms
[info] OpenJDK 64-Bit Server VM 1.8.0_265-b01 on Mac OS X 10.16
[info] Intel(R) Core(TM) i7-9750H CPU @ 2.60GHz
[info] Read a row from 100M rows:   Best Time(ms)   Avg Time(ms)   Stdev(ms)   Rate(M/s)   Per Row(ns)   Relative
[info] ------------------------------------------------------------------------------------------------------------------------
[info] Without bloom filter                   872            892          19      114.7           8.7       1.0X
[info] With bloom filter                      473            477           3      211.4           4.7       1.8X
```
Wow, that's a great improvement! I wonder if we should make the row group size a parameter for the benchmark too. There seem to be other related parameters as well, such as DEFAULT_MAX_BLOOM_FILTER_BYTES.
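For reference, a minimal sketch of the related knobs (option keys as defined by parquet-hadoop 1.12; the values are illustrative assumptions, not tuned recommendations):

```scala
import org.apache.parquet.hadoop.ParquetOutputFormat

df.write
  .option(ParquetOutputFormat.BLOOM_FILTER_ENABLED + "#value", true)
  // Expected number of distinct values for the column; sizes the filter.
  .option("parquet.bloom.filter.expected.ndv#value", 100L * 1000 * 1000)
  // Upper bound on the serialized filter size per column chunk
  // (DEFAULT_MAX_BLOOM_FILTER_BYTES is the library default for this cap).
  .option("parquet.bloom.filter.max.bytes", 1024 * 1024)
  .parquet(path + "/withBF")
```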
The result looks really good now. Thanks, everyone! You guys are extremely helpful!
```scala
withTempPath { dir =>
  val path = dir.getCanonicalPath

  df.write.parquet(path + "/withoutBF")
```
When comparing withoutBF and withBF, I think we should use the same parquet.block.size for both cases. It might also affect the numbers for withoutBF.
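A minimal sketch of that suggestion, pieced together from the snippets above (the block-size value is an assumption for illustration):

```scala
// Use one block size for both paths so the bloom filter option is the
// only difference between the two benchmark cases.
val blockSize = 2 * 1024 * 1024  // illustrative value
df.write
  .option("parquet.block.size", blockSize)
  .parquet(path + "/withoutBF")
df.write
  .option("parquet.block.size", blockSize)
  .option(ParquetOutputFormat.BLOOM_FILTER_ENABLED + "#value", true)
  .parquet(path + "/withBF")
```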
Sorry, I forgot to change that. Fixed. Will update the results.
Kubernetes integration test starting

Kubernetes integration test status failure

Kubernetes integration test starting

Kubernetes integration test status failure
Shall we change the grouping in order to see the trend according to the block size? For example,
```
...
Without bloom filter, blocksize: 8388608      1005           1011           8       99.5          10.0       1.0X
Without bloom filter, blocksize: 9437184       992           1002          14      100.8           9.9       1.0X
With bloom filter, blocksize: 8388608          385            404          20      259.6           3.9       2.6X
With bloom filter, blocksize: 9437184          521            538          16      191.9           5.2       1.9X
...
```
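A sketch of how the cases could be grouped that way (the shape is inferred from the discussion; `benchmark`, `spark`, and per-size output paths like `path + s"/withBF-$blockSize"` are assumptions about the surrounding benchmark code):

```scala
// 8388608 = 8 MiB and 9437184 = 9 MiB, matching the sizes in the sample output.
val blockSizes = (2 to 9).map(_ * 1024 * 1024)

// One case per block size within each group, so the report shows the trend.
blockSizes.foreach { blockSize =>
  benchmark.addCase(s"Without bloom filter, blocksize: $blockSize") { _ =>
    spark.read.parquet(path + s"/withoutBF-$blockSize").where("value = 0")
      .write.format("noop").mode("overwrite").save()
  }
}
blockSizes.foreach { blockSize =>
  benchmark.addCase(s"With bloom filter, blocksize: $blockSize") { _ =>
    spark.read.parquet(path + s"/withBF-$blockSize").where("value = 0")
      .write.format("noop").mode("overwrite").save()
  }
}
```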
Kubernetes integration test starting

Kubernetes integration test status failure

Kubernetes integration test unable to build dist. Exiting with code: 1

Kubernetes integration test starting

Kubernetes integration test status success

Kubernetes integration test starting
viirya left a comment:
Is the benchmark run with the built-in In predicate in Parquet, or is it still using the current one in Spark?
Kubernetes integration test status success

Test build #139523 has finished for PR 32473 at commit

Test build #139527 has finished for PR 32473 at commit

Test build #139533 has finished for PR 32473 at commit

Test build #139517 has finished for PR 32473 at commit
It's still using the current one in Spark. That's why I kept decreasing the number of predicates; otherwise, I got an OOM.
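For context, a sketch of the kind of IN-set read under discussion (the value list and path are illustrative assumptions; a long enough IN list is what stressed Spark's own In predicate handling here):

```scala
import org.apache.spark.sql.functions.col

// Read back with a large IN list; Spark, not Parquet, evaluates the In
// predicate in this setup, so very long lists can blow up memory.
spark.read.parquet(path + "/withBF")
  .where(col("value").isin((1 to 300).map(_ * 1000): _*))
  .write.format("noop").mode("overwrite").save()
```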
```
Without bloom filter                    70             76           6       14.2          70.2       1.0X
With bloom filter                       73            103          22       13.8          72.6       1.0X
```
The bloom filter is slower. Is it due to the IN predicate problem?
For JDK 8, the bloom filter seems a bit faster.
It's not due to the IN predicate problem, because ORC also seems a bit slower with a bloom filter. I think the data is too small. Let me increase the data size and try again.
```scala
df2.repartition(col("value")).sort(col("value"))
  .write.option("orc.bloom.filter.columns", "value").orc(path + "/withBF")

runBenchmark(s"ORC Read for IN set") {
```
nit: s"" is not needed.
```scala
  .write.option("orc.bloom.filter.columns", "value").orc(path + "/withBF")

runBenchmark(s"ORC Read for IN set") {
  val benchmark = new Benchmark(s"Read a row from 1M rows", 1000 * 1000, output = output)
```
ditto.
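Applied, the two nits amount to dropping the interpolator from both literals:

```scala
runBenchmark("ORC Read for IN set") {
  val benchmark = new Benchmark("Read a row from 1M rows", 1000 * 1000, output = output)
```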
viirya left a comment:
Some unnecessary s"" usages, and a question about the IN predicate with the bloom filter case in Parquet. Otherwise looks okay.
Test build #139689 has finished for PR 32473 at commit

Kubernetes integration test unable to build dist. Exiting with code: 1

Test build #142484 has finished for PR 32473 at commit

Test build #142490 has finished for PR 32473 at commit
This PR is kind of messy, so I will close it. The new PR is here: #34594
What changes were proposed in this pull request?
Add a BloomFilter benchmark test for Parquet.
Why are the changes needed?
Currently, we only have a BloomFilter benchmark test for ORC. This adds one for Parquet too.
Does this PR introduce any user-facing change?
No.
How was this patch tested?
Ran the newly added benchmark test.
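For reference, a rough sketch of the shape of the added Parquet benchmark, pieced together from the snippets in this conversation (`df`, `path`, and the sizes are assumptions; this is not the exact code merged in #34594):

```scala
import org.apache.parquet.hadoop.ParquetOutputFormat

runBenchmark("Parquet Write") {
  val benchmark = new Benchmark("Write 100M rows", 100 * 1000 * 1000, output = output)
  benchmark.addCase("Without bloom filter") { _ =>
    df.write.mode("overwrite").parquet(path + "/withoutBF")
  }
  benchmark.addCase("With bloom filter") { _ =>
    df.write.mode("overwrite")
      .option(ParquetOutputFormat.BLOOM_FILTER_ENABLED + "#value", true)
      // Smaller row groups, per the review discussion above.
      .option("parquet.block.size", 2 * 1024 * 1024)
      .parquet(path + "/withBF")
  }
  benchmark.run()
}
```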