Commit 55bc4cb

[SPARK-46752][SQL][TESTS] Use default ORC compression in data source benchmarks
### What changes were proposed in this pull request?

This PR aims to use the default ORC compression in data source benchmarks.

### Why are the changes needed?

Apache ORC 2.0 and Apache Spark 4.0 will use ZStandard as the default ORC compression codec.
- apache/orc#1733
- #44654

`OrcReadBenchmark` was switched to use ZStandard for comparison.
- #44761

This PR changes the remaining three data source benchmarks, which still set SNAPPY explicitly.

```
$ git grep OrcCompressionCodec | grep Benchmark
sql/core/src/test/scala/org/apache/spark/sql/execution/benchmark/BuiltInDataSourceWriteBenchmark.scala:import org.apache.spark.sql.execution.datasources.orc.OrcCompressionCodec
sql/core/src/test/scala/org/apache/spark/sql/execution/benchmark/BuiltInDataSourceWriteBenchmark.scala: OrcCompressionCodec.SNAPPY.lowerCaseName())
sql/core/src/test/scala/org/apache/spark/sql/execution/benchmark/DataSourceReadBenchmark.scala:import org.apache.spark.sql.execution.datasources.orc.OrcCompressionCodec
sql/core/src/test/scala/org/apache/spark/sql/execution/benchmark/DataSourceReadBenchmark.scala: OrcCompressionCodec.SNAPPY.lowerCaseName()).orc(dir)
sql/core/src/test/scala/org/apache/spark/sql/execution/benchmark/FilterPushdownBenchmark.scala:import org.apache.spark.sql.execution.datasources.orc.OrcCompressionCodec
sql/core/src/test/scala/org/apache/spark/sql/execution/benchmark/FilterPushdownBenchmark.scala: .setIfMissing("orc.compression", OrcCompressionCodec.SNAPPY.lowerCaseName())
```

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Manual review.

### Was this patch authored or co-authored using generative AI tooling?

No.

Closes #44777 from dongjoon-hyun/SPARK-46752.

Authored-by: Dongjoon Hyun <[email protected]>
Signed-off-by: Dongjoon Hyun <[email protected]>
1 parent 751a81b commit 55bc4cb

9 files changed: +1395 -1402 lines changed

sql/core/benchmarks/BuiltInDataSourceWriteBenchmark-jdk21-results.txt

+30 -30

@@ -2,69 +2,69 @@
 Parquet writer benchmark
 ================================================================================================
 
-OpenJDK 64-Bit Server VM 21.0.1+12-LTS on Linux 5.15.0-1053-azure
+OpenJDK 64-Bit Server VM 21.0.2+13-LTS on Linux 5.15.0-1053-azure
 AMD EPYC 7763 64-Core Processor
 Parquet(PARQUET_1_0) writer benchmark:    Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
 ------------------------------------------------------------------------------------------------------------------------
-Output Single Int Column                           1641           1686          63          9.6         104.4       1.0X
-Output Single Double Column                        1991           2022          43          7.9         126.6       0.8X
-Output Int and String Column                       4249           4305          79          3.7         270.1       0.4X
-Output Partitions                                  3165           3189          33          5.0         201.3       0.5X
-Output Buckets                                     4429           4447          25          3.6         281.6       0.4X
+Output Single Int Column                           1677           1729          73          9.4         106.6       1.0X
+Output Single Double Column                        1879           1895          22          8.4         119.4       0.9X
+Output Int and String Column                       4135           4144          13          3.8         262.9       0.4X
+Output Partitions                                  3027           3042          21          5.2         192.4       0.6X
+Output Buckets                                     4258           4265          11          3.7         270.7       0.4X
 
-OpenJDK 64-Bit Server VM 21.0.1+12-LTS on Linux 5.15.0-1053-azure
+OpenJDK 64-Bit Server VM 21.0.2+13-LTS on Linux 5.15.0-1053-azure
 AMD EPYC 7763 64-Core Processor
 Parquet(PARQUET_2_0) writer benchmark:    Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
 ------------------------------------------------------------------------------------------------------------------------
-Output Single Int Column                           1666           1688          32          9.4         105.9       1.0X
-Output Single Double Column                        1611           1627          24          9.8         102.4       1.0X
-Output Int and String Column                       4817           4829          17          3.3         306.2       0.3X
-Output Partitions                                  3210           3237          38          4.9         204.1       0.5X
-Output Buckets                                     4375           4384          13          3.6         278.2       0.4X
+Output Single Int Column                           1770           1797          38          8.9         112.5       1.0X
+Output Single Double Column                        1673           1703          43          9.4         106.3       1.1X
+Output Int and String Column                       4330           4337          10          3.6         275.3       0.4X
+Output Partitions                                  3129           3133           6          5.0         198.9       0.6X
+Output Buckets                                     4160           4192          46          3.8         264.5       0.4X
 
 
 ================================================================================================
 ORC writer benchmark
 ================================================================================================
 
-OpenJDK 64-Bit Server VM 21.0.1+12-LTS on Linux 5.15.0-1053-azure
+OpenJDK 64-Bit Server VM 21.0.2+13-LTS on Linux 5.15.0-1053-azure
 AMD EPYC 7763 64-Core Processor
 ORC writer benchmark:                     Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
 ------------------------------------------------------------------------------------------------------------------------
-Output Single Int Column                            915            945          28         17.2          58.2       1.0X
-Output Single Double Column                        1432           1445          18         11.0          91.0       0.6X
-Output Int and String Column                       3749           3771          31          4.2         238.3       0.2X
-Output Partitions                                  2458           2472          19          6.4         156.3       0.4X
-Output Buckets                                     3339           3358          27          4.7         212.3       0.3X
+Output Single Int Column                           1043           1057          20         15.1          66.3       1.0X
+Output Single Double Column                        2217           2243          37          7.1         141.0       0.5X
+Output Int and String Column                       4123           4151          40          3.8         262.1       0.3X
+Output Partitions                                  2399           2399           1          6.6         152.5       0.4X
+Output Buckets                                     3555           3589          48          4.4         226.0       0.3X
 
 
 ================================================================================================
 JSON writer benchmark
 ================================================================================================
 
-OpenJDK 64-Bit Server VM 21.0.1+12-LTS on Linux 5.15.0-1053-azure
+OpenJDK 64-Bit Server VM 21.0.2+13-LTS on Linux 5.15.0-1053-azure
 AMD EPYC 7763 64-Core Processor
 JSON writer benchmark:                    Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
 ------------------------------------------------------------------------------------------------------------------------
-Output Single Int Column                           1395           1431          51         11.3          88.7       1.0X
-Output Single Double Column                        2105           2128          33          7.5         133.8       0.7X
-Output Int and String Column                       3762           3773          16          4.2         239.2       0.4X
-Output Partitions                                  2937           2957          28          5.4         186.7       0.5X
-Output Buckets                                     3788           3822          48          4.2         240.9       0.4X
+Output Single Int Column                           1444           1452          11         10.9          91.8       1.0X
+Output Single Double Column                        2082           2090          11          7.6         132.4       0.7X
+Output Int and String Column                       3729           3779          71          4.2         237.1       0.4X
+Output Partitions                                  2781           2804          32          5.7         176.8       0.5X
+Output Buckets                                     3664           3901         335          4.3         233.0       0.4X
 
 
 ================================================================================================
 CSV writer benchmark
 ================================================================================================
 
-OpenJDK 64-Bit Server VM 21.0.1+12-LTS on Linux 5.15.0-1053-azure
+OpenJDK 64-Bit Server VM 21.0.2+13-LTS on Linux 5.15.0-1053-azure
 AMD EPYC 7763 64-Core Processor
 CSV writer benchmark:                     Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
 ------------------------------------------------------------------------------------------------------------------------
-Output Single Int Column                           3511           3524          19          4.5         223.2       1.0X
-Output Single Double Column                        4243           4250           9          3.7         269.8       0.8X
-Output Int and String Column                       6532           6557          37          2.4         415.3       0.5X
-Output Partitions                                  5369           5374           6          2.9         341.4       0.7X
-Output Buckets                                     6524           6531          10          2.4         414.8       0.5X
+Output Single Int Column                           3771           3793          31          4.2         239.7       1.0X
+Output Single Double Column                        4404           4419          21          3.6         280.0       0.9X
+Output Int and String Column                       6020           6031          16          2.6         382.7       0.6X
+Output Partitions                                  5232           5260          38          3.0         332.7       0.7X
+Output Buckets                                     6560           6576          23          2.4         417.1       0.6X
 
 
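
To gauge how far the ORC writer numbers moved in the JDK 21 results, the old and new Best Time(ms) values from the ORC section above can be compared directly. The sketch below is only arithmetic on the published figures. The helper name `pct_change` is illustrative, and because the JDK build also changed between the two runs (21.0.1 to 21.0.2), the differences should not be attributed to the compression codec alone.

```python
# Best Time (ms) for the ORC writer benchmark, copied from the JDK 21 results:
# (value on the removed line, value on the added line).
orc_best_ms = {
    "Output Single Int Column": (915, 1043),
    "Output Single Double Column": (1432, 2217),
    "Output Int and String Column": (3749, 4123),
    "Output Partitions": (2458, 2399),
    "Output Buckets": (3339, 3555),
}


def pct_change(before_ms, after_ms):
    """Relative change in percent; a positive value means the new run was slower."""
    return (after_ms - before_ms) / before_ms * 100.0


for case, (before, after) in orc_best_ms.items():
    change = pct_change(before, after)
    print(f"{case:<30} {before:>5} -> {after:>5} ms  ({change:+.1f}%)")
```

On these figures the largest shift is in Output Single Double Column (1432 ms to 2217 ms, roughly 55% slower), while Output Partitions is slightly faster.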

sql/core/benchmarks/BuiltInDataSourceWriteBenchmark-results.txt

+30 -30

@@ -2,69 +2,69 @@
 Parquet writer benchmark
 ================================================================================================
 
-OpenJDK 64-Bit Server VM 17.0.9+9-LTS on Linux 5.15.0-1053-azure
+OpenJDK 64-Bit Server VM 17.0.10+7-LTS on Linux 5.15.0-1053-azure
 AMD EPYC 7763 64-Core Processor
 Parquet(PARQUET_1_0) writer benchmark:    Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
 ------------------------------------------------------------------------------------------------------------------------
-Output Single Int Column                           1673           1702          40          9.4         106.4       1.0X
-Output Single Double Column                        1728           1729           1          9.1         109.9       1.0X
-Output Int and String Column                       4300           4367          96          3.7         273.4       0.4X
-Output Partitions                                  3065           3156         129          5.1         194.9       0.5X
-Output Buckets                                     4114           4115           2          3.8         261.6       0.4X
+Output Single Int Column                           1663           1688          35          9.5         105.7       1.0X
+Output Single Double Column                        1748           1797          69          9.0         111.2       1.0X
+Output Int and String Column                       4313           4371          82          3.6         274.2       0.4X
+Output Partitions                                  3069           3124          77          5.1         195.1       0.5X
+Output Buckets                                     4149           4179          42          3.8         263.8       0.4X
 
-OpenJDK 64-Bit Server VM 17.0.9+9-LTS on Linux 5.15.0-1053-azure
+OpenJDK 64-Bit Server VM 17.0.10+7-LTS on Linux 5.15.0-1053-azure
 AMD EPYC 7763 64-Core Processor
 Parquet(PARQUET_2_0) writer benchmark:    Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
 ------------------------------------------------------------------------------------------------------------------------
-Output Single Int Column                           1650           1661          15          9.5         104.9       1.0X
-Output Single Double Column                        1660           1670          14          9.5         105.6       1.0X
-Output Int and String Column                       4645           4655          14          3.4         295.3       0.4X
-Output Partitions                                  3083           3086           5          5.1         196.0       0.5X
-Output Buckets                                     3937           3975          53          4.0         250.3       0.4X
+Output Single Int Column                           1618           1619           1          9.7         102.9       1.0X
+Output Single Double Column                        1639           1646          10          9.6         104.2       1.0X
+Output Int and String Column                       4493           4493           1          3.5         285.7       0.4X
+Output Partitions                                  3086           3098          16          5.1         196.2       0.5X
+Output Buckets                                     4021           4054          47          3.9         255.7       0.4X
 
 
 ================================================================================================
 ORC writer benchmark
 ================================================================================================
 
-OpenJDK 64-Bit Server VM 17.0.9+9-LTS on Linux 5.15.0-1053-azure
+OpenJDK 64-Bit Server VM 17.0.10+7-LTS on Linux 5.15.0-1053-azure
 AMD EPYC 7763 64-Core Processor
 ORC writer benchmark:                     Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
 ------------------------------------------------------------------------------------------------------------------------
-Output Single Int Column                            918            929          10         17.1          58.4       1.0X
-Output Single Double Column                        1381           1382           2         11.4          87.8       0.7X
-Output Int and String Column                       3952           4028         108          4.0         251.2       0.2X
-Output Partitions                                  2315           2319           5          6.8         147.2       0.4X
-Output Buckets                                     2904           2915          15          5.4         184.6       0.3X
+Output Single Int Column                            870            873           4         18.1          55.3       1.0X
+Output Single Double Column                        2075           2104          40          7.6         132.0       0.4X
+Output Int and String Column                       4568           4576          11          3.4         290.4       0.2X
+Output Partitions                                  2489           2492           4          6.3         158.3       0.3X
+Output Buckets                                     3555           3566          16          4.4         226.0       0.2X
 
 
 ================================================================================================
 JSON writer benchmark
 ================================================================================================
 
-OpenJDK 64-Bit Server VM 17.0.9+9-LTS on Linux 5.15.0-1053-azure
+OpenJDK 64-Bit Server VM 17.0.10+7-LTS on Linux 5.15.0-1053-azure
 AMD EPYC 7763 64-Core Processor
 JSON writer benchmark:                    Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
 ------------------------------------------------------------------------------------------------------------------------
-Output Single Int Column                           1350           1358          11         11.6          85.8       1.0X
-Output Single Double Column                        2017           2045          39          7.8         128.2       0.7X
-Output Int and String Column                       3554           3593          56          4.4         225.9       0.4X
-Output Partitions                                  2809           2816          10          5.6         178.6       0.5X
-Output Buckets                                     3656           3667          15          4.3         232.4       0.4X
+Output Single Int Column                           1348           1354           8         11.7          85.7       1.0X
+Output Single Double Column                        1958           1970          18          8.0         124.5       0.7X
+Output Int and String Column                       4112           4116           5          3.8         261.5       0.3X
+Output Partitions                                  2983           3017          49          5.3         189.6       0.5X
+Output Buckets                                     3810           3821          15          4.1         242.2       0.4X
 
 
 ================================================================================================
 CSV writer benchmark
 ================================================================================================
 
-OpenJDK 64-Bit Server VM 17.0.9+9-LTS on Linux 5.15.0-1053-azure
+OpenJDK 64-Bit Server VM 17.0.10+7-LTS on Linux 5.15.0-1053-azure
 AMD EPYC 7763 64-Core Processor
 CSV writer benchmark:                     Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
 ------------------------------------------------------------------------------------------------------------------------
-Output Single Int Column                           3151           3153           3          5.0         200.3       1.0X
-Output Single Double Column                        3469           3470           1          4.5         220.6       0.9X
-Output Int and String Column                       5851           5860          13          2.7         372.0       0.5X
-Output Partitions                                  4830           4842          16          3.3         307.1       0.7X
-Output Buckets                                     6158           6181          32          2.6         391.5       0.5X
+Output Single Int Column                           2911           2970          85          5.4         185.0       1.0X
+Output Single Double Column                        3668           3676          12          4.3         233.2       0.8X
+Output Int and String Column                       5563           5564           1          2.8         353.7       0.5X
+Output Partitions                                  4608           4636          39          3.4         293.0       0.6X
+Output Buckets                                     5745           5771          38          2.7         365.2       0.5X
 
 
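
The derived columns in these tables appear to follow from Best Time(ms) alone. As a sanity check, the sketch below recomputes Rate(M/s), Per Row(ns), and Relative for the new ORC rows in the JDK 17 results. The row count of 15 * 1024 * 1024 is an assumption inferred from the published numbers, not something stated on this page, and the function names are illustrative. Recomputed values can differ from the table in the last digit because the table's best times are already rounded to whole milliseconds.

```python
# Assumed number of rows written per benchmark case (inferred from the figures,
# not stated anywhere in the diff).
NUM_ROWS = 15 * 1024 * 1024

# New ORC writer Best Time (ms) values from the JDK 17 results.
orc_best_ms = {
    "Output Single Int Column": 870,
    "Output Single Double Column": 2075,
    "Output Int and String Column": 4568,
    "Output Partitions": 2489,
    "Output Buckets": 3555,
}


def rate_m_per_s(best_ms):
    """Millions of rows processed per second at the best observed time."""
    return NUM_ROWS / (best_ms / 1000.0) / 1e6


def per_row_ns(best_ms):
    """Nanoseconds spent per row at the best observed time."""
    return best_ms * 1e6 / NUM_ROWS


def relative(baseline_ms, best_ms):
    """Speed relative to the first case in the table (1.0X is the baseline)."""
    return baseline_ms / best_ms


baseline = orc_best_ms["Output Single Int Column"]
for case, best in orc_best_ms.items():
    print(
        f"{case:<30} rate={rate_m_per_s(best):5.1f} M/s  "
        f"per_row={per_row_ns(best):6.1f} ns  "
        f"relative={relative(baseline, best):.1f}X"
    )
```

Read this way, the Relative column compares each case to the first row of its own table, so it says nothing about how a case changed between the old and new runs; that comparison has to be made on the absolute times.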
