[SPARK-31398][SQL] Fix perf regression of loading dates before 1582 year by non-vectorized ORC reader #28169

MaxGekk · 2020-04-09T11:20:55Z

What changes were proposed in this pull request?

In the regular (non-vectorized) ORC reader, which is used when spark.sql.orc.enableVectorizedReader is set to false, I propose to use DaysWritable for reading DATE values from ORC files. Currently, days from ORC files are converted to java.sql.Date and then to days in the Proleptic Gregorian calendar, so the intermediate conversion to the Java type can be eliminated.
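
To make the idea of rebasing a day count concrete, here is a minimal Python sketch. It is not Spark's code and makes no claim about how `DaysWritable` or `rebaseJulianToGregorianDays` are implemented. The function names and the use of standard Julian Day Number arithmetic are assumptions chosen for illustration. The sketch keeps the same calendar fields (year, month, day) and recomputes the day count in the Proleptic Gregorian calendar, working on integers without building an intermediate date object.

```python
# Illustrative sketch only; all names below are hypothetical, not Spark APIs.
EPOCH_JDN = 2440588               # Julian Day Number of 1970-01-01 (Gregorian)
GREGORIAN_CUTOVER_DAYS = -141427  # 1582-10-15 expressed as days since the epoch


def julian_calendar_fields(jdn):
    """Decode a Julian Day Number into (year, month, day) of the Julian calendar."""
    f = jdn + 1401
    e = 4 * f + 3
    g = (e % 1461) // 4
    h = 5 * g + 2
    day = (h % 153) // 5 + 1
    month = (h // 153 + 2) % 12 + 1
    year = e // 1461 - 4716 + (14 - month) // 12
    return year, month, day


def gregorian_jdn(year, month, day):
    """Encode Proleptic Gregorian calendar fields as a Julian Day Number."""
    a = (14 - month) // 12
    y = year + 4800 - a
    m = month + 12 * a - 3
    return (day + (153 * m + 2) // 5 + 365 * y
            + y // 4 - y // 100 + y // 400 - 32045)


def rebase_julian_to_gregorian_days(days):
    """Rebase days-since-epoch from the hybrid Julian+Gregorian calendar
    to the Proleptic Gregorian calendar, keeping the calendar fields."""
    if days >= GREGORIAN_CUTOVER_DAYS:
        return days  # both calendars agree on and after 1582-10-15
    # Before the cutover the stored count refers to a Julian calendar date.
    year, month, day = julian_calendar_fields(days + EPOCH_JDN)
    # Assumption: Julian-only leap days (e.g. 1500-02-29) are not special-cased.
    return gregorian_jdn(year, month, day) - EPOCH_JDN
```

For instance, `rebase_julian_to_gregorian_days(-141428)` (the Julian date 1582-10-04, one day before the cutover) returns `-141438`, while any value at or after `-141427` is returned unchanged.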

Why are the changes needed?

The PR fixes the performance regression in loading dates before the year 1582 from ORC files when the vectorized ORC reader is off.
The changes improve the performance of the regular ORC reader for DATE columns:
- about 3.6x faster than the current master
- 1.9x-4.3x faster than Spark 2.4.6

Before (on JDK 8):

Load dates from ORC:                      Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
------------------------------------------------------------------------------------------------------------------------
after 1582, vec off                               39651          39686          31          2.5         396.5       1.0X
after 1582, vec on                                 3647           3660          13         27.4          36.5      10.9X
before 1582, vec off                              38155          38219          61          2.6         381.6       1.0X
before 1582, vec on                                4041           4046           6         24.7          40.4       9.8X

After (on JDK 8):

Load dates from ORC:                      Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
------------------------------------------------------------------------------------------------------------------------
after 1582, vec off                               10947          10971          28          9.1         109.5       1.0X
after 1582, vec on                                 3677           3702          36         27.2          36.8       3.0X
before 1582, vec off                              11456          11472          21          8.7         114.6       1.0X
before 1582, vec on                                4079           4103          21         24.5          40.8       2.7X

Spark 2.4.6:

Load dates from ORC:                      Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
------------------------------------------------------------------------------------------------------------------------
after 1582, vec off                               48169          48276          96          2.1         481.7       1.0X
after 1582, vec on                                 5375           5410          41         18.6          53.7       9.0X
before 1582, vec off                              22353          22482         198          4.5         223.5       2.2X
before 1582, vec on                                5474           5475           1         18.3          54.7       8.8X
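
The speedups can be re-derived from the "vec off" rows of the three tables above. The short Python sketch below does only that arithmetic on the Best Time(ms) column. The dictionary keys and the `speedup` helper are illustrative names, not part of the benchmark. The resulting ratios come out close to the figures quoted in the description.

```python
# Best Time(ms) of the "vec off" rows, copied from the three tables above.
best_ms = {
    "master before this PR": {"after 1582": 39651, "before 1582": 38155},
    "with this PR": {"after 1582": 10947, "before 1582": 11456},
    "Spark 2.4.6": {"after 1582": 48169, "before 1582": 22353},
}


def speedup(baseline, case):
    """How many times faster the patched reader is than the given baseline."""
    return best_ms[baseline][case] / best_ms["with this PR"][case]


for baseline in ("master before this PR", "Spark 2.4.6"):
    for case in ("after 1582", "before 1582"):
        print(f"{baseline:>22}, {case:<11}: {speedup(baseline, case):.2f}x")
```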

Does this PR introduce any user-facing change?

No

How was this patch tested?

By existing test suites such as DateTimeUtilsSuite
Checked for hive-1.2 by:

./build/sbt -Phive-1.2 "test:testOnly *OrcHadoopFsRelationSuite"

Re-ran DateTimeRebaseBenchmark in the following environment:

Item	Description
Region	us-west-2 (Oregon)
Instance	r3.xlarge
AMI	ubuntu/images/hvm-ssd/ubuntu-bionic-18.04-amd64-server-20190722.1 (ami-06f2f779464715dc5)
Java	OpenJDK 64-Bit Server VM 1.8.0_242 and OpenJDK 64-Bit Server VM 11.0.6+10

MaxGekk · 2020-04-09T13:39:28Z

@cloud-fan @HyukjinKwon @dongjoon-hyun Please have a look at the PR.

SparkQA · 2020-04-09T15:59:47Z

Test build #121030 has finished for PR 28169 at commit 168d64d.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2020-04-09T18:16:05Z

Test build #121033 has finished for PR 28169 at commit 69bcf23.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

MaxGekk · 2020-04-09T20:05:12Z

jenkins, retest this, please

SparkQA · 2020-04-10T00:22:31Z

Test build #121044 has finished for PR 28169 at commit 69bcf23.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

cloud-fan · 2020-04-10T04:12:03Z

LGTM. Can you check the benchmark numbers with Spark 2.4? Just want to see how much perf regression we have in 3.0 after this patch.

MaxGekk · 2020-04-10T04:51:27Z

> Can you check the benchmark numbers with Spark 2.4? Just want to see how much perf regression we have in 3.0 after this patch.

@cloud-fan To get comparable results, we need to port:

- the NoOp datasource
- the changes in the Benchmark framework that save results to files

cloud-fan · 2020-04-10T11:19:17Z

- We can use df.queryExecution.toRdd.foreach.
- We don't need to commit the benchmark result in 2.4; we just need to take a look at the numbers and put them in the PR description.

cloud-fan · 2020-04-10T11:19:32Z

retest this please

SparkQA · 2020-04-10T15:59:51Z

Test build #121084 has finished for PR 28169 at commit 69bcf23.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

dongjoon-hyun · 2020-04-10T19:43:01Z

+1 for @cloud-fan's comment.

MaxGekk · 2020-04-10T20:40:00Z

I ran the benchmark DateTimeRebaseBenchmark on 2.4.6-SNAPSHOT (MaxGekk@9657575):

OpenJDK 64-Bit Server VM 1.8.0_242-8u242-b08-0ubuntu3~18.04-b08 on Linux 4.15.0-1063-aws
Intel(R) Xeon(R) CPU E5-2670 v2 @ 2.50GHz
Load dates from ORC:                      Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
------------------------------------------------------------------------------------------------------------------------
after 1582, vec off                               48169          48276          96          2.1         481.7       1.0X
after 1582, vec on                                 5375           5410          41         18.6          53.7       9.0X
before 1582, vec off                              22353          22482         198          4.5         223.5       2.2X
before 1582, vec on                                5474           5475           1         18.3          54.7       8.8X

Here is the PR MaxGekk#27.

Compared with 2.4.6, the non-vectorized reader with this PR is:

- ~4 times faster for dates after 1582
- ~2 times faster for dates before 1582

MaxGekk · 2020-04-12T16:40:20Z

@cloud-fan @HyukjinKwon @dongjoon-hyun Please review this PR.

…ear by non-vectorized ORC reader

Closes #28169 from MaxGekk/orc-optimize-dates.

Authored-by: Max Gekk <[email protected]>
Signed-off-by: Wenchen Fan <[email protected]>
(cherry picked from commit cac8d1b)
Signed-off-by: Wenchen Fan <[email protected]>

cloud-fan · 2020-04-13T05:30:42Z

thanks, merging to master/3.0!

…ear by non-vectorized ORC reader

Closes apache#28169 from MaxGekk/orc-optimize-dates.

Authored-by: Max Gekk <[email protected]>
Signed-off-by: Wenchen Fan <[email protected]>

MaxGekk added 6 commits April 9, 2020 14:18

Use rebaseJulianToGregorianDays in read

168d64d

Remove import

1bf0920

Remove import

6740329

Re-gen DateTimeRebaseBenchmark results on JDK 8

1438194

Re-gen DateTimeRebaseBenchmark results on JDK 11

9b5ca6b

Merge branch 'orc-optimize-dates' of github.com:MaxGekk/spark into orc-optimize-dates

69bcf23

MaxGekk changed the title ~~[WIP][SQL] Speed up dates reading in ORC~~ [SPARK-31398][SQL] Speed up reading dates in ORC Apr 9, 2020

MaxGekk changed the title ~~[SPARK-31398][SQL] Speed up reading dates in ORC~~ [SPARK-31398][SQL][test-hive1.2] Speed up reading dates in ORC Apr 9, 2020

MaxGekk changed the title ~~[SPARK-31398][SQL][test-hive1.2] Speed up reading dates in ORC~~ [SPARK-31398][SQL] Speed up reading dates in ORC Apr 10, 2020

MaxGekk changed the title ~~[SPARK-31398][SQL] Speed up reading dates in ORC~~ [SPARK-31398][SQL] Fix regression of loading dates before 1582 year by non-vectorized ORC reader. Apr 12, 2020

cloud-fan changed the title ~~[SPARK-31398][SQL] Fix regression of loading dates before 1582 year by non-vectorized ORC reader.~~ [SPARK-31398][SQL] Fix perf regression of loading dates before 1582 year by non-vectorized ORC reader. Apr 13, 2020

cloud-fan changed the title ~~[SPARK-31398][SQL] Fix perf regression of loading dates before 1582 year by non-vectorized ORC reader.~~ [SPARK-31398][SQL] Fix perf regression of loading dates before 1582 year by non-vectorized ORC reader Apr 13, 2020

cloud-fan closed this in cac8d1b Apr 13, 2020

MaxGekk deleted the orc-optimize-dates branch June 5, 2020 19:47
