Skip to content

Conversation

@MaxGekk
Copy link
Member

@MaxGekk MaxGekk commented Apr 9, 2020

What changes were proposed in this pull request?

In regular ORC reader when spark.sql.orc.enableVectorizedReader is set to false, I propose to use DaysWritable in reading DATE values from ORC files. Currently, days from ORC files are converted to java.sql.Date, and then to days in Proleptic Gregorian calendar. So, the conversion to Java type can be eliminated.

Why are the changes needed?

  • The PR fixes regressions in loading dates before the 1582 year from ORC files by when vectorised ORC reader is off.
  • The changes improve performance of regular ORC reader for DATE columns.
    • x3.6 faster comparing to the current master
    • x1.9-x4.3 faster against Spark 2.4.6

Before (on JDK 8):

Load dates from ORC:                      Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
------------------------------------------------------------------------------------------------------------------------
after 1582, vec off                               39651          39686          31          2.5         396.5       1.0X
after 1582, vec on                                 3647           3660          13         27.4          36.5      10.9X
before 1582, vec off                              38155          38219          61          2.6         381.6       1.0X
before 1582, vec on                                4041           4046           6         24.7          40.4       9.8X

After (on JDK 8):

Load dates from ORC:                      Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
------------------------------------------------------------------------------------------------------------------------
after 1582, vec off                               10947          10971          28          9.1         109.5       1.0X
after 1582, vec on                                 3677           3702          36         27.2          36.8       3.0X
before 1582, vec off                              11456          11472          21          8.7         114.6       1.0X
before 1582, vec on                                4079           4103          21         24.5          40.8       2.7X

Spark 2.4.6:

Load dates from ORC:                      Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
------------------------------------------------------------------------------------------------------------------------
after 1582, vec off                               48169          48276          96          2.1         481.7       1.0X
after 1582, vec on                                 5375           5410          41         18.6          53.7       9.0X
before 1582, vec off                              22353          22482         198          4.5         223.5       2.2X
before 1582, vec on                                5474           5475           1         18.3          54.7       8.8X

Does this PR introduce any user-facing change?

No

How was this patch tested?

  • By existing tests suites like DateTimeUtilsSuite
  • Checked for hive-1.2 by:
./build/sbt -Phive-1.2 "test:testOnly *OrcHadoopFsRelationSuite"
  • Re-run DateTimeRebaseBenchmark in the environment:
Item Description
Region us-west-2 (Oregon)
Instance r3.xlarge
AMI ubuntu/images/hvm-ssd/ubuntu-bionic-18.04-amd64-server-20190722.1 (ami-06f2f779464715dc5)
Java OpenJDK 64-Bit Server VM 1.8.0_242 and OpenJDK 64-Bit Server VM 11.0.6+10

@MaxGekk MaxGekk changed the title [WIP][SQL] Speed up dates reading in ORC [SPARK-31398][SQL] Speed up reading dates in ORC Apr 9, 2020
@MaxGekk
Copy link
Member Author

MaxGekk commented Apr 9, 2020

@cloud-fan @HyukjinKwon @dongjoon-hyun Please, have a look at the PR.

@SparkQA
Copy link

SparkQA commented Apr 9, 2020

Test build #121030 has finished for PR 28169 at commit 168d64d.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA
Copy link

SparkQA commented Apr 9, 2020

Test build #121033 has finished for PR 28169 at commit 69bcf23.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@MaxGekk MaxGekk changed the title [SPARK-31398][SQL] Speed up reading dates in ORC [SPARK-31398][SQL][test-hive1.2] Speed up reading dates in ORC Apr 9, 2020
@MaxGekk
Copy link
Member Author

MaxGekk commented Apr 9, 2020

jenkins, retest this, please

@SparkQA
Copy link

SparkQA commented Apr 10, 2020

Test build #121044 has finished for PR 28169 at commit 69bcf23.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@cloud-fan
Copy link
Contributor

LGTM. Can you check the benchmark numbers with Spark 2.4? Just want to see how much perf regression we have in 3.0 after this patch.

@MaxGekk
Copy link
Member Author

MaxGekk commented Apr 10, 2020

Can you check the benchmark numbers with Spark 2.4? Just want to see how much perf regression we have in 3.0 after this patch.

@cloud-fan To have comparable results, need to port:

  1. NoOp datasource
  2. Changes in Benchmark framework to save results to files

@MaxGekk MaxGekk changed the title [SPARK-31398][SQL][test-hive1.2] Speed up reading dates in ORC [SPARK-31398][SQL] Speed up reading dates in ORC Apr 10, 2020
@cloud-fan
Copy link
Contributor

  1. we can use df.queryExecution.toRdd.foreach
  2. we don't need to commit the benchmark result in 2.4, just need to take a look at the numbers and put it in PR description.

@cloud-fan
Copy link
Contributor

retest this please

@SparkQA
Copy link

SparkQA commented Apr 10, 2020

Test build #121084 has finished for PR 28169 at commit 69bcf23.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@dongjoon-hyun
Copy link
Member

+1 for @cloud-fan 's comment.

@MaxGekk
Copy link
Member Author

MaxGekk commented Apr 10, 2020

I ran the benchmark DateTimeRebaseBenchmark on 2.4.6-SNAPSHOT (MaxGekk@9657575):

OpenJDK 64-Bit Server VM 1.8.0_242-8u242-b08-0ubuntu3~18.04-b08 on Linux 4.15.0-1063-aws
Intel(R) Xeon(R) CPU E5-2670 v2 @ 2.50GHz
Load dates from ORC:                      Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
------------------------------------------------------------------------------------------------------------------------
after 1582, vec off                               48169          48276          96          2.1         481.7       1.0X
after 1582, vec on                                 5375           5410          41         18.6          53.7       9.0X
before 1582, vec off                              22353          22482         198          4.5         223.5       2.2X
before 1582, vec on                                5474           5475           1         18.3          54.7       8.8X

Here is the PR MaxGekk#27.

  • After 1582, it is ~4 times faster
  • Before 1582, it is ~2 times faster

@MaxGekk MaxGekk changed the title [SPARK-31398][SQL] Speed up reading dates in ORC [SPARK-31398][SQL] Fix regression of loading dates before 1582 year by non-vectorized ORC reader. Apr 12, 2020
@MaxGekk
Copy link
Member Author

MaxGekk commented Apr 12, 2020

@cloud-fan @HyukjinKwon @dongjoon-hyun Please, review this PR.

@cloud-fan cloud-fan changed the title [SPARK-31398][SQL] Fix regression of loading dates before 1582 year by non-vectorized ORC reader. [SPARK-31398][SQL] Fix perf regression of loading dates before 1582 year by non-vectorized ORC reader. Apr 13, 2020
@cloud-fan cloud-fan changed the title [SPARK-31398][SQL] Fix perf regression of loading dates before 1582 year by non-vectorized ORC reader. [SPARK-31398][SQL] Fix perf regression of loading dates before 1582 year by non-vectorized ORC reader Apr 13, 2020
@cloud-fan cloud-fan closed this in cac8d1b Apr 13, 2020
cloud-fan pushed a commit that referenced this pull request Apr 13, 2020
…ear by non-vectorized ORC reader

### What changes were proposed in this pull request?
In regular ORC reader when `spark.sql.orc.enableVectorizedReader` is set to `false`, I propose to use `DaysWritable` in reading DATE values from ORC files. Currently, days from ORC files are converted to java.sql.Date, and then to days in Proleptic Gregorian calendar. So, the conversion to Java type can be eliminated.

### Why are the changes needed?
- The PR fixes regressions in loading dates before the 1582 year from ORC files by when vectorised ORC reader is off.
- The changes improve performance of regular ORC reader for DATE columns.
  - x3.6 faster comparing to the current master
  - x1.9-x4.3 faster against Spark 2.4.6

Before (on JDK 8):
```
Load dates from ORC:                      Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
------------------------------------------------------------------------------------------------------------------------
after 1582, vec off                               39651          39686          31          2.5         396.5       1.0X
after 1582, vec on                                 3647           3660          13         27.4          36.5      10.9X
before 1582, vec off                              38155          38219          61          2.6         381.6       1.0X
before 1582, vec on                                4041           4046           6         24.7          40.4       9.8X
```

After (on JDK 8):
```
Load dates from ORC:                      Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
------------------------------------------------------------------------------------------------------------------------
after 1582, vec off                               10947          10971          28          9.1         109.5       1.0X
after 1582, vec on                                 3677           3702          36         27.2          36.8       3.0X
before 1582, vec off                              11456          11472          21          8.7         114.6       1.0X
before 1582, vec on                                4079           4103          21         24.5          40.8       2.7X
```

Spark 2.4.6:
```
Load dates from ORC:                      Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
------------------------------------------------------------------------------------------------------------------------
after 1582, vec off                               48169          48276          96          2.1         481.7       1.0X
after 1582, vec on                                 5375           5410          41         18.6          53.7       9.0X
before 1582, vec off                              22353          22482         198          4.5         223.5       2.2X
before 1582, vec on                                5474           5475           1         18.3          54.7       8.8X
```

### Does this PR introduce any user-facing change?
No

### How was this patch tested?
- By existing tests suites like `DateTimeUtilsSuite`
- Checked for `hive-1.2` by:
```
./build/sbt -Phive-1.2 "test:testOnly *OrcHadoopFsRelationSuite"
```
- Re-run `DateTimeRebaseBenchmark` in the environment:

| Item | Description |
| ---- | ----|
| Region | us-west-2 (Oregon) |
| Instance | r3.xlarge |
| AMI | ubuntu/images/hvm-ssd/ubuntu-bionic-18.04-amd64-server-20190722.1 (ami-06f2f779464715dc5) |
| Java | OpenJDK 64-Bit Server VM 1.8.0_242 and OpenJDK 64-Bit Server VM 11.0.6+10 |

Closes #28169 from MaxGekk/orc-optimize-dates.

Authored-by: Max Gekk <[email protected]>
Signed-off-by: Wenchen Fan <[email protected]>
(cherry picked from commit cac8d1b)
Signed-off-by: Wenchen Fan <[email protected]>
@cloud-fan
Copy link
Contributor

thanks, merging to master/3.0!

sjincho pushed a commit to sjincho/spark that referenced this pull request Apr 15, 2020
…ear by non-vectorized ORC reader

### What changes were proposed in this pull request?
In regular ORC reader when `spark.sql.orc.enableVectorizedReader` is set to `false`, I propose to use `DaysWritable` in reading DATE values from ORC files. Currently, days from ORC files are converted to java.sql.Date, and then to days in Proleptic Gregorian calendar. So, the conversion to Java type can be eliminated.

### Why are the changes needed?
- The PR fixes regressions in loading dates before the 1582 year from ORC files by when vectorised ORC reader is off.
- The changes improve performance of regular ORC reader for DATE columns.
  - x3.6 faster comparing to the current master
  - x1.9-x4.3 faster against Spark 2.4.6

Before (on JDK 8):
```
Load dates from ORC:                      Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
------------------------------------------------------------------------------------------------------------------------
after 1582, vec off                               39651          39686          31          2.5         396.5       1.0X
after 1582, vec on                                 3647           3660          13         27.4          36.5      10.9X
before 1582, vec off                              38155          38219          61          2.6         381.6       1.0X
before 1582, vec on                                4041           4046           6         24.7          40.4       9.8X
```

After (on JDK 8):
```
Load dates from ORC:                      Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
------------------------------------------------------------------------------------------------------------------------
after 1582, vec off                               10947          10971          28          9.1         109.5       1.0X
after 1582, vec on                                 3677           3702          36         27.2          36.8       3.0X
before 1582, vec off                              11456          11472          21          8.7         114.6       1.0X
before 1582, vec on                                4079           4103          21         24.5          40.8       2.7X
```

Spark 2.4.6:
```
Load dates from ORC:                      Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
------------------------------------------------------------------------------------------------------------------------
after 1582, vec off                               48169          48276          96          2.1         481.7       1.0X
after 1582, vec on                                 5375           5410          41         18.6          53.7       9.0X
before 1582, vec off                              22353          22482         198          4.5         223.5       2.2X
before 1582, vec on                                5474           5475           1         18.3          54.7       8.8X
```

### Does this PR introduce any user-facing change?
No

### How was this patch tested?
- By existing tests suites like `DateTimeUtilsSuite`
- Checked for `hive-1.2` by:
```
./build/sbt -Phive-1.2 "test:testOnly *OrcHadoopFsRelationSuite"
```
- Re-run `DateTimeRebaseBenchmark` in the environment:

| Item | Description |
| ---- | ----|
| Region | us-west-2 (Oregon) |
| Instance | r3.xlarge |
| AMI | ubuntu/images/hvm-ssd/ubuntu-bionic-18.04-amd64-server-20190722.1 (ami-06f2f779464715dc5) |
| Java | OpenJDK 64-Bit Server VM 1.8.0_242 and OpenJDK 64-Bit Server VM 11.0.6+10 |

Closes apache#28169 from MaxGekk/orc-optimize-dates.

Authored-by: Max Gekk <[email protected]>
Signed-off-by: Wenchen Fan <[email protected]>
@MaxGekk MaxGekk deleted the orc-optimize-dates branch June 5, 2020 19:47
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants