@MaxGekk
Member

@MaxGekk MaxGekk commented Mar 30, 2020

What changes were proposed in this pull request?

In the PR, I propose to add new benchmarks to DateTimeRebaseBenchmark for saving and loading dates/timestamps to/from ORC files. I extracted the common code from the Parquet datasource benchmark and placed it in the methods caseName() and getPath(). I also added ORC save/load benchmarks for dates before and after 1582-10-15, because an implementation may perform differently for dates before the Julian-to-Gregorian calendar cutover day; see #28067 for an example.
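To illustrate why dates on either side of 1582-10-15 can behave differently, here is a minimal sketch using only the Java standard library. It is not the code from this PR or from DateTimeUtils; the class and method names (`CalendarCutoverSketch`, `hybridEpochDay`, `prolepticEpochDay`, `rebaseShiftDays`) are hypothetical. It compares the day number that the hybrid Julian+Gregorian `java.util.GregorianCalendar` assigns to a calendar date with the one assigned by the proleptic Gregorian `java.time.LocalDate`.

```java
import java.time.LocalDate;
import java.util.GregorianCalendar;
import java.util.TimeZone;

// Hypothetical illustration only; not Spark's rebasing implementation.
final class CalendarCutoverSketch {
    private static final long MILLIS_PER_DAY = 86_400_000L;

    // Days since 1970-01-01 when (year, month, day) is interpreted in the
    // hybrid calendar: Julian before 1582-10-15, Gregorian from then on.
    static long hybridEpochDay(int year, int month, int day) {
        GregorianCalendar cal = new GregorianCalendar(TimeZone.getTimeZone("UTC"));
        cal.clear();                    // reset all fields, so the time is midnight UTC
        cal.set(year, month - 1, day);  // Calendar months are 0-based
        return Math.floorDiv(cal.getTimeInMillis(), MILLIS_PER_DAY);
    }

    // Days since 1970-01-01 in the proleptic Gregorian (ISO) calendar.
    static long prolepticEpochDay(int year, int month, int day) {
        return LocalDate.of(year, month, day).toEpochDay();
    }

    // Number of days by which the two calendars disagree for the same date label.
    static long rebaseShiftDays(int year, int month, int day) {
        return hybridEpochDay(year, month, day) - prolepticEpochDay(year, month, day);
    }

    public static void main(String[] args) {
        System.out.println(rebaseShiftDays(2020, 3, 30));  // 0: well after the cutover
        System.out.println(rebaseShiftDays(1582, 10, 15)); // 0: first Gregorian day
        System.out.println(rebaseShiftDays(1582, 10, 4));  // 10: last Julian day
    }
}
```

Dates on or after the cutover need no shift, while earlier dates do, so a conversion routine may take a different code path for each range. That is the reason to benchmark the two ranges separately.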

Why are the changes needed?

To establish a baseline for future optimizations of fromJavaDate()/toJavaDate() and toJavaTimestamp()/fromJavaTimestamp() in DateTimeUtils. The ORC datasource uses these methods when saving and loading dates/timestamps.

Does this PR introduce any user-facing change?

No

How was this patch tested?

By running the updated benchmark DateTimeRebaseBenchmark via the command:

SPARK_GENERATE_BENCHMARK_FILES=1 build/sbt "sql/test:runMain org.apache.spark.sql.execution.benchmark.DateTimeRebaseBenchmark"

in the environment:

| Item | Description |
| ---- | ----|
| Region | us-west-2 (Oregon) |
| Instance | r3.xlarge |
| AMI | ubuntu/images/hvm-ssd/ubuntu-bionic-18.04-amd64-server-20190722.1 (ami-06f2f779464715dc5) |
| Java | OpenJDK 1.8.0_242-8u242/11.0.6+10 |

@SparkQA

SparkQA commented Mar 31, 2020

Test build #120610 has finished for PR 28076 at commit c71829d.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@dongjoon-hyun
Member

Could you file a JIRA before making a PR? You can use GitHub's draft PR feature for WIP.

@MaxGekk MaxGekk changed the title from "[WIP][SQL] Benchmark dates/timestamps rebasing in ORC datasource" to "[WIP][SPARK-31311][SQL][TESTS] Benchmark date-time rebasing in ORC datasource" on Mar 31, 2020
@MaxGekk
Member Author

MaxGekk commented Mar 31, 2020

@cloud-fan @dongjoon-hyun @HyukjinKwon Please take a look at the PR.

@SparkQA

SparkQA commented Mar 31, 2020

Test build #120639 has finished for PR 28076 at commit 33694d2.

  • This patch fails Spark unit tests.
  • This patch does not merge cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Mar 31, 2020

Test build #120645 has finished for PR 28076 at commit bdb7bc7.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@MaxGekk MaxGekk changed the title from "[WIP][SPARK-31311][SQL][TESTS] Benchmark date-time rebasing in ORC datasource" to "[SPARK-31311][SQL][TESTS] Benchmark date-time rebasing in ORC datasource" on Mar 31, 2020
@cloud-fan
Contributor

can you fix the conflicts?

MaxGekk added 2 commits April 1, 2020 08:39
…hmark-orc

# Conflicts:
#	sql/core/src/test/scala/org/apache/spark/sql/execution/benchmark/DateTimeRebaseBenchmark.scala
@cloud-fan
Contributor

it's just benchmark, no need to wait for jenkins.

Thanks, merging to master/3.0!

@cloud-fan cloud-fan closed this in 91af87d Apr 1, 2020
cloud-fan pushed a commit that referenced this pull request Apr 1, 2020
Closes #28076 from MaxGekk/rebase-benchmark-orc.

Lead-authored-by: Max Gekk <[email protected]>
Co-authored-by: Maxim Gekk <[email protected]>
Signed-off-by: Wenchen Fan <[email protected]>
(cherry picked from commit 91af87d)
Signed-off-by: Wenchen Fan <[email protected]>
@SparkQA

SparkQA commented Apr 1, 2020

Test build #120665 has finished for PR 28076 at commit 91d1133.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

sjincho pushed a commit to sjincho/spark that referenced this pull request Apr 15, 2020
@MaxGekk MaxGekk deleted the rebase-benchmark-orc branch June 5, 2020 19:46
4 participants