-
Notifications
You must be signed in to change notification settings - Fork 29k
[SPARK-31311][SQL][TESTS] Benchmark date-time rebasing in ORC datasource #28076
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
|
Test build #120610 has finished for PR 28076 at commit
|
|
Could you file a JIRA before making a PR? You can make a |
|
@cloud-fan @dongjoon-hyun @HyukjinKwon Please, take a look at the PR. |
|
Test build #120639 has finished for PR 28076 at commit
|
|
Test build #120645 has finished for PR 28076 at commit
|
|
can you fix the conflicts? |
…hmark-orc # Conflicts: # sql/core/src/test/scala/org/apache/spark/sql/execution/benchmark/DateTimeRebaseBenchmark.scala
|
it's just benchmark, no need to wait for jenkins. Thanks, merging to master/3.0! |
### What changes were proposed in this pull request? In the PR, I propose to add new benchmarks to `DateTimeRebaseBenchmark` for saving and loading dates/timestamps to/from ORC files. I extracted common code from the benchmark for Parquet datasource and place it to the methods `caseName()` and `getPath()`. Added benchmarks for ORC save/load dates before and after 1582-10-15 because an implementation may have different performance for dates before the Julian calendar cutover day, see #28067 as an example. ### Why are the changes needed? To have the base line for future optimizations of `fromJavaDate()`/`toJavaDate()` and `toJavaTimestamp()`/`fromJavaTimestamp()` in `DateTimeUtils`. The methods are used while saving/loading dates/timestamps by ORC datasource. ### Does this PR introduce any user-facing change? No ### How was this patch tested? By running the updated benchmark `DateTimeRebaseBenchmark` via the command: ``` SPARK_GENERATE_BENCHMARK_FILES=1 build/sbt "sql/test:runMain org.apache.spark.sql.execution.benchmark.DateTimeRebaseBenchmark" ``` in the environment: | Item | Description | | ---- | ----| | Region | us-west-2 (Oregon) | | Instance | r3.xlarge | | AMI | ubuntu/images/hvm-ssd/ubuntu-bionic-18.04-amd64-server-20190722.1 (ami-06f2f779464715dc5) | | Java | OpenJDK 1.8.0_242-8u242/11.0.6+10 | Closes #28076 from MaxGekk/rebase-benchmark-orc. Lead-authored-by: Max Gekk <[email protected]> Co-authored-by: Maxim Gekk <[email protected]> Signed-off-by: Wenchen Fan <[email protected]> (cherry picked from commit 91af87d) Signed-off-by: Wenchen Fan <[email protected]>
|
Test build #120665 has finished for PR 28076 at commit
|
### What changes were proposed in this pull request? In the PR, I propose to add new benchmarks to `DateTimeRebaseBenchmark` for saving and loading dates/timestamps to/from ORC files. I extracted common code from the benchmark for Parquet datasource and place it to the methods `caseName()` and `getPath()`. Added benchmarks for ORC save/load dates before and after 1582-10-15 because an implementation may have different performance for dates before the Julian calendar cutover day, see apache#28067 as an example. ### Why are the changes needed? To have the base line for future optimizations of `fromJavaDate()`/`toJavaDate()` and `toJavaTimestamp()`/`fromJavaTimestamp()` in `DateTimeUtils`. The methods are used while saving/loading dates/timestamps by ORC datasource. ### Does this PR introduce any user-facing change? No ### How was this patch tested? By running the updated benchmark `DateTimeRebaseBenchmark` via the command: ``` SPARK_GENERATE_BENCHMARK_FILES=1 build/sbt "sql/test:runMain org.apache.spark.sql.execution.benchmark.DateTimeRebaseBenchmark" ``` in the environment: | Item | Description | | ---- | ----| | Region | us-west-2 (Oregon) | | Instance | r3.xlarge | | AMI | ubuntu/images/hvm-ssd/ubuntu-bionic-18.04-amd64-server-20190722.1 (ami-06f2f779464715dc5) | | Java | OpenJDK 1.8.0_242-8u242/11.0.6+10 | Closes apache#28076 from MaxGekk/rebase-benchmark-orc. Lead-authored-by: Max Gekk <[email protected]> Co-authored-by: Maxim Gekk <[email protected]> Signed-off-by: Wenchen Fan <[email protected]>
What changes were proposed in this pull request?
In the PR, I propose to add new benchmarks to
DateTimeRebaseBenchmarkfor saving and loading dates/timestamps to/from ORC files. I extracted common code from the benchmark for Parquet datasource and place it to the methodscaseName()andgetPath(). Added benchmarks for ORC save/load dates before and after 1582-10-15 because an implementation may have different performance for dates before the Julian calendar cutover day, see #28067 as an example.Why are the changes needed?
To have the base line for future optimizations of
fromJavaDate()/toJavaDate()andtoJavaTimestamp()/fromJavaTimestamp()inDateTimeUtils. The methods are used while saving/loading dates/timestamps by ORC datasource.Does this PR introduce any user-facing change?
No
How was this patch tested?
By running the updated benchmark
DateTimeRebaseBenchmarkvia the command:in the environment: