[SPARK-33160][SQL] Allow saving/loading INT96 in parquet w/o rebasing #30056
Conversation
@cloud-fan @tomvanbussel @ala @mswit-databricks @HyukjinKwon @bart-samwel May I ask you to review this PR?
tomvanbussel left a comment:
Big thank you for doing this, Max! LGTM.
    .stringConf
    .transform(_.toUpperCase(Locale.ROOT))
    .checkValues(LegacyBehaviorPolicy.values.map(_.toString))
    .createWithDefault(LegacyBehaviorPolicy.LEGACY.toString)
Could the default be made `LegacyBehaviorPolicy.EXCEPTION` instead? Could also do this in a follow-up PR if this is controversial.
I am not sure that we can do that in the minor release 3.1. This can break existing apps. @cloud-fan @HyukjinKwon WDYT?
It's a breaking change, but it's probably acceptable as it doesn't silently change results. We need an item in the migration guide though.
Here is the PR #30121. We'll see how many tests this will affect.
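The snippet under discussion normalizes the raw string, validates it against the `LegacyBehaviorPolicy` values, and falls back to a default when the config is unset. A minimal sketch of that `transform`/`checkValues`/`createWithDefault` chain, in Python rather than Spark's actual `ConfigBuilder` API (all names here are illustrative):

```python
# Simplified model of Spark's ConfigBuilder chain:
#   .stringConf.transform(_.toUpperCase(Locale.ROOT))
#              .checkValues(...).createWithDefault(...)
# Illustrative names only; not the real Spark API.
from typing import Optional

VALID_MODES = {"EXCEPTION", "LEGACY", "CORRECTED"}  # LegacyBehaviorPolicy values
DEFAULT_MODE = "LEGACY"                             # the default under discussion

def resolve_rebase_mode(raw: Optional[str]) -> str:
    """Return the validated rebase mode, applying the default when unset."""
    if raw is None:
        return DEFAULT_MODE                # createWithDefault(...)
    mode = raw.upper()                     # transform(_.toUpperCase(Locale.ROOT))
    if mode not in VALID_MODES:            # checkValues(...)
        raise ValueError(
            f"invalid value {raw!r}; must be one of {sorted(VALID_MODES)}")
    return mode

print(resolve_rebase_mode(None))         # LEGACY
print(resolve_rebase_mode("corrected"))  # CORRECTED
```

Changing the default to `EXCEPTION`, as proposed, would only change `DEFAULT_MODE` in this model; the validation chain stays the same.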
Merged to master.
### What changes were proposed in this pull request?
1. Turn off/on the SQL config `spark.sql.legacy.parquet.int96RebaseModeInWrite`, which was added by #30056, in `DateTimeRebaseBenchmark`. The parquet readers should infer the correct rebasing mode automatically from metadata.
2. Regenerate benchmark results of `DateTimeRebaseBenchmark` in the environment:

| Item | Description |
| ---- | ---- |
| Region | us-west-2 (Oregon) |
| Instance | r3.xlarge (spot instance) |
| AMI | ami-06f2f779464715dc5 (ubuntu/images/hvm-ssd/ubuntu-bionic-18.04-amd64-server-20190722.1) |
| Java | OpenJDK8/11 installed by `sudo add-apt-repository ppa:openjdk-r/ppa` & `sudo apt install openjdk-11-jdk` |

### Why are the changes needed?
To have up-to-date info about INT96 performance, which is the default type for Catalyst's timestamp type.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
By updating benchmark results:
```
$ SPARK_GENERATE_BENCHMARK_FILES=1 build/sbt "sql/test:runMain org.apache.spark.sql.execution.benchmark.DateTimeRebaseBenchmark"
```
Closes #30118 from MaxGekk/int96-rebase-benchmark.
Authored-by: Max Gekk <[email protected]>
Signed-off-by: HyukjinKwon <[email protected]>
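The "infer the correct rebasing mode automatically from metadata" behavior can be sketched as follows. This is a simplified Python model of the decision described in this PR, not Spark's actual reader code; the function name and the boolean return are illustrative (in Spark the `EXCEPTION` mode raises instead of rebasing):

```python
# Simplified model of choosing INT96 rebase behavior from parquet file
# metadata, per this PR's description. Illustrative names only.
from typing import Mapping

SPARK_VERSION_TAG = "org.apache.spark.version"
INT96_NO_REBASE_TAG = "org.apache.spark.int96NoRebase"

def int96_rebase_needed(metadata: Mapping[str, str], mode_in_read: str) -> bool:
    """Decide whether INT96 values need Julian -> Proleptic Gregorian rebasing."""
    if INT96_NO_REBASE_TAG in metadata:
        # Written by Spark >= 3.1 with int96RebaseModeInWrite=CORRECTED, or by
        # another system that set the tag: already Proleptic Gregorian.
        return False
    if SPARK_VERSION_TAG in metadata:
        # Written by Spark without the tag: INT96 was rebased on write.
        return True
    # Unknown writer: fall back to spark.sql.legacy.parquet.int96RebaseModeInRead.
    return mode_in_read == "LEGACY"

print(int96_rebase_needed({INT96_NO_REBASE_TAG: ""}, "LEGACY"))        # False
print(int96_rebase_needed({SPARK_VERSION_TAG: "3.0.1"}, "CORRECTED"))  # True
print(int96_rebase_needed({}, "LEGACY"))                               # True
```

This is why the benchmark no longer needs to set the read-side config explicitly: the per-file tags carry enough information in the common cases.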
### What changes were proposed in this pull request?
- Added the SQL config `spark.sql.legacy.parquet.int96RebaseModeInWrite` to control rebasing of timestamps while saving them as INT96. It supports the same set of values as `spark.sql.legacy.parquet.datetimeRebaseModeInWrite`, but the default value is `LEGACY` to preserve backward compatibility with Spark <= 3.0.
- Write the metadata key `org.apache.spark.int96NoRebase` to parquet files if the files are saved with `spark.sql.legacy.parquet.int96RebaseModeInWrite` set to a value other than `LEGACY`.
- Added the SQL config `spark.sql.legacy.parquet.int96RebaseModeInRead` to control loading of INT96 timestamps when parquet metadata doesn't have enough info (the `org.apache.spark.int96NoRebase` tag) about the parquet writer, i.e. whether INT96 was written by a Proleptic Gregorian system or a Julian one.
- No rebasing is performed in tests when `spark.test.forceNoRebase` is set to `true`.
- No rebasing is performed when parquet metadata contains the tag `org.apache.spark.int96NoRebase`. This is the case when parquet files are saved by Spark >= 3.1 with `spark.sql.legacy.parquet.int96RebaseModeInWrite` set to `CORRECTED`, or saved by other systems with the tag `org.apache.spark.int96NoRebase`.
- Rebasing is performed for parquet files saved by Spark without the metadata tag `org.apache.spark.int96NoRebase`.
- Rebasing depends on the SQL config `spark.sql.legacy.parquet.int96RebaseModeInRead` if there are no metadata tags `org.apache.spark.version` and `org.apache.spark.int96NoRebase`.

New SQL configs are added instead of re-using the existing `spark.sql.legacy.parquet.datetimeRebaseModeInWrite` and `spark.sql.legacy.parquet.datetimeRebaseModeInRead`, in particular because `spark.sql.legacy.parquet.datetimeRebaseModeInWrite/Read` are set to `EXCEPTION` by default.

### Why are the changes needed?
### Does this PR introduce _any_ user-facing change?
It can when `spark.sql.legacy.parquet.int96RebaseModeInWrite` is set to a value other than the default `LEGACY`.

### How was this patch tested?
By tests in `ParquetIOSuite`, including checks of the metadata key `org.apache.spark.int96NoRebase`.
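To illustrate why rebasing matters for INT96, which stores a Julian Day Number plus nanoseconds of day: the same ancient calendar date maps to different day numbers under the proleptic Gregorian and the hybrid Julian calendars, so a reader must know which calendar the writer used. A self-contained sketch using the standard Julian Day Number formulas (not Spark's implementation):

```python
# INT96 stores (julian_day_number, nanos_of_day). Which JDN a date like
# 1000-01-01 maps to depends on the writer's calendar, which is exactly
# what the rebase configs and metadata tags arbitrate.
# Standard JDN conversion formulas; not Spark code.

def jdn_gregorian(year: int, month: int, day: int) -> int:
    """Julian Day Number of a proleptic Gregorian calendar date."""
    a = (14 - month) // 12
    y = year + 4800 - a
    m = month + 12 * a - 3
    return (day + (153 * m + 2) // 5 + 365 * y
            + y // 4 - y // 100 + y // 400 - 32045)

def jdn_julian(year: int, month: int, day: int) -> int:
    """Julian Day Number of a Julian calendar date."""
    a = (14 - month) // 12
    y = year + 4800 - a
    m = month + 12 * a - 3
    return day + (153 * m + 2) // 5 + 365 * y + y // 4 - 32083

# The 1582 cutover anchor: Julian 1582-10-05 is the same day
# as Gregorian 1582-10-15.
print(jdn_julian(1582, 10, 5), jdn_gregorian(1582, 10, 15))  # 2299161 2299161

# For year 1000 the calendars disagree by 5 days, so loading INT96
# written by a Julian-based system without rebasing shifts timestamps.
print(jdn_julian(1000, 1, 1) - jdn_gregorian(1000, 1, 1))    # 5
```

This day-number gap is the "silent result changing" the reviewers discuss above: with the wrong mode, pre-Gregorian timestamps come back shifted by several days.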