[SPARK-33210][SQL] Set the rebasing mode for parquet INT96 type to `EXCEPTION` by default #30121

MaxGekk · 2020-10-21T17:16:06Z

What changes were proposed in this pull request?

Set the default value of the SQL configs `spark.sql.legacy.parquet.int96RebaseModeInWrite` and `spark.sql.legacy.parquet.int96RebaseModeInRead` to `EXCEPTION`.
Update the SQL migration guide accordingly.

Why are the changes needed?

The current default value, `LEGACY`, may silently shift timestamps on read or on write. The decision about rebasing should be left to users.

Does this PR introduce any user-facing change?

Yes

How was this patch tested?

By existing test suites like ParquetIOSuite.

SparkQA · 2020-10-21T17:59:47Z

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/34708/

SparkQA · 2020-10-21T18:28:33Z

Kubernetes integration test status success
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/34708/

SparkQA · 2020-10-21T18:35:50Z

Test build #130099 has finished for PR 30121 at commit 6c4be00.

This patch fails Spark unit tests.
This patch merges cleanly.
This patch adds no public classes.

MaxGekk · 2020-10-21T20:08:48Z

@HyukjinKwon @cloud-fan @tomvanbussel @ala Please review this PR.

MaxGekk · 2020-10-21T20:10:38Z

sql/hive/src/test/scala/org/apache/spark/sql/hive/StatisticsSuite.scala

          // analyze table
          sql(s"ANALYZE TABLE $tblName COMPUTE STATISTICS NOSCAN")
          var tableStats = getTableStats(tblName)
-          assert(tableStats.sizeInBytes == 601)


The size of the parquet files increased because we now write the metadata key `org.apache.spark.int96NoRebase`.

MaxGekk · 2020-10-21T20:11:22Z

sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetIOSuite.scala

    }
    Seq(
-      "2_4_5" -> successInRead _,
+      "2_4_5" -> failInRead _,


The file carries no info about its writer, so we take the mode from the SQL config and fail by default.

SparkQA · 2020-10-21T20:49:29Z

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/34714/

SparkQA · 2020-10-21T21:19:00Z

Kubernetes integration test status success
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/34714/

SparkQA · 2020-10-22T00:38:35Z

Test build #130105 has finished for PR 30121 at commit 33bb5c2.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

cloud-fan · 2020-10-22T03:04:27Z

thanks, merging to master!

…ache.spark.int96NoRebase` by `org.apache.spark.legacyINT96`

### What changes were proposed in this pull request?
1. Replace the metadata key `org.apache.spark.int96NoRebase` with `org.apache.spark.legacyINT96`.
2. Change the condition under which the new key is saved to parquet metadata: it is saved when the SQL config `spark.sql.legacy.parquet.int96RebaseModeInWrite` is set to `LEGACY`.
3. Change how the metadata key is handled on read:
    - If the key is not in the parquet metadata, take the rebase mode from the SQL config `spark.sql.legacy.parquet.int96RebaseModeInRead`.
    - If the parquet files were saved by Spark < 3.1.0, use the `LEGACY` rebasing mode for the INT96 type.
    - For files written by Spark >= 3.1.0, perform rebasing if `org.apache.spark.legacyINT96` is present in the metadata; otherwise don't.

### Why are the changes needed?
- To not increase parquet size by default when `spark.sql.legacy.parquet.int96RebaseModeInWrite` is `EXCEPTION` after #30121.
- To keep the implementation similar to `org.apache.spark.legacyDateTime`.
- To minimise the impact on other subsystems that depend on file sizes, such as statistics gathering.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
Modified test in `ParquetIOSuite`

Closes #30132 from MaxGekk/int96-flip-metadata-rebase-key.

Authored-by: Max Gekk <[email protected]>
Signed-off-by: Wenchen Fan <[email protected]>
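The read and write rules in the commit message above can be sketched in plain Python. This is a minimal model under stated assumptions, not Spark's implementation: the function names are hypothetical, `org.apache.spark.version` is assumed to be the metadata key that records the writer's Spark version, the empty string stored under `org.apache.spark.legacyINT96` is an assumed placeholder value, and `CORRECTED` is used as the name of the no-rebasing mode.

```python
import re
from typing import Dict, Mapping, Tuple

# Key named in the commit message.
LEGACY_INT96_KEY = "org.apache.spark.legacyINT96"
# Assumption: metadata key under which the writer's Spark version is stored.
WRITER_VERSION_KEY = "org.apache.spark.version"


def parse_version(version: str) -> Tuple[int, int, int]:
    """Turn '3.1.0' or '3.1.0-SNAPSHOT' into (3, 1, 0), padding with zeros."""
    parts = [int(p) for p in re.findall(r"\d+", version)]
    major, minor, patch = (parts + [0, 0, 0])[:3]
    return (major, minor, patch)


def int96_rebase_mode_in_read(
    file_metadata: Mapping[str, str],
    config_mode: str = "EXCEPTION",  # default after this PR
) -> str:
    """Hypothetical helper: pick the INT96 rebase mode for one parquet file."""
    writer_version = file_metadata.get(WRITER_VERSION_KEY)
    if writer_version is None:
        # No info about the writer: fall back to the SQL config.
        return config_mode
    if parse_version(writer_version) < (3, 1, 0):
        # Files saved by Spark < 3.1.0 are always read in LEGACY mode.
        return "LEGACY"
    # Spark >= 3.1.0: rebase only if the legacy marker is present.
    return "LEGACY" if LEGACY_INT96_KEY in file_metadata else "CORRECTED"


def int96_metadata_for_write(write_mode: str, spark_version: str) -> Dict[str, str]:
    """Hypothetical helper: metadata a writer would attach for a given mode."""
    metadata = {WRITER_VERSION_KEY: spark_version}
    if write_mode == "LEGACY":
        # The marker is written only in LEGACY mode, so files written
        # under the default EXCEPTION mode do not grow.
        metadata[LEGACY_INT96_KEY] = ""  # assumed placeholder value
    return metadata
```

In this model, a file without writer info resolves to whatever the read config says, which is why the `2_4_5` case in `ParquetIOSuite` now fails on read by default.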

MaxGekk added 2 commits October 21, 2020 19:43

Set configs to EXCEPTION by default

6877ac3

Update the SQL migration guide.

6c4be00

MaxGekk mentioned this pull request Oct 21, 2020

[SPARK-33160][SQL] Allow saving/loading INT96 in parquet w/o rebasing #30056

Closed

MaxGekk added 4 commits October 21, 2020 21:46

Fix ParquetHadoopFsRelationSuite

5a6b92b

Fix StatisticsSuite

4bfb96b

Fix ParquetIOSuite

f0c4ef1

Fix ParquetFilterSuite

33bb5c2

MaxGekk changed the title ~~[WIP][SQL] Set the rebasing mode for parquet INT96 type to EXCEPTION by default~~ [SPARK-33210][SQL] Set the rebasing mode for parquet INT96 type to EXCEPTION by default Oct 21, 2020

MaxGekk commented Oct 21, 2020

cloud-fan approved these changes Oct 22, 2020

cloud-fan closed this in ba13b94 Oct 22, 2020

MaxGekk mentioned this pull request Oct 22, 2020

[SPARK-33160][SQL][FOLLOWUP] Replace the parquet metadata key org.apache.spark.int96NoRebase by org.apache.spark.legacyINT96 #30132

Closed

MaxGekk deleted the int96-exception-by-default branch December 11, 2020 20:28
