Skip to content

Conversation

@MaxGekk
Copy link
Member

@MaxGekk MaxGekk commented Oct 21, 2020

What changes were proposed in this pull request?

  1. Set the default value for the SQL configs spark.sql.legacy.parquet.int96RebaseModeInWrite and spark.sql.legacy.parquet.int96RebaseModeInRead to EXCEPTION.
  2. Update the SQL migration guide.

Why are the changes needed?

Current default value LEGACY may lead to shifting timestamps in read or in write. We should leave the decision about rebasing to users.

Does this PR introduce any user-facing change?

Yes

How was this patch tested?

By existing test suites like ParquetIOSuite.

@SparkQA
Copy link

SparkQA commented Oct 21, 2020

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/34708/

@SparkQA
Copy link

SparkQA commented Oct 21, 2020

Kubernetes integration test status success
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/34708/

@SparkQA
Copy link

SparkQA commented Oct 21, 2020

Test build #130099 has finished for PR 30121 at commit 6c4be00.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@MaxGekk MaxGekk changed the title [WIP][SQL] Set the rebasing mode for parquet INT96 type to EXCEPTION by default [SPARK-33210][SQL] Set the rebasing mode for parquet INT96 type to EXCEPTION by default Oct 21, 2020
@MaxGekk
Copy link
Member Author

MaxGekk commented Oct 21, 2020

@HyukjinKwon @cloud-fan @tomvanbussel @ala Please, review this PR.

// analyze table
sql(s"ANALYZE TABLE $tblName COMPUTE STATISTICS NOSCAN")
var tableStats = getTableStats(tblName)
assert(tableStats.sizeInBytes == 601)
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The size of the parquet files increased because we write metadata key org.apache.spark.int96NoRebase.

}
Seq(
"2_4_5" -> successInRead _,
"2_4_5" -> failInRead _,
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No info about the writer. We take the mode from the SQL config and fail by default.

@SparkQA
Copy link

SparkQA commented Oct 21, 2020

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/34714/

@SparkQA
Copy link

SparkQA commented Oct 21, 2020

Kubernetes integration test status success
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/34714/

@SparkQA
Copy link

SparkQA commented Oct 22, 2020

Test build #130105 has finished for PR 30121 at commit 33bb5c2.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@cloud-fan
Copy link
Contributor

thanks, merging to master!

@cloud-fan cloud-fan closed this in ba13b94 Oct 22, 2020
cloud-fan pushed a commit that referenced this pull request Oct 22, 2020
…ache.spark.int96NoRebase` by `org.apache.spark.legacyINT96`

### What changes were proposed in this pull request?
1. Replace the metadata key `org.apache.spark.int96NoRebase` by `org.apache.spark.legacyINT96`.
2. Change the condition when new key should be saved to parquet metadata: it should be saved when the SQL config `spark.sql.legacy.parquet.int96RebaseModeInWrite` is set to `LEGACY`.
3. Change handling the metadata key in read:
    - If there is no the key in parquet metadata, take the rebase mode from the SQL config: `spark.sql.legacy.parquet.int96RebaseModeInRead`
    - If parquet files were saved by Spark < 3.1.0, use the `LEGACY` rebasing mode for INT96 type.
    - For files written by Spark >= 3.1.0, if the `org.apache.spark.legacyINT96` presents in metadata, perform rebasing otherwise don't.

### Why are the changes needed?
- To not increase parquet size by default when `spark.sql.legacy.parquet.int96RebaseModeInWrite` is `EXCEPTION` after #30121.
- To have the implementation similar to `org.apache.spark.legacyDateTime`
- To minimise impact on other subsystems that are based on file sizes like gathering statistics.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
Modified test in `ParquetIOSuite`

Closes #30132 from MaxGekk/int96-flip-metadata-rebase-key.

Authored-by: Max Gekk <[email protected]>
Signed-off-by: Wenchen Fan <[email protected]>
a0x8o added a commit to a0x8o/spark that referenced this pull request Oct 22, 2020
…ache.spark.int96NoRebase` by `org.apache.spark.legacyINT96`

### What changes were proposed in this pull request?
1. Replace the metadata key `org.apache.spark.int96NoRebase` by `org.apache.spark.legacyINT96`.
2. Change the condition when new key should be saved to parquet metadata: it should be saved when the SQL config `spark.sql.legacy.parquet.int96RebaseModeInWrite` is set to `LEGACY`.
3. Change handling the metadata key in read:
    - If there is no the key in parquet metadata, take the rebase mode from the SQL config: `spark.sql.legacy.parquet.int96RebaseModeInRead`
    - If parquet files were saved by Spark < 3.1.0, use the `LEGACY` rebasing mode for INT96 type.
    - For files written by Spark >= 3.1.0, if the `org.apache.spark.legacyINT96` presents in metadata, perform rebasing otherwise don't.

### Why are the changes needed?
- To not increase parquet size by default when `spark.sql.legacy.parquet.int96RebaseModeInWrite` is `EXCEPTION` after apache/spark#30121.
- To have the implementation similar to `org.apache.spark.legacyDateTime`
- To minimise impact on other subsystems that are based on file sizes like gathering statistics.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
Modified test in `ParquetIOSuite`

Closes #30132 from MaxGekk/int96-flip-metadata-rebase-key.

Authored-by: Max Gekk <[email protected]>
Signed-off-by: Wenchen Fan <[email protected]>
@MaxGekk MaxGekk deleted the int96-exception-by-default branch December 11, 2020 20:28
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants