[SPARK-33210][SQL] Set the rebasing mode for parquet INT96 type to EXCEPTION by default
#30121
Kubernetes integration test starting
Kubernetes integration test status success
Test build #130099 has finished for PR 30121 at commit
@HyukjinKwon @cloud-fan @tomvanbussel @ala Please review this PR.
// analyze table
sql(s"ANALYZE TABLE $tblName COMPUTE STATISTICS NOSCAN")
var tableStats = getTableStats(tblName)
assert(tableStats.sizeInBytes == 601)
The size of the parquet files increased because we now write the metadata key `org.apache.spark.int96NoRebase`.
}
Seq(
  "2_4_5" -> successInRead _,
  "2_4_5" -> failInRead _,
No info about the writer. We take the mode from the SQL config and fail by default.
Kubernetes integration test starting
Kubernetes integration test status success
Test build #130105 has finished for PR 30121 at commit
thanks, merging to master!
…ache.spark.int96NoRebase` by `org.apache.spark.legacyINT96`
### What changes were proposed in this pull request?
1. Replace the metadata key `org.apache.spark.int96NoRebase` by `org.apache.spark.legacyINT96`.
2. Change the condition under which the new key is saved to parquet metadata: it is saved when the SQL config `spark.sql.legacy.parquet.int96RebaseModeInWrite` is set to `LEGACY`.
3. Change how the metadata key is handled on read:
   - If the key is not present in the parquet metadata, take the rebase mode from the SQL config `spark.sql.legacy.parquet.int96RebaseModeInRead`.
   - If the parquet files were saved by Spark < 3.1.0, use the `LEGACY` rebasing mode for the INT96 type.
   - For files written by Spark >= 3.1.0, perform rebasing if `org.apache.spark.legacyINT96` is present in the metadata; otherwise don't.
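
The write and read rules above can be sketched as two small pure functions. This is only an illustration, not the code from the PR: `Int96RebaseSketch`, `RebaseMode`, `extraWriteMetadata` and `chooseReadMode` are hypothetical names, the empty string stored under the key is an assumption, and the sketch assumes that "no key" in the first read rule means the file carries no writer version (taken here to be stored under `org.apache.spark.version`).

```scala
object Int96RebaseSketch {
  // Hypothetical stand-ins for the three rebase modes named in the PR.
  sealed trait RebaseMode
  case object LegacyMode extends RebaseMode
  case object CorrectedMode extends RebaseMode
  case object ExceptionMode extends RebaseMode

  // Assumption: the writer's Spark version is recorded under this key.
  val VersionKey = "org.apache.spark.version"
  // Key introduced by this change.
  val LegacyInt96Key = "org.apache.spark.legacyINT96"

  // True when the writer version is older than 3.1.0 (major.minor compared numerically).
  private def isBefore31(version: String): Boolean = {
    val nums = version.split("\\.").take(2).map { part =>
      val digits = part.takeWhile(_.isDigit)
      if (digits.isEmpty) 0 else digits.toInt
    }
    val major = if (nums.length > 0) nums(0) else 0
    val minor = if (nums.length > 1) nums(1) else 0
    major < 3 || (major == 3 && minor < 1)
  }

  // Write side: the key is added only when the write mode is LEGACY.
  // Assumption: the stored value is an empty string; only presence matters here.
  def extraWriteMetadata(writeMode: RebaseMode): Map[String, String] =
    if (writeMode == LegacyMode) Map(LegacyInt96Key -> "") else Map.empty

  // Read side: pick the rebase mode from file metadata, falling back to the config.
  def chooseReadMode(fileMeta: Map[String, String], modeFromConfig: RebaseMode): RebaseMode =
    fileMeta.get(VersionKey) match {
      case None => modeFromConfig                                       // no writer info
      case Some(v) if isBefore31(v) => LegacyMode                       // Spark < 3.1.0
      case Some(_) if fileMeta.contains(LegacyInt96Key) => LegacyMode   // written with LEGACY
      case Some(_) => CorrectedMode                                     // no rebasing
    }
}
```

With this shape, a file that has no writer version and is read under the `EXCEPTION` config keeps the `EXCEPTION` mode, which matches the `failInRead` expectation discussed in the review.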
### Why are the changes needed?
- To not increase parquet size by default when `spark.sql.legacy.parquet.int96RebaseModeInWrite` is `EXCEPTION` after #30121.
- To have the implementation similar to `org.apache.spark.legacyDateTime`
- To minimise impact on other subsystems that are based on file sizes like gathering statistics.
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
Modified test in `ParquetIOSuite`
Closes #30132 from MaxGekk/int96-flip-metadata-rebase-key.
Authored-by: Max Gekk <[email protected]>
Signed-off-by: Wenchen Fan <[email protected]>
### What changes were proposed in this pull request?
Set the default values of `spark.sql.legacy.parquet.int96RebaseModeInWrite` and `spark.sql.legacy.parquet.int96RebaseModeInRead` to `EXCEPTION`.
### Why are the changes needed?
The current default value, `LEGACY`, may lead to shifted timestamps on read or on write. The decision about rebasing should be left to users.
### Does this PR introduce _any_ user-facing change?
Yes
### How was this patch tested?
By existing test suites like `ParquetIOSuite`.
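
Since the new default is `EXCEPTION`, users who hit the error have to choose a mode themselves. A minimal sketch of doing so from a Spark application follows; it assumes Spark is on the classpath, the object name `Int96RebaseConfigExample` and the local master are illustrative only, and `CORRECTED` is the no-rebase value alongside `LEGACY` and `EXCEPTION`.

```scala
import org.apache.spark.sql.SparkSession

object Int96RebaseConfigExample {
  def main(args: Array[String]): Unit = {
    // Local session for illustration only.
    val spark = SparkSession.builder()
      .appName("int96-rebase-config")
      .master("local[*]")
      .getOrCreate()

    // Opt in explicitly instead of relying on the EXCEPTION default.
    // CORRECTED reads and writes INT96 values as-is; LEGACY rebases them.
    spark.conf.set("spark.sql.legacy.parquet.int96RebaseModeInRead", "CORRECTED")
    spark.conf.set("spark.sql.legacy.parquet.int96RebaseModeInWrite", "CORRECTED")

    spark.stop()
  }
}
```

Choosing `LEGACY` instead would keep the previous behaviour, including the timestamp shifts described above.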