[SPARK-34404][SQL] Add new Avro datasource options to control datetime rebasing in read #31529

MaxGekk · 2021-02-08T20:17:39Z

What changes were proposed in this pull request?

In the PR, I propose a new option, datetimeRebaseMode, for the Avro datasource. The option controls how ancient date and timestamp column values are loaded from Avro files.

The option supports the same values as the SQL config spark.sql.legacy.avro.datetimeRebaseModeInRead, namely:

"LEGACY": when the option is set to this value, Spark rebases dates/timestamps from the legacy hybrid calendar (Julian + Gregorian) to the Proleptic Gregorian calendar.
"CORRECTED": dates/timestamps are read as is from Avro files.
"EXCEPTION": when the option is set to this value, Spark fails the read if it sees ancient dates/timestamps that are ambiguous between the two calendars.

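To illustrate why ancient values are ambiguous between the two calendars, here is a minimal sketch that uses only the JDK and does not touch Spark. The object name RebaseSketch and its helper methods are hypothetical and are not part of this PR. It assumes java.util.GregorianCalendar with its default cutover as a stand-in for the legacy hybrid calendar and java.time.LocalDate for the Proleptic Gregorian calendar.

```scala
import java.time.LocalDate
import java.util.{GregorianCalendar, TimeZone}

// Hypothetical helper for illustration only; not Spark's rebasing code.
object RebaseSketch {
  private val MillisPerDay = 24L * 60 * 60 * 1000

  // Days since 1970-01-01 when (year, month, day) is interpreted in the
  // hybrid calendar: Julian before 1582-10-15, Gregorian afterwards.
  def hybridEpochDay(year: Int, month: Int, day: Int): Long = {
    val cal = new GregorianCalendar(TimeZone.getTimeZone("UTC"))
    cal.clear()                   // reset all fields, so time of day is midnight
    cal.set(year, month - 1, day) // Calendar months are zero-based
    Math.floorDiv(cal.getTimeInMillis, MillisPerDay)
  }

  // Days since 1970-01-01 in the Proleptic Gregorian calendar.
  def prolepticEpochDay(year: Int, month: Int, day: Int): Long =
    LocalDate.of(year, month, day).toEpochDay

  def main(args: Array[String]): Unit = {
    // A modern date maps to the same day number in both calendars.
    println(hybridEpochDay(2000, 1, 1) == prolepticEpochDay(2000, 1, 1))
    // An ancient date maps to different day numbers, hence the ambiguity.
    println(hybridEpochDay(1000, 1, 1) - prolepticEpochDay(1000, 1, 1))
  }
}
```

The same stored day number therefore denotes different calendar dates depending on which calendar the writer used, which is the ambiguity that the rebase mode resolves.
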
Why are the changes needed?

The new option allows loading Avro files from at least two sources in different rebasing modes in the same query. For instance:

val df1 = spark.read.option("datetimeRebaseMode", "legacy").format("avro").load(folder1)
val df2 = spark.read.option("datetimeRebaseMode", "corrected").format("avro").load(folder2)
df1.join(df2, ...)

Before the changes, this was impossible because the SQL config spark.sql.legacy.avro.datetimeRebaseModeInRead affects both reads.

The option should also make it possible to mix the Dataset/DataFrame and RDD APIs. Since SQL configs are not propagated through RDDs, the following code fails on ancient timestamps:

spark.conf.set("spark.sql.legacy.avro.datetimeRebaseModeInRead", "legacy")
spark.read.format("avro").load(folder).distinct.rdd.collect()
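
With the new option, the same read could be written so that the rebase mode travels with the read itself instead of the session config. This is a sketch under stated assumptions: the object AvroRebaseOptionSketch and the method collectDistinct are hypothetical names, and the code assumes a SparkSession with the spark-avro module on the classpath.

```scala
import org.apache.spark.sql.{Row, SparkSession}

// Hypothetical wrapper for illustration; requires Spark with spark-avro available.
object AvroRebaseOptionSketch {
  // Reads Avro files from `folder` in LEGACY rebase mode via the datasource
  // option, then collects the distinct rows through the RDD API.
  def collectDistinct(spark: SparkSession, folder: String): Array[Row] =
    spark.read
      .format("avro")
      .option("datetimeRebaseMode", "legacy") // per-read option, not a SQL config
      .load(folder)
      .distinct()
      .rdd
      .collect()
}
```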

Does this PR introduce any user-facing change?

No.

How was this patch tested?

By running the modified test suites:

$ build/sbt "test:testOnly *AvroV1Suite"
$ build/sbt "test:testOnly *AvroV2Suite"

MaxGekk · 2021-02-08T20:22:57Z

@cloud-fan @gengliangwang @HyukjinKwon Could you review this PR, please.

SparkQA · 2021-02-08T20:48:03Z

Test build #135045 has finished for PR 31529 at commit 79a8a59.

This patch fails Spark unit tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2021-02-08T21:20:32Z

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/39628/

SparkQA · 2021-02-08T22:37:46Z

Test build #135047 has finished for PR 31529 at commit 1e3e524.

This patch fails PySpark unit tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2021-02-08T22:52:51Z

Kubernetes integration test status failure
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/39628/

MaxGekk · 2021-02-09T00:15:26Z

jenkins, retest this, please

SparkQA · 2021-02-09T00:53:44Z

Test build #135051 has finished for PR 31529 at commit 1e3e524.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2021-02-09T01:04:10Z

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/39634/

SparkQA · 2021-02-09T01:31:21Z

Kubernetes integration test status success
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/39634/

SparkQA · 2021-02-09T20:03:59Z

Test build #135080 has finished for PR 31529 at commit c930fd9.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2021-02-09T20:20:43Z

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/39662/

SparkQA · 2021-02-09T20:48:59Z

Kubernetes integration test status success
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/39662/

MaxGekk · 2021-02-10T05:47:20Z

@cloud-fan @gengliangwang @HyukjinKwon This PR is a companion to #31489, and should solve the same issues. Any concerns about it?

cloud-fan · 2021-02-10T06:23:07Z

thanks, merging to master!

### What changes were proposed in this pull request?
Mention the DS options introduced by #31529 and by #31489 in `SparkUpgradeException`.

### Why are the changes needed?
To improve user experience with Spark SQL. Before the changes, the error message recommends setting SQL configs, but the configs cannot help in some situations (see the PRs for more details).

### Does this PR introduce _any_ user-facing change?
Yes. After the changes, the error message is:

_org.apache.spark.SparkUpgradeException: You may get a different result due to the upgrading of Spark 3.0: reading dates before 1582-10-15 or timestamps before 1900-01-01T00:00:00Z from Parquet files can be ambiguous, as the files may be written by Spark 2.x or legacy versions of Hive, which uses a legacy hybrid calendar that is different from Spark 3.0+'s Proleptic Gregorian calendar. See more details in SPARK-31404. You can set the SQL config 'spark.sql.legacy.parquet.datetimeRebaseModeInRead' or the datasource option 'datetimeRebaseMode' to 'LEGACY' to rebase the datetime values w.r.t. the calendar difference during reading. To read the datetime values as it is, set the SQL config 'spark.sql.legacy.parquet.datetimeRebaseModeInRead' or the datasource option 'datetimeRebaseMode' to 'CORRECTED'._

### How was this patch tested?
1. By checking coding style: `./dev/scalastyle`
2. By running the related test suite:
```
$ build/sbt -Phive-2.3 -Phive-thriftserver "test:testOnly *ParquetRebaseDatetimeV1Suite"
```

Closes #31562 from MaxGekk/rebase-upgrade-exception.

Authored-by: Max Gekk <[email protected]>
Signed-off-by: Dongjoon Hyun <[email protected]>

… options and SQL configs

### What changes were proposed in this pull request?
In the PR, I propose to update the Spark SQL guide about the SQL configs that are related to datetime rebasing:
- spark.sql.parquet.int96RebaseModeInWrite
- spark.sql.parquet.datetimeRebaseModeInWrite
- spark.sql.parquet.int96RebaseModeInRead
- spark.sql.parquet.datetimeRebaseModeInRead
- spark.sql.avro.datetimeRebaseModeInWrite
- spark.sql.avro.datetimeRebaseModeInRead

Parquet options added by #31489:
- datetimeRebaseMode
- int96RebaseMode

and Avro options added by #31529:
- datetimeRebaseMode

<img width="998" alt="Screenshot 2021-02-17 at 21 42 09" src="https://user-images.githubusercontent.com/1580697/108252043-3afb8900-7169-11eb-8568-511e21fa7f78.png">

### Why are the changes needed?
To inform users about supported DS options and SQL configs.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
By generating the doc and manually checking:
```
$ SKIP_API=1 SKIP_SCALADOC=1 SKIP_PYTHONDOC=1 SKIP_RDOC=1 jekyll serve --watch
```

Closes #31564 from MaxGekk/doc-rebase-options.

Authored-by: Max Gekk <[email protected]>
Signed-off-by: HyukjinKwon <[email protected]>

MaxGekk added 2 commits February 8, 2021 22:51

Support the datetimeRebaseMode option

bbdf9d3

Fix imports

79a8a59

github-actions bot added AVRO SQL labels Feb 8, 2021

Fix NPE

1e3e524

Align to the current approach

c930fd9

cloud-fan approved these changes Feb 10, 2021

cloud-fan closed this in c082c53 Feb 10, 2021

This was referenced Feb 14, 2021

[SPARK-34434][SQL] Mention DS rebase options in SparkUpgradeException #31562

Closed

[SPARK-34437][SQL][DOCS] Update Spark SQL guide about the rebasing DS options and SQL configs #31564

Closed
