
Conversation

@MaxGekk (Member) commented Feb 5, 2021

What changes were proposed in this pull request?

In the PR, I propose two new options for the Parquet datasource:

  1. datetimeRebaseMode
  2. int96RebaseMode

Both options control how ancient date and timestamp column values are loaded from parquet files. The datetimeRebaseMode option applies to values of the DATE, TIMESTAMP_MICROS and TIMESTAMP_MILLIS types, while int96RebaseMode applies to INT96 timestamps.

The options accept the same values as the SQL configs spark.sql.legacy.parquet.datetimeRebaseModeInRead and spark.sql.legacy.parquet.int96RebaseModeInRead, namely:

  • "LEGACY", when an option is set to this value, Spark rebases dates/timestamps from the legacy hybrid calendar (Julian + Gregorian) to the Proleptic Gregorian calendar.
  • "CORRECTED", dates/timestamps are read AS IS from parquet files.
  • "EXCEPTION", when it is set as an option value, Spark will fail the reading if it sees ancient dates/timestamps that are ambiguous between the two calendars.

Why are the changes needed?

  1. The new options make it possible to load parquet files from two or more sources with different rebasing modes in the same query. For instance:
```scala
val df1 = spark.read.option("datetimeRebaseMode", "legacy").parquet(folder1)
val df2 = spark.read.option("datetimeRebaseMode", "corrected").parquet(folder2)
df1.join(df2, ...)
```

Before the changes, this was impossible because the SQL config spark.sql.legacy.parquet.datetimeRebaseModeInRead affects both reads.

  2. Mixing the Dataset/DataFrame and RDD APIs becomes possible. Since SQL configs are not propagated through RDDs, the following code fails on ancient timestamps:
```scala
spark.conf.set("spark.sql.legacy.parquet.datetimeRebaseModeInRead", "legacy")
spark.read.parquet(folder).distinct.rdd.collect()
```
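With the new option, the rebase mode is bound to the read itself rather than to the session, so a sketch like the following (same hypothetical folder) should work even after switching to the RDD API:

```scala
// The option travels with the scan, so no session-local SQL config is needed.
spark.read.option("datetimeRebaseMode", "legacy").parquet(folder).distinct.rdd.collect()
```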

Does this PR introduce any user-facing change?

No.

How was this patch tested?

By running the modified test suites:

```
$ build/sbt "sql/test:testOnly *ParquetRebaseDatetimeV1Suite"
$ build/sbt "sql/test:testOnly *ParquetRebaseDatetimeV2Suite"
```

```scala
// The file metadata indicates if it needs rebase or not, so we can always get the
// correct result regardless of the "rebase mode" config.
Seq(LEGACY, CORRECTED, EXCEPTION).foreach { mode =>
  withSQLConf(SQLConf.LEGACY_AVRO_REBASE_MODE_IN_READ.key -> mode.toString) {
```
@MaxGekk (Member Author) commented:

Fixed the wrong SQL conf: LEGACY_AVRO_REBASE_MODE_IN_READ

@github-actions github-actions bot added the SQL label Feb 5, 2021
@MaxGekk MaxGekk changed the title [WIP][SQL] Support parquet datasource options to control datetime rebasing in read [SPARK-34377][SQL] Support parquet datasource options to control datetime rebasing in read Feb 5, 2021
@MaxGekk (Member Author) commented Feb 5, 2021

@cloud-fan @tomvanbussel @mswit-databricks Could you review this PR, please?

@MaxGekk MaxGekk changed the title [SPARK-34377][SQL] Support parquet datasource options to control datetime rebasing in read [SPARK-34377][SQL] Add new parquet datasource options to control datetime rebasing in read Feb 5, 2021
@SparkQA commented Feb 5, 2021

Test build #134934 has finished for PR 31489 at commit f898288.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA commented Feb 5, 2021

Test build #134940 has started for PR 31489 at commit eb77fc6.

@SparkQA commented Feb 5, 2021

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/39523/

@SparkQA commented Feb 5, 2021

Kubernetes integration test status success
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/39523/

@SparkQA commented Feb 6, 2021

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/39534/

@SparkQA commented Feb 6, 2021

Kubernetes integration test status success
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/39534/

@SparkQA commented Feb 6, 2021

Test build #134951 has finished for PR 31489 at commit f33b5a8.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

# Conflicts:
#	sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetRebaseDatetimeSuite.scala
@SparkQA commented Feb 7, 2021

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/39569/

@SparkQA commented Feb 7, 2021

Kubernetes integration test status failure
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/39569/

@SparkQA commented Feb 7, 2021

Test build #134986 has finished for PR 31489 at commit ebc7298.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

```diff
       partitionSchema: StructType,
-      filters: Array[Filter]) extends FilePartitionReaderFactory with Logging {
+      filters: Array[Filter],
+      parquetOptions: ParquetOptions) extends FilePartitionReaderFactory with Logging {
```
Contributor commented:

Does it really work? The 2 fields of ParquetOptions are transient, and become null after (de)serialization.

Contributor commented:

ah nvm, we read the confs and put it in val.
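A minimal sketch of the pattern mentioned in the reply above (the class name and default value here are illustrative, not the actual PR code): the transient constructor argument is dropped during serialization, but the value derived from it on the driver is stored in a val that ships to executors intact.

```scala
import org.apache.spark.sql.util.CaseInsensitiveStringMap

class ExampleReaderFactory(@transient options: CaseInsensitiveStringMap)
  extends Serializable {
  // Evaluated eagerly on the driver; this String survives serialization
  // even though `options` itself becomes null after deserialization.
  val datetimeRebaseMode: String =
    options.getOrDefault("datetimeRebaseMode", "EXCEPTION")
}
```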

@cloud-fan (Contributor) commented:

thanks, merging to master!

@cloud-fan cloud-fan closed this in a854906 Feb 8, 2021
dongjoon-hyun pushed a commit that referenced this pull request Feb 15, 2021
### What changes were proposed in this pull request?
Mention the DS options introduced by #31529 and by #31489 in `SparkUpgradeException`.

### Why are the changes needed?
To improve the user experience with Spark SQL. Before the changes, the error message recommended setting SQL configs, but the configs cannot help in some situations (see the PRs for more details).

### Does this PR introduce _any_ user-facing change?
Yes. After the changes, the error message is:

_org.apache.spark.SparkUpgradeException: You may get a different result due to the upgrading of Spark 3.0: reading dates before 1582-10-15 or timestamps before 1900-01-01T00:00:00Z from Parquet files can be ambiguous, as the files may be written by Spark 2.x or legacy versions of Hive, which uses a legacy hybrid calendar that is different from Spark 3.0+'s Proleptic Gregorian calendar. See more details in SPARK-31404. You can set the SQL config 'spark.sql.legacy.parquet.datetimeRebaseModeInRead' or the datasource option 'datetimeRebaseMode' to 'LEGACY' to rebase the datetime values w.r.t. the calendar difference during reading. To read the datetime values as it is, set the SQL config 'spark.sql.legacy.parquet.datetimeRebaseModeInRead' or the datasource option 'datetimeRebaseMode' to 'CORRECTED'._
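For illustration, a minimal sketch of following the message's advice through the per-read datasource option instead of the session-wide SQL config (the path is hypothetical):

```scala
val df = spark.read
  .option("datetimeRebaseMode", "CORRECTED") // read ancient values as is
  .parquet("/path/to/legacy_parquet")
```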

### How was this patch tested?
1. By checking coding style: `./dev/scalastyle`
2. By running the related test suite:
```
$ build/sbt -Phive-2.3 -Phive-thriftserver "test:testOnly *ParquetRebaseDatetimeV1Suite"
```

Closes #31562 from MaxGekk/rebase-upgrade-exception.

Authored-by: Max Gekk <[email protected]>
Signed-off-by: Dongjoon Hyun <[email protected]>
HyukjinKwon pushed a commit that referenced this pull request Feb 18, 2021
… options and SQL configs

### What changes were proposed in this pull request?
In the PR, I propose to update the Spark SQL guide about the SQL configs that are related to datetime rebasing:
- spark.sql.parquet.int96RebaseModeInWrite
- spark.sql.parquet.datetimeRebaseModeInWrite
- spark.sql.parquet.int96RebaseModeInRead
- spark.sql.parquet.datetimeRebaseModeInRead
- spark.sql.avro.datetimeRebaseModeInWrite
- spark.sql.avro.datetimeRebaseModeInRead

Parquet options added by #31489:
- datetimeRebaseMode
- int96RebaseMode

and Avro options added by #31529:
- datetimeRebaseMode

<img width="998" alt="Screenshot 2021-02-17 at 21 42 09" src="https://user-images.githubusercontent.com/1580697/108252043-3afb8900-7169-11eb-8568-511e21fa7f78.png">

### Why are the changes needed?
To inform users about supported DS options and SQL configs.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
By generating the doc and manually checking:
```
$ SKIP_API=1 SKIP_SCALADOC=1 SKIP_PYTHONDOC=1 SKIP_RDOC=1 jekyll serve --watch
```

Closes #31564 from MaxGekk/doc-rebase-options.

Authored-by: Max Gekk <[email protected]>
Signed-off-by: HyukjinKwon <[email protected]>
domybest11 pushed a commit to domybest11/spark that referenced this pull request Jun 15, 2022
[SPARK-34377][SQL] Add new parquet datasource options to control datetime rebasing in read

Closes apache#31489 from MaxGekk/parquet-rebase-options.

Authored-by: Max Gekk <[email protected]>
Signed-off-by: Wenchen Fan <[email protected]>

(cherry picked from commit a854906)