
Conversation

@MaxGekk
Member

@MaxGekk MaxGekk commented Oct 15, 2020

What changes were proposed in this pull request?

  1. Add the SQL config spark.sql.legacy.parquet.int96RebaseModeInWrite to control rebasing of timestamps when they are saved as INT96. It supports the same set of values as spark.sql.legacy.parquet.datetimeRebaseModeInWrite, but its default value is LEGACY to preserve backward compatibility with Spark <= 3.0.
  2. Write the metadata key org.apache.spark.int96NoRebase to parquet files if the files are saved with spark.sql.legacy.parquet.int96RebaseModeInWrite set to a value other than LEGACY.
  3. Add the SQL config spark.sql.legacy.parquet.int96RebaseModeInRead to control loading of INT96 timestamps when the parquet metadata doesn't contain enough info (the org.apache.spark.int96NoRebase tag) about the writer, i.e. whether the INT96 values were written by a Proleptic Gregorian system or by a Julian-based one.
  4. Modified the vectorized and parquet-mr readers to support loading/saving INT96 timestamps without rebasing, depending on the SQL config and the metadata tag (a simplified sketch of the read-side decision follows this list):
    • No rebasing in tests when the SQL config spark.test.forceNoRebase is set to true.
    • No rebasing if the parquet metadata contains the tag org.apache.spark.int96NoRebase. This is the case when parquet files are saved by Spark >= 3.1 with spark.sql.legacy.parquet.int96RebaseModeInWrite set to CORRECTED, or by other systems that write the tag org.apache.spark.int96NoRebase.
    • Rebasing if parquet files were saved by Spark (any version) without the metadata tag org.apache.spark.int96NoRebase.
    • Rebasing depends on the SQL config spark.sql.legacy.parquet.int96RebaseModeInRead if neither of the metadata tags org.apache.spark.version and org.apache.spark.int96NoRebase is present.
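
For clarity, here is a simplified sketch of the read-side decision just described. The helper name, signature, and error message are hypothetical; this is illustrative pseudologic, not Spark's actual internal code.

```scala
// Hypothetical helper mirroring the rules above; not the actual Spark implementation.
def int96NeedsRebase(
    fileMetadata: Map[String, String],  // parquet key/value metadata of the file
    modeInRead: String): Boolean = {    // spark.sql.legacy.parquet.int96RebaseModeInRead
  if (fileMetadata.contains("org.apache.spark.int96NoRebase")) {
    // Written in Proleptic Gregorian (Spark >= 3.1 with CORRECTED, or another
    // system that sets the tag): load values as-is.
    false
  } else if (fileMetadata.contains("org.apache.spark.version")) {
    // Written by Spark without the tag: rebase from the hybrid Julian calendar.
    true
  } else {
    // Unknown writer: fall back to the SQL config.
    // (EXCEPTION handling is simplified here.)
    modeInRead match {
      case "CORRECTED" => false
      case "LEGACY" => true
      case other => throw new IllegalArgumentException(
        s"Cannot tell how INT96 timestamps were written (mode = $other); " +
          "set spark.sql.legacy.parquet.int96RebaseModeInRead to LEGACY or CORRECTED")
    }
  }
}
```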

New SQL configs are added instead of re-using the existing spark.sql.legacy.parquet.datetimeRebaseModeInWrite and spark.sql.legacy.parquet.datetimeRebaseModeInRead for the following reasons (a usage sketch follows this list):

  • To allow users to have different modes for INT96 and for TIMESTAMP_MICROS (MILLIS). For example, users might want to save INT96 as LEGACY but TIMESTAMP_MICROS as CORRECTED.
  • To have different modes for INT96 and DATE on load (or on save).
  • To stay backward compatible with Spark 2.4: spark.sql.legacy.parquet.datetimeRebaseModeInWrite/Read are currently set to EXCEPTION by default, so re-using them would not preserve the legacy behavior for INT96.
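
As an illustration of how a user might combine the new configs with the existing ones, here is a minimal write/read sketch (not from the PR). It assumes a local SparkSession, and the output path is a placeholder.

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[*]").appName("int96-rebase-demo").getOrCreate()
import spark.implicits._

val df = Seq(Tuple1(java.sql.Timestamp.valueOf("1001-01-01 01:02:03"))).toDF("ts")

// Save timestamps as INT96 without rebasing: the files get the
// org.apache.spark.int96NoRebase metadata key.
spark.conf.set("spark.sql.parquet.outputTimestampType", "INT96")
spark.conf.set("spark.sql.legacy.parquet.int96RebaseModeInWrite", "CORRECTED")
df.write.mode("overwrite").parquet("/tmp/ts_int96")

// On read, the mode is inferred from the file metadata; the read-side config
// only matters for files whose writer cannot be determined.
spark.conf.set("spark.sql.legacy.parquet.int96RebaseModeInRead", "CORRECTED")
spark.read.parquet("/tmp/ts_int96").show(false)
```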

Why are the changes needed?

  1. The Parquet spec says that INT96 timestamps must be stored as Julian days (see PARQUET-861: Document INT96 timestamps, parquet-format#49). This doesn't mean that a reader (or a writer) is based on the Julian calendar, so rebasing from the Proleptic Gregorian to the Julian calendar may not be needed.
  2. Rebasing from/to the Julian calendar can lose information because dates in one calendar don't exist in the other. For instance, 1582-10-05..1582-10-14 exist in the Proleptic Gregorian calendar but not in the hybrid calendar (Julian + Gregorian), and vice versa, the Julian date 1000-02-29 doesn't exist in the Proleptic Gregorian calendar (see the sketch after this list). We should allow users to save timestamps without losing such dates (rebasing shifts them to the next valid date).
  3. It also makes Spark compatible with other systems such as Impala and newer versions of Hive that write Proleptic Gregorian based INT96 timestamps.
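
To make the calendar mismatch concrete, here is a small standalone sketch (independent of Spark, not part of the PR) contrasting java.time, which uses the Proleptic Gregorian calendar, with java.util.GregorianCalendar, which uses the hybrid Julian + Gregorian calendar.

```scala
import java.time.{LocalDate, ZoneOffset}
import java.util.{Calendar, GregorianCalendar, TimeZone}

object CalendarMismatchDemo {
  def main(args: Array[String]): Unit = {
    // 1000-02-29 is a valid date in the hybrid calendar (year 1000 is a Julian leap year) ...
    val hybrid = new GregorianCalendar(TimeZone.getTimeZone("UTC"))
    hybrid.clear()
    hybrid.set(1000, Calendar.FEBRUARY, 29)
    val instant = hybrid.toInstant

    // ... but the Proleptic Gregorian calendar has no such date: year 1000 is not
    // a Gregorian leap year, so this throws java.time.DateTimeException.
    try LocalDate.of(1000, 2, 29)
    catch {
      case e: java.time.DateTimeException => println(s"No such proleptic date: ${e.getMessage}")
    }

    // The same physical instant therefore carries a shifted label in Proleptic Gregorian:
    println(LocalDate.ofInstant(instant, ZoneOffset.UTC))  // a day in early March 1000
  }
}
```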

Does this PR introduce any user-facing change?

It can, when spark.sql.legacy.parquet.int96RebaseModeInWrite is set to a non-default value, i.e. something other than LEGACY.

How was this patch tested?

  • Added a test to check the metadata key org.apache.spark.int96NoRebase (a sketch of such a footer check is shown below)
  • By ParquetIOSuite
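
For reference, a hedged sketch of how such a footer check could be done with parquet-mr's footer API. This is illustrative only, not the PR's actual test; the directory path is a placeholder and it assumes a file written with int96RebaseModeInWrite = CORRECTED, as in the earlier write sketch.

```scala
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.Path
import org.apache.parquet.hadoop.ParquetFileReader
import org.apache.parquet.hadoop.util.HadoopInputFile

// Pick one data file from the (placeholder) output directory.
val partFile = new java.io.File("/tmp/ts_int96")
  .listFiles()
  .find(f => f.getName.startsWith("part-") && f.getName.endsWith(".parquet"))
  .get

val reader = ParquetFileReader.open(
  HadoopInputFile.fromPath(new Path(partFile.getAbsolutePath), new Configuration()))
try {
  // Key/value metadata of the parquet footer; the tag should be present
  // when the file was written without INT96 rebasing.
  val kv = reader.getFooter.getFileMetaData.getKeyValueMetaData
  assert(kv.containsKey("org.apache.spark.int96NoRebase"))
} finally {
  reader.close()
}
```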

@SparkQA

SparkQA commented Oct 15, 2020

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/34444/

@SparkQA

SparkQA commented Oct 15, 2020

Kubernetes integration test status success
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/34444/

@SparkQA

SparkQA commented Oct 15, 2020

Test build #129839 has finished for PR 30056 at commit 8a16ca1.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Oct 15, 2020

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/34451/

@SparkQA

SparkQA commented Oct 15, 2020

Kubernetes integration test status failure
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/34451/

@SparkQA

SparkQA commented Oct 15, 2020

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/34454/

@SparkQA

SparkQA commented Oct 15, 2020

Kubernetes integration test status success
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/34454/

@SparkQA

SparkQA commented Oct 15, 2020

Test build #129845 has finished for PR 30056 at commit 0bd60e4.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Oct 15, 2020

Test build #129848 has finished for PR 30056 at commit 4f62e27.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

…ase-int96

# Conflicts:
#	sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetIOSuite.scala
@SparkQA

SparkQA commented Oct 16, 2020

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/34502/

@SparkQA

SparkQA commented Oct 16, 2020

Kubernetes integration test status success
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/34502/

@SparkQA

SparkQA commented Oct 16, 2020

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/34510/

@SparkQA

SparkQA commented Oct 16, 2020

Kubernetes integration test status success
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/34510/

@SparkQA

SparkQA commented Oct 16, 2020

Test build #129897 has finished for PR 30056 at commit fb67c68.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
  • public static class Bitmaps
  • public static class BitmapArrays
  • class BlockPushErrorHandler implements ErrorHandler
  • public class MergedBlockMeta
  • public class OneForOneBlockPusher
  • public class FinalizeShuffleMerge extends BlockTransferMessage
  • public class MergeStatuses extends BlockTransferMessage
  • public class PushBlockStream extends BlockTransferMessage
  • case class Lag(input: Expression, inputOffset: Expression, default: Expression)

@SparkQA

SparkQA commented Oct 16, 2020

Test build #129904 has finished for PR 30056 at commit 76330e2.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@dongjoon-hyun dongjoon-hyun marked this pull request as draft October 16, 2020 21:38
@SparkQA

SparkQA commented Oct 17, 2020

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/34550/

@SparkQA

SparkQA commented Oct 17, 2020

Kubernetes integration test status success
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/34550/

@SparkQA

SparkQA commented Oct 17, 2020

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/34552/

@SparkQA

SparkQA commented Oct 17, 2020

Kubernetes integration test status failure
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/34552/

@SparkQA

SparkQA commented Oct 17, 2020

Test build #129945 has finished for PR 30056 at commit ab3fc54.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Oct 17, 2020

Test build #129947 has finished for PR 30056 at commit 1cc8c3e.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Oct 17, 2020

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/34554/

@SparkQA

SparkQA commented Oct 17, 2020

Kubernetes integration test status failure
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/34554/

@SparkQA

SparkQA commented Oct 17, 2020

Test build #129949 has finished for PR 30056 at commit 55fe5cc.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@MaxGekk MaxGekk marked this pull request as ready for review October 18, 2020 08:13
@MaxGekk MaxGekk changed the title [WIP][SPARK-33160][SQL] Allow saving/loading INT96 in parquet w/o rebasing [SPARK-33160][SQL] Allow saving/loading INT96 in parquet w/o rebasing Oct 18, 2020
@MaxGekk
Member Author

MaxGekk commented Oct 18, 2020

@cloud-fan @tomvanbussel @ala @mswit-databricks @HyukjinKwon @bart-samwel May I ask you to review this PR?

@SparkQA

SparkQA commented Oct 18, 2020

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/34562/

@SparkQA

SparkQA commented Oct 18, 2020

Kubernetes integration test status success
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/34562/

@SparkQA

SparkQA commented Oct 18, 2020

Test build #129956 has finished for PR 30056 at commit fab3a86.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

Contributor

@tomvanbussel tomvanbussel left a comment

Big thank you for doing this, Max! LGTM.

.stringConf
.transform(_.toUpperCase(Locale.ROOT))
.checkValues(LegacyBehaviorPolicy.values.map(_.toString))
.createWithDefault(LegacyBehaviorPolicy.LEGACY.toString)
Contributor

Could the default be made LegacyBehaviorPolicy.EXCEPTION instead? Could also do this in a follow-up PR if this is controversial.

Member Author

I am not sure that we can do that in the minor release 3.1. This can break existing apps. @cloud-fan @HyukjinKwon WDYT?

Contributor

It's a breaking change, but it is probably acceptable since it doesn't silently change results. We need an item in the migration guide, though.

Member Author

Here is the PR: #30121. We'll see how many tests it affects.

@HyukjinKwon
Member

Merged to master.

HyukjinKwon pushed a commit that referenced this pull request Oct 22, 2020
### What changes were proposed in this pull request?
1. Turn the SQL config `spark.sql.legacy.parquet.int96RebaseModeInWrite` (added by #30056) off and on in `DateTimeRebaseBenchmark`. The parquet readers should infer the correct rebasing mode automatically from the metadata.
2. Regenerate benchmark results of `DateTimeRebaseBenchmark` in the environment:

| Item | Description |
| ---- | ---- |
| Region | us-west-2 (Oregon) |
| Instance | r3.xlarge (spot instance) |
| AMI | ami-06f2f779464715dc5 (ubuntu/images/hvm-ssd/ubuntu-bionic-18.04-amd64-server-20190722.1) |
| Java | OpenJDK 8/11 installed by `sudo add-apt-repository ppa:openjdk-r/ppa` & `sudo apt install openjdk-11-jdk` |

### Why are the changes needed?
To have up-to-date info about the performance of INT96, which is the default parquet output type for Catalyst's timestamp type.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
By updating benchmark results:
```
$ SPARK_GENERATE_BENCHMARK_FILES=1 build/sbt "sql/test:runMain org.apache.spark.sql.execution.benchmark.DateTimeRebaseBenchmark"
```

Closes #30118 from MaxGekk/int96-rebase-benchmark.

Authored-by: Max Gekk <[email protected]>
Signed-off-by: HyukjinKwon <[email protected]>
@MaxGekk MaxGekk deleted the parquet-rebase-int96 branch December 11, 2020 20:28