
Conversation

@MaxGekk
Member

@MaxGekk MaxGekk commented Oct 15, 2020

What changes were proposed in this pull request?

  1. Add the SQL config spark.sql.legacy.parquet.int96RebaseModeInWrite to control rebasing of timestamps when they are saved as INT96. It supports the same set of values as spark.sql.legacy.parquet.datetimeRebaseModeInWrite, but its default value is LEGACY to preserve backward compatibility with Spark <= 3.0.
  2. Write the metadata key org.apache.spark.int96NoRebase to parquet files if the files are saved with spark.sql.legacy.parquet.int96RebaseModeInWrite set to a value other than LEGACY.
  3. Add the SQL config spark.sql.legacy.parquet.int96RebaseModeInRead to control loading of INT96 timestamps when the parquet metadata doesn't contain enough info (the org.apache.spark.int96NoRebase tag) about the writer, i.e. whether the INT96 values were written by a Proleptic Gregorian system or by a Julian-based one.
  4. Modified the vectorized and parquet-mr readers to support loading/saving INT96 timestamps without rebasing, depending on the SQL config and the metadata tag (a simplified sketch of the read-side decision follows this list):
    • No rebasing in tests when the SQL config spark.test.forceNoRebase is set to true.
    • No rebasing if the parquet metadata contains the tag org.apache.spark.int96NoRebase. This is the case when parquet files are saved by Spark >= 3.1 with spark.sql.legacy.parquet.int96RebaseModeInWrite set to CORRECTED, or by other systems that write the tag org.apache.spark.int96NoRebase.
    • Rebasing if parquet files were saved by Spark (any version) without the metadata tag org.apache.spark.int96NoRebase.
    • Rebasing depends on the SQL config spark.sql.legacy.parquet.int96RebaseModeInRead if neither of the metadata tags org.apache.spark.version and org.apache.spark.int96NoRebase is present.
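
For clarity, here is a simplified sketch of the read-side decision just described. The helper name, signature, and error message are hypothetical; this is illustrative pseudologic, not Spark's actual internal code.

```scala
// Hypothetical helper mirroring the rules above; not the actual Spark implementation.
def int96NeedsRebase(
    fileMetadata: Map[String, String],  // parquet key/value metadata of the file
    modeInRead: String): Boolean = {    // spark.sql.legacy.parquet.int96RebaseModeInRead
  if (fileMetadata.contains("org.apache.spark.int96NoRebase")) {
    // Written in Proleptic Gregorian (Spark >= 3.1 with CORRECTED, or another
    // system that sets the tag): load values as-is.
    false
  } else if (fileMetadata.contains("org.apache.spark.version")) {
    // Written by Spark without the tag: rebase from the hybrid Julian calendar.
    true
  } else {
    // Unknown writer: fall back to the SQL config.
    // (EXCEPTION handling is simplified here.)
    modeInRead match {
      case "CORRECTED" => false
      case "LEGACY" => true
      case other => throw new IllegalArgumentException(
        s"Cannot tell how INT96 timestamps were written (mode = $other); " +
          "set spark.sql.legacy.parquet.int96RebaseModeInRead to LEGACY or CORRECTED")
    }
  }
}
```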

New SQL configs are added instead of re-using the existing spark.sql.legacy.parquet.datetimeRebaseModeInWrite and spark.sql.legacy.parquet.datetimeRebaseModeInRead for the following reasons (a usage sketch follows this list):

  • To allow users to have different modes for INT96 and for TIMESTAMP_MICROS (MILLIS). For example, users might want to save INT96 as LEGACY but TIMESTAMP_MICROS as CORRECTED.
  • To have different modes for INT96 and DATE on load (or on save).
  • To stay backward compatible with Spark 2.4: spark.sql.legacy.parquet.datetimeRebaseModeInWrite/Read are currently set to EXCEPTION by default, so re-using them would not preserve the legacy behavior for INT96.
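
As an illustration of how a user might combine the new configs with the existing ones, here is a minimal write/read sketch (not from the PR). It assumes a local SparkSession, and the output path is a placeholder.

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[*]").appName("int96-rebase-demo").getOrCreate()
import spark.implicits._

val df = Seq(Tuple1(java.sql.Timestamp.valueOf("1001-01-01 01:02:03"))).toDF("ts")

// Save timestamps as INT96 without rebasing: the files get the
// org.apache.spark.int96NoRebase metadata key.
spark.conf.set("spark.sql.parquet.outputTimestampType", "INT96")
spark.conf.set("spark.sql.legacy.parquet.int96RebaseModeInWrite", "CORRECTED")
df.write.mode("overwrite").parquet("/tmp/ts_int96")

// On read, the mode is inferred from the file metadata; the read-side config
// only matters for files whose writer cannot be determined.
spark.conf.set("spark.sql.legacy.parquet.int96RebaseModeInRead", "CORRECTED")
spark.read.parquet("/tmp/ts_int96").show(false)
```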

Why are the changes needed?

  1. The Parquet spec says that INT96 timestamps must be stored as Julian days (see PARQUET-861: Document INT96 timestamps, parquet-format#49). This doesn't mean that a reader (or a writer) is based on the Julian calendar, so rebasing from the Proleptic Gregorian to the Julian calendar may not be needed.
  2. Rebasing from/to the Julian calendar can lose information because dates in one calendar don't exist in the other. For instance, 1582-10-05..1582-10-14 exist in the Proleptic Gregorian calendar but not in the hybrid calendar (Julian + Gregorian), and vice versa, the Julian date 1000-02-29 doesn't exist in the Proleptic Gregorian calendar (see the sketch after this list). We should allow users to save timestamps without losing such dates (rebasing shifts them to the next valid date).
  3. It also makes Spark compatible with other systems such as Impala and newer versions of Hive that write Proleptic Gregorian based INT96 timestamps.
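
To make the calendar mismatch concrete, here is a small standalone sketch (independent of Spark, not part of the PR) contrasting java.time, which uses the Proleptic Gregorian calendar, with java.util.GregorianCalendar, which uses the hybrid Julian + Gregorian calendar.

```scala
import java.time.{LocalDate, ZoneOffset}
import java.util.{Calendar, GregorianCalendar, TimeZone}

object CalendarMismatchDemo {
  def main(args: Array[String]): Unit = {
    // 1000-02-29 is a valid date in the hybrid calendar (year 1000 is a Julian leap year) ...
    val hybrid = new GregorianCalendar(TimeZone.getTimeZone("UTC"))
    hybrid.clear()
    hybrid.set(1000, Calendar.FEBRUARY, 29)
    val instant = hybrid.toInstant

    // ... but the Proleptic Gregorian calendar has no such date: year 1000 is not
    // a Gregorian leap year, so this throws java.time.DateTimeException.
    try LocalDate.of(1000, 2, 29)
    catch {
      case e: java.time.DateTimeException => println(s"No such proleptic date: ${e.getMessage}")
    }

    // The same physical instant therefore carries a shifted label in Proleptic Gregorian:
    println(LocalDate.ofInstant(instant, ZoneOffset.UTC))  // a day in early March 1000
  }
}
```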

Does this PR introduce any user-facing change?

It can, when spark.sql.legacy.parquet.int96RebaseModeInWrite is set to a non-default value, i.e. something other than LEGACY.

How was this patch tested?

  • Added a test to check the metadata key org.apache.spark.int96NoRebase (a sketch of such a footer check is shown below)
  • By ParquetIOSuite
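
For reference, a hedged sketch of how such a footer check could be done with parquet-mr's footer API. This is illustrative only, not the PR's actual test; the directory path is a placeholder and it assumes a file written with int96RebaseModeInWrite = CORRECTED, as in the earlier write sketch.

```scala
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.Path
import org.apache.parquet.hadoop.ParquetFileReader
import org.apache.parquet.hadoop.util.HadoopInputFile

// Pick one data file from the (placeholder) output directory.
val partFile = new java.io.File("/tmp/ts_int96")
  .listFiles()
  .find(f => f.getName.startsWith("part-") && f.getName.endsWith(".parquet"))
  .get

val reader = ParquetFileReader.open(
  HadoopInputFile.fromPath(new Path(partFile.getAbsolutePath), new Configuration()))
try {
  // Key/value metadata of the parquet footer; the tag should be present
  // when the file was written without INT96 rebasing.
  val kv = reader.getFooter.getFileMetaData.getKeyValueMetaData
  assert(kv.containsKey("org.apache.spark.int96NoRebase"))
} finally {
  reader.close()
}
```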

@SparkQA

SparkQA commented Oct 15, 2020

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/34444/

@SparkQA

SparkQA commented Oct 15, 2020

Kubernetes integration test status success
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/34444/

@SparkQA

SparkQA commented Oct 15, 2020

Test build #129839 has finished for PR 30056 at commit 8a16ca1.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Oct 15, 2020

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/34451/

@SparkQA

SparkQA commented Oct 15, 2020

Kubernetes integration test status failure
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/34451/

@SparkQA

SparkQA commented Oct 15, 2020

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/34454/

@SparkQA

SparkQA commented Oct 15, 2020

Kubernetes integration test status success
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/34454/

@SparkQA

SparkQA commented Oct 15, 2020

Test build #129845 has finished for PR 30056 at commit 0bd60e4.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Oct 15, 2020

Test build #129848 has finished for PR 30056 at commit 4f62e27.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

…ase-int96

# Conflicts:
#	sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetIOSuite.scala
@SparkQA

SparkQA commented Oct 16, 2020

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/34502/

@SparkQA

SparkQA commented Oct 16, 2020

Kubernetes integration test status success
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/34502/

@SparkQA

SparkQA commented Oct 16, 2020

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/34510/

@SparkQA

SparkQA commented Oct 16, 2020

Kubernetes integration test status success
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/34510/

@SparkQA

SparkQA commented Oct 16, 2020

Test build #129897 has finished for PR 30056 at commit fb67c68.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
  • public static class Bitmaps
  • public static class BitmapArrays
  • class BlockPushErrorHandler implements ErrorHandler
  • public class MergedBlockMeta
  • public class OneForOneBlockPusher
  • public class FinalizeShuffleMerge extends BlockTransferMessage
  • public class MergeStatuses extends BlockTransferMessage
  • public class PushBlockStream extends BlockTransferMessage
  • case class Lag(input: Expression, inputOffset: Expression, default: Expression)

@SparkQA

SparkQA commented Oct 16, 2020

Test build #129904 has finished for PR 30056 at commit 76330e2.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@dongjoon-hyun dongjoon-hyun marked this pull request as draft October 16, 2020 21:38
@SparkQA

SparkQA commented Oct 17, 2020

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/34550/

@SparkQA

SparkQA commented Oct 17, 2020

Kubernetes integration test status success
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/34550/

@SparkQA

SparkQA commented Oct 17, 2020

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/34552/

@SparkQA

SparkQA commented Oct 17, 2020

Kubernetes integration test status failure
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/34552/

@SparkQA

SparkQA commented Oct 17, 2020

Test build #129945 has finished for PR 30056 at commit ab3fc54.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Oct 17, 2020

Test build #129947 has finished for PR 30056 at commit 1cc8c3e.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Oct 17, 2020

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/34554/

@SparkQA

SparkQA commented Oct 17, 2020

Kubernetes integration test status failure
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/34554/

@SparkQA

SparkQA commented Oct 17, 2020

Test build #129949 has finished for PR 30056 at commit 55fe5cc.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@MaxGekk MaxGekk marked this pull request as ready for review October 18, 2020 08:13
@MaxGekk MaxGekk changed the title [WIP][SPARK-33160][SQL] Allow saving/loading INT96 in parquet w/o rebasing [SPARK-33160][SQL] Allow saving/loading INT96 in parquet w/o rebasing Oct 18, 2020
@MaxGekk
Member Author

MaxGekk commented Oct 18, 2020

@cloud-fan @tomvanbussel @ala @mswit-databricks @HyukjinKwon @bart-samwel May I ask you to review this PR?

@SparkQA

SparkQA commented Oct 18, 2020

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/34562/

@SparkQA

SparkQA commented Oct 18, 2020

Kubernetes integration test status success
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/34562/

@SparkQA

SparkQA commented Oct 18, 2020

Test build #129956 has finished for PR 30056 at commit fab3a86.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

Contributor

@tomvanbussel tomvanbussel left a comment

Big thank you for doing this, Max! LGTM.

.stringConf
.transform(_.toUpperCase(Locale.ROOT))
.checkValues(LegacyBehaviorPolicy.values.map(_.toString))
.createWithDefault(LegacyBehaviorPolicy.LEGACY.toString)
Contributor

Could the default be made LegacyBehaviorPolicy.EXCEPTION instead? Could also do this in a follow-up PR if this is controversial.

Member Author

I am not sure that we can do that in the minor release 3.1. This can break existing apps. @cloud-fan @HyukjinKwon WDYT?

Contributor

It's a breaking change, but it is probably acceptable since it doesn't silently change results. We need an item in the migration guide, though.

Member Author

Here is the PR: #30121. We'll see how many tests it affects.

@HyukjinKwon
Member

Merged to master.

HyukjinKwon pushed a commit that referenced this pull request Oct 22, 2020
### What changes were proposed in this pull request?
1. Turn the SQL config `spark.sql.legacy.parquet.int96RebaseModeInWrite` (added by #30056) off and on in `DateTimeRebaseBenchmark`. The parquet readers should infer the correct rebasing mode automatically from the metadata.
2. Regenerate benchmark results of `DateTimeRebaseBenchmark` in the environment:

| Item | Description |
| ---- | ---- |
| Region | us-west-2 (Oregon) |
| Instance | r3.xlarge (spot instance) |
| AMI | ami-06f2f779464715dc5 (ubuntu/images/hvm-ssd/ubuntu-bionic-18.04-amd64-server-20190722.1) |
| Java | OpenJDK 8/11 installed by `sudo add-apt-repository ppa:openjdk-r/ppa` & `sudo apt install openjdk-11-jdk` |

### Why are the changes needed?
To have up-to-date info about the performance of INT96, which is the default parquet output type for Catalyst's timestamp type.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
By updating benchmark results:
```
$ SPARK_GENERATE_BENCHMARK_FILES=1 build/sbt "sql/test:runMain org.apache.spark.sql.execution.benchmark.DateTimeRebaseBenchmark"
```

Closes #30118 from MaxGekk/int96-rebase-benchmark.

Authored-by: Max Gekk <[email protected]>
Signed-off-by: HyukjinKwon <[email protected]>
@MaxGekk MaxGekk deleted the parquet-rebase-int96 branch December 11, 2020 20:28