[SPARK-31238][SQL] Rebase dates to/from Julian calendar in write/read for ORC datasource #28016
Conversation
Test build #120324 has finished for PR 28016 at commit
So, does this happen only in the vectorized reader, @MaxGekk ?
Test build #120345 has finished for PR 28016 at commit
@dongjoon-hyun Correct, the regular reader uses `OrcDeserializer` (sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/orc/OrcDeserializer.scala, line 111 in 300ec1a), and the PRs #27807 and #27980 introduced date rebasing there. Also, I fixed the ORC writer and added a round-trip test for both the vectorized and non-vectorized readers.
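For readers following along, here is a minimal Scala sketch of the rebasing idea behind `DateTimeUtils.rebaseJulianToGregorianDays` (a simplification, not Spark's implementation: it leans on `GregorianCalendar`'s default hybrid cut-over and ignores BCE eras):

```scala
import java.time.LocalDate
import java.util.{Calendar, GregorianCalendar, TimeZone}

// Interpret `julianDays` (days since 1970-01-01) in the hybrid
// Julian/Gregorian calendar, extract the date fields, and rebuild
// the same fields as a Proleptic Gregorian date.
def rebaseJulianToGregorianDays(julianDays: Int): Int = {
  val hybrid = new GregorianCalendar(TimeZone.getTimeZone("UTC"))
  hybrid.clear()
  hybrid.setTimeInMillis(julianDays * 86400000L)
  val rebased = LocalDate.of(          // java.time is Proleptic Gregorian
    hybrid.get(Calendar.YEAR),
    hybrid.get(Calendar.MONTH) + 1,    // Calendar months are 0-based
    hybrid.get(Calendar.DAY_OF_MONTH))
  Math.toIntExact(rebased.toEpochDay)  // rebased days since the epoch
}
```

Around the year 1200 the two calendars disagree by 7 days, which is exactly the `1200-01-08` vs `1200-01-01` difference shown in the PR description below.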
I think the failure isn't related to my changes: #28016 (comment). I will update the PR description shortly, so the PR will be ready for review. /cc @cloud-fan
jenkins, retest this, please
@cloud-fan @dongjoon-hyun @bersprockets Please review this PR.
```scala
    var julianDays: Int)
  extends DateWritable {

  def this() = this(0, 0)
```
I assume that `gregorianDays` and `julianDays` will be set later via the `set` method.
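A rough Scala sketch of the class shape under review; the `set` helper and the `getDays` override here are assumptions added to illustrate the comment, not the exact Spark code:

```scala
import org.apache.hadoop.hive.serde2.io.DateWritable

// Carries days in both calendars: `gregorianDays` for Catalyst, and
// `julianDays` for Hive/ORC, which expect the hybrid calendar.
class DaysWritable(
    var gregorianDays: Int,
    var julianDays: Int)
  extends DateWritable {

  // Hadoop instantiates Writables reflectively through the no-arg
  // constructor and populates them afterwards, hence the zeros.
  def this() = this(0, 0)

  // Hypothetical setter filling both fields after construction.
  def set(gregorian: Int, julian: Int): Unit = {
    gregorianDays = gregorian
    julianDays = julian
  }

  // Hive serializes whatever getDays() returns, so expose Julian days.
  override def getDays(): Int = julianDays
}
```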
Test build #120367 has finished for PR 28016 at commit
Retest this please.
dongjoon-hyun left a comment
+1, LGTM (pending Jenkins with hive-1.2 to make sure).
Thank you for the swift fix, @MaxGekk !
Test build #120379 has finished for PR 28016 at commit
Hi, @MaxGekk .
sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/DaysWritable.scala
```diff
 public int getInt(int rowId) {
-  return (int) longData.vector[getRowIndex(rowId)];
+  int index = getRowIndex(rowId);
+  int value = (int) longData.vector[index];
```
nit: `int value = (int) longData.vector[getRowIndex(rowId)];`
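To see the intent in one place, a Scala sketch of the logic the vectorized reader ends up with (a simplified stand-in: the real `OrcColumnVector` is Java, the `isDate` flag and flat `vector` array are assumptions, and `rebaseJulianToGregorianDays` is the sketch from earlier in this thread):

```scala
// Simplified column wrapper: reads the raw ORC value once, then
// rebases it only when the column is a DATE column.
class DateColumnSketch(vector: Array[Long], isDate: Boolean) {
  def getInt(rowId: Int): Int = {
    val value = vector(rowId).toInt
    if (isDate) rebaseJulianToGregorianDays(value) else value
  }
}
```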
sql/core/v1.2/src/main/java/org/apache/spark/sql/execution/datasources/orc/OrcColumnVector.java
```scala
import org.apache.spark.sql.catalyst.InternalRow
import org.apache.spark.sql.catalyst.expressions._
import org.apache.spark.sql.catalyst.util._
import org.apache.spark.sql.execution.datasources.DaysWritable
```
unnecessary change
If I remove it, the build fails with:
```
Error:(1012, 44) not found: type DaysWritable
private def getDateWritable(value: Any): DaysWritable =
```
I moved `DaysWritable` from `sql/hive` to `sql/core` to reuse it in ORC.
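For the write path, a hedged sketch of what a `getDateWritable` returning `DaysWritable` could look like; the body is an assumption, and `rebaseGregorianToJulianDays` stands for the inverse of the read-side rebasing sketched earlier, not a confirmed `OrcShimUtils` API:

```scala
// Catalyst passes DATE values as Int days in the Proleptic Gregorian
// calendar; DaysWritable lets ORC serialize hybrid-calendar days.
private def getDateWritable(value: Any): DaysWritable =
  if (value == null) {
    null
  } else {
    val gregorianDays = value.asInstanceOf[Int]
    new DaysWritable(gregorianDays, rebaseGregorianToJulianDays(gregorianDays))
  }
```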
Test build #120403 has finished for PR 28016 at commit
jenkins, retest this, please
Test build #120408 has finished for PR 28016 at commit
Test build #120416 has finished for PR 28016 at commit
dongjoon-hyun left a comment
Please move `sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/DaysWritable.scala` to `sql/core/v2.3/src/main/scala/org/apache/spark/sql/execution/datasources/orc/DaysWritable.scala`.
```scala
 * This is a clone of `org.apache.spark.sql.execution.datasources.DaysWritable`.
 * The class is cloned because Hive ORC v1.2 uses different `DateWritable`:
 * - v1.2: `org.apache.orc.storage.serde2.io.DateWritable`
 * - v2.3 and `HiveInspectors`: `org.apache.hadoop.hive.serde2.io.DateWritable`
```
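To make the clone's motivation concrete: the two copies declare the same class name against different `DateWritable` imports, so neither can be shared across the shims (the shortened paths below follow the layout described in this thread):

```scala
// sql/core/v1.2/.../datasources/orc/DaysWritable.scala (the clone)
import org.apache.orc.storage.serde2.io.DateWritable

// sql/core/src/.../datasources/DaysWritable.scala, used by the v2.3
// shim and by HiveInspectors
import org.apache.hadoop.hive.serde2.io.DateWritable
```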
We don't need the above 4-line comment because the v1.2 and v2.3 folder structure is already designed for that.
If I move it to 2.3 and enable hive-1.2, `HiveInspectors` will use `DaysWritable` from v1.2, which is wrong. Or am I missing something in your proposal?
Got it. Let me check again.
Sorry, why do you want to move this to 2.3?
What I'm asking is to keep this AS-IS and move the other `DaysWritable`.
I'm checking again from my side.
This is what I tried in MaxGekk#26, and it fails when I enable hive-1.2:
```shell
build/sbt -Phive-1.2 clean package
```
Feel free to open a PR against this PR if you mean something different.
```scala
import org.apache.hadoop.hive.serde2.io.{DateWritable, HiveDecimalWritable}

import org.apache.spark.sql.catalyst.expressions.SpecializedGetters
import org.apache.spark.sql.execution.datasources.DaysWritable
```
After moving `DaysWritable.scala` here, I guess we don't need this line. Please try to remove it.
Why?
Ur, are you asking why we keep …
My bad. I overlooked the difference in …
dongjoon-hyun left a comment
+1, LGTM. Thank you, @MaxGekk and @cloud-fan .
Merged to master/3.0.
[SPARK-31238][SQL] Rebase dates to/from Julian calendar in write/read for ORC datasource

### What changes were proposed in this pull request?

This PR (SPARK-31238) aims at the following:

1. Modified the ORC vectorized reader, in particular, `OrcColumnVector` v1.2 and v2.3. After the changes, it uses `DateTimeUtils.rebaseJulianToGregorianDays()` added by #27915. The method rebases days from the hybrid calendar (Julian + Gregorian) to the Proleptic Gregorian calendar: it builds a local date in the original calendar, extracts the date fields `year`, `month` and `day` from the local date, builds another local date in the target calendar, and then calculates the days from the epoch `1970-01-01` for the resulting local date.
2. Introduced rebasing of dates while saving ORC files; in particular, I modified `OrcShimUtils.getDateWritable` v1.2 and v2.3 to return `DaysWritable` instead of Hive's `DateWritable`. The `DaysWritable` class was added by PR #27890 (and fixed by #27962). I moved `DaysWritable` from `sql/hive` to `sql/core` to reuse it in the ORC datasource.

### Why are the changes needed?

For backward compatibility with Spark 2.4 and earlier versions. The changes allow users to read dates/timestamps saved by previous versions and get the same result.

### Does this PR introduce any user-facing change?

Yes. Before the changes, loading the date `1200-01-01` saved by Spark 2.4.5 returns the following:
```scala
scala> spark.read.orc("/Users/maxim/tmp/before_1582/2_4_5_date_orc").show(false)
+----------+
|dt        |
+----------+
|1200-01-08|
+----------+
```
After the changes:
```scala
scala> spark.read.orc("/Users/maxim/tmp/before_1582/2_4_5_date_orc").show(false)
+----------+
|dt        |
+----------+
|1200-01-01|
+----------+
```

### How was this patch tested?

- By running `OrcSourceSuite` and `HiveOrcSourceSuite`.
- Added the new test `SPARK-31238: compatibility with Spark 2.4 in reading dates` to `OrcSuite`, which reads an ORC file saved by Spark 2.4.5 via the commands:
```shell
$ export TZ="America/Los_Angeles"
```
```scala
scala> sql("select cast('1200-01-01' as date) dt").write.mode("overwrite").orc("/Users/maxim/tmp/before_1582/2_4_5_date_orc")
scala> spark.read.orc("/Users/maxim/tmp/before_1582/2_4_5_date_orc").show(false)
+----------+
|dt        |
+----------+
|1200-01-01|
+----------+
```
- Added the round-trip test `SPARK-31238: rebasing dates in write`. The test `SPARK-31238: compatibility with Spark 2.4 in reading dates` confirms rebasing in read, so we can check rebasing in write.

Closes #28016 from MaxGekk/rebase-date-orc.

Authored-by: Maxim Gekk <[email protected]>
Signed-off-by: Dongjoon Hyun <[email protected]>
(cherry picked from commit d72ec85)
Signed-off-by: Dongjoon Hyun <[email protected]>
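As a closing illustration, a hedged sketch of what the round-trip test named above might look like inside `OrcSuite` (the test name comes from the commit message; the body is an assumption built on Spark's standard test helpers `withTempPath`, `checkAnswer`, and `toDF`, not the committed test):

```scala
test("SPARK-31238: rebasing dates in write") {
  withTempPath { dir =>
    val path = dir.getCanonicalPath
    // Write a pre-Gregorian-cutover date: the writer must rebase it
    // from Proleptic Gregorian to the hybrid calendar on disk.
    Seq("1001-01-01").toDF("str")
      .select($"str".cast("date").as("dt"))
      .write.orc(path)
    // Reading it back must rebase in the opposite direction and
    // return exactly the original date.
    checkAnswer(
      spark.read.orc(path),
      Row(java.sql.Date.valueOf("1001-01-01")))
  }
}
```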
Hi, @MaxGekk .