
Conversation

@MaxGekk
Member

@MaxGekk MaxGekk commented Mar 25, 2020

What changes were proposed in this pull request?

This PR (SPARK-31238) aims to do the following:

  1. Modified the ORC vectorized reader, in particular OrcColumnVector v1.2 and v2.3. After the changes, it uses DateTimeUtils.rebaseJulianToGregorianDays() added by [SPARK-31159][SQL] Rebase date/timestamp from/to Julian calendar in parquet #27915. The method rebases days from the hybrid calendar (Julian + Gregorian) to the Proleptic Gregorian calendar: it builds a local date in the original calendar, extracts the date fields year, month and day from that local date, and builds another local date in the target calendar. After that, it calculates the number of days since the epoch 1970-01-01 for the resulting local date.
  2. Introduced rebasing of dates while saving ORC files; in particular, I modified OrcShimUtils.getDateWritable v1.2 and v2.3 to return DaysWritable instead of Hive's DateWritable. The DaysWritable class was added by the PR [SPARK-31076][SQL][FOLLOWUP] Incapsulate date rebasing to DaysWritable #27890 (and fixed by [SPARK-31195][SQL] Correct and reuse days rebase functions of DateTimeUtils in DaysWritable #27962). I moved DaysWritable from sql/hive to sql/core to reuse it in the ORC datasource.
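The field-wise rebase described in item 1 can be sketched as a standalone snippet. This is an illustration only, not Spark's actual DateTimeUtils code: the helper name mirrors the real method, but the body is simplified (AD dates only, UTC):

```scala
import java.time.LocalDate
import java.util.{Calendar, GregorianCalendar, TimeZone}

// Rebase a day count since 1970-01-01 from the hybrid Julian+Gregorian
// calendar to the Proleptic Gregorian calendar (simplified sketch).
def rebaseJulianToGregorianDays(julianDays: Int): Int = {
  // java.util.GregorianCalendar uses the hybrid calendar by default,
  // with the Julian-to-Gregorian cutover in October 1582.
  val cal = new GregorianCalendar(TimeZone.getTimeZone("UTC"))
  cal.clear()
  cal.set(1970, Calendar.JANUARY, 1)
  cal.add(Calendar.DAY_OF_MONTH, julianDays)
  // Extract the date fields in the original (hybrid) calendar ...
  val year = cal.get(Calendar.YEAR)
  val month = cal.get(Calendar.MONTH) + 1
  val day = cal.get(Calendar.DAY_OF_MONTH)
  // ... re-interpret them as a Proleptic Gregorian date, and count days
  // from the epoch 1970-01-01.
  LocalDate.of(year, month, day).toEpochDay.toInt
}
```

For instance, the hybrid-calendar day count that Spark 2.4 writes for 1200-01-01 rebases to a value 7 days smaller, which matches the 1200-01-08 vs 1200-01-01 difference shown in the examples below.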

Why are the changes needed?

For backward compatibility with Spark 2.4 and earlier versions. The changes allow users to read dates/timestamps saved by previous versions and get the same results.

Does this PR introduce any user-facing change?

Yes. Before the changes, loading the date 1200-01-01 saved by Spark 2.4.5 returns the following:

scala> spark.read.orc("/Users/maxim/tmp/before_1582/2_4_5_date_orc").show(false)
+----------+
|dt        |
+----------+
|1200-01-08|
+----------+

After the changes

scala> spark.read.orc("/Users/maxim/tmp/before_1582/2_4_5_date_orc").show(false)
+----------+
|dt        |
+----------+
|1200-01-01|
+----------+

How was this patch tested?

  • By running OrcSourceSuite and HiveOrcSourceSuite.
  • Added a new test SPARK-31238: compatibility with Spark 2.4 in reading dates to OrcSuite which reads an ORC file saved by Spark 2.4.5 via the commands:
$ export TZ="America/Los_Angeles"
scala> sql("select cast('1200-01-01' as date) dt").write.mode("overwrite").orc("/Users/maxim/tmp/before_1582/2_4_5_date_orc")
scala> spark.read.orc("/Users/maxim/tmp/before_1582/2_4_5_date_orc").show(false)
+----------+
|dt        |
+----------+
|1200-01-01|
+----------+
  • Added a round trip test SPARK-31238: rebasing dates in write. The test SPARK-31238: compatibility with Spark 2.4 in reading dates confirms rebasing on read, so the round trip checks rebasing on write.

@SparkQA

SparkQA commented Mar 25, 2020

Test build #120324 has finished for PR 28016 at commit ecef05d.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@dongjoon-hyun
Member

So, does this happen only in vectorized reader, @MaxGekk ?

@SparkQA

SparkQA commented Mar 25, 2020

Test build #120345 has finished for PR 28016 at commit 47d8588.

  • This patch fails PySpark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@MaxGekk
Member Author

MaxGekk commented Mar 25, 2020

So, does this happen only in vectorized reader

@dongjoon-hyun Correct, the regular reader uses DateTimeUtils.fromJavaDate:

updater.setInt(ordinal, DateTimeUtils.fromJavaDate(OrcShimUtils.getSqlDate(value)))

And the PRs #27807 and #27980 introduced date rebasing in fromJavaDate().

Also, I fixed the ORC writer and added a round trip test for both vectorized and non-vectorized readers.
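For context, the net effect of the rebasing in fromJavaDate can be sketched like this. It is a simplified illustration, not Spark's actual implementation, and fromJavaDateSketch is a hypothetical name:

```scala
import java.sql.Date

// java.sql.Date exposes its year/month/day fields in the hybrid
// Julian+Gregorian calendar; toLocalDate re-interprets those same fields
// as a Proleptic Gregorian date, which is effectively a field-wise rebase.
def fromJavaDateSketch(date: Date): Int =
  date.toLocalDate.toEpochDay.toInt
```

So a date written by an old Hive/ORC path keeps the same printed year-month-day after conversion, even though the underlying day count changes for pre-1582 dates.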

@MaxGekk
Member Author

MaxGekk commented Mar 25, 2020

I think the failure isn't related to my changes: #28016 (comment) . I will update the PR description shortly, so the PR will be ready for review. /cc @cloud-fan

@MaxGekk
Member Author

MaxGekk commented Mar 25, 2020

jenkins, retest this, please

@MaxGekk MaxGekk changed the title [WIP][SPARK-31238][SQL] Rebase dates in ORC Vectorized Reader [WIP][SPARK-31238][SQL] Rebase dates to/from Julian calendar in write/read for ORC datasource Mar 25, 2020
@MaxGekk MaxGekk changed the title [WIP][SPARK-31238][SQL] Rebase dates to/from Julian calendar in write/read for ORC datasource [SPARK-31238][SQL] Rebase dates to/from Julian calendar in write/read for ORC datasource Mar 25, 2020
@MaxGekk
Member Author

MaxGekk commented Mar 25, 2020

@cloud-fan @dongjoon-hyun @bersprockets Please, review this PR.

var julianDays: Int)
extends DateWritable {

def this() = this(0, 0)
Member Author


I assume that gregorianDays and julianDays will be set later via the set method.
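That assumption can be illustrated with a hypothetical, simplified sketch (DaysHolder and rebaseGregorianToJulianDays are illustrative names, not the actual Spark classes): the writable keeps the day count in both calendars, both fields default to 0, and set() refreshes them together, rebasing from Proleptic Gregorian days to the hybrid-calendar days that Hive's DateWritable serialization expects.

```scala
import java.time.LocalDate
import java.util.{GregorianCalendar, TimeZone}

// Rebase Proleptic Gregorian days since 1970-01-01 to the hybrid
// Julian+Gregorian calendar (simplified sketch, AD dates only).
def rebaseGregorianToJulianDays(gregorianDays: Int): Int = {
  val ld = LocalDate.ofEpochDay(gregorianDays.toLong)
  val cal = new GregorianCalendar(TimeZone.getTimeZone("UTC"))
  cal.clear()
  // Interpret the same year/month/day fields in the hybrid calendar ...
  cal.set(ld.getYear, ld.getMonthValue - 1, ld.getDayOfMonth)
  // ... and count whole days from the epoch (midnight UTC, so the
  // division is exact).
  (cal.getTimeInMillis / 86400000L).toInt
}

// Hypothetical holder mirroring the DaysWritable idea: both day counts
// start at 0 and are kept consistent via set().
final class DaysHolder(var gregorianDays: Int, var julianDays: Int) {
  def this() = this(0, 0)
  def set(days: Int): Unit = {
    gregorianDays = days
    julianDays = rebaseGregorianToJulianDays(days)
  }
}
```

Spark-facing code would read gregorianDays while Hive-facing serialization would read julianDays.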

@SparkQA

SparkQA commented Mar 25, 2020

Test build #120367 has finished for PR 28016 at commit 47d8588.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@dongjoon-hyun dongjoon-hyun changed the title [SPARK-31238][SQL] Rebase dates to/from Julian calendar in write/read for ORC datasource [SPARK-31238][SQL][test-hive1.2] Rebase dates to/from Julian calendar in write/read for ORC datasource Mar 25, 2020
@dongjoon-hyun
Member

Retest this please.

dongjoon-hyun
dongjoon-hyun previously approved these changes Mar 25, 2020
Member

@dongjoon-hyun dongjoon-hyun left a comment


+1, LGTM (pending Jenkins with hive-1.2 to make sure.)
Thank you for the swift fix, @MaxGekk !

@SparkQA

SparkQA commented Mar 25, 2020

Test build #120379 has finished for PR 28016 at commit 47d8588.

  • This patch fails to build.
  • This patch merges cleanly.
  • This patch adds no public classes.

@dongjoon-hyun
Member

Hi, @MaxGekk .
Could you fix the compilation failures in the hive-1.2 profile?

@dongjoon-hyun dongjoon-hyun self-requested a review March 25, 2020 23:35
@dongjoon-hyun dongjoon-hyun dismissed their stale review March 26, 2020 04:27

Due to hive-1.2 build failure.

public int getInt(int rowId) {
return (int) longData.vector[getRowIndex(rowId)];
int index = getRowIndex(rowId);
int value = (int) longData.vector[index];
Contributor


nit: int value = (int) longData.vector[getRowIndex(rowId)];

import org.apache.spark.sql.catalyst.InternalRow
import org.apache.spark.sql.catalyst.expressions._
import org.apache.spark.sql.catalyst.util._
import org.apache.spark.sql.execution.datasources.DaysWritable
Contributor


unnecessary change

Member Author


If I remove it, the build fails with:

Error:(1012, 44) not found: type DaysWritable
  private def getDateWritable(value: Any): DaysWritable =

I moved DaysWritable from sql/hive to sql/core to reuse it in ORC.

@SparkQA

SparkQA commented Mar 26, 2020

Test build #120403 has finished for PR 28016 at commit 3b1b791.

  • This patch fails due to an unknown error code, -9.
  • This patch merges cleanly.
  • This patch adds no public classes.

@MaxGekk
Member Author

MaxGekk commented Mar 26, 2020

jenkins, retest this, please

@SparkQA

SparkQA commented Mar 26, 2020

Test build #120408 has finished for PR 28016 at commit 3b1b791.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Mar 26, 2020

Test build #120416 has finished for PR 28016 at commit c8a897a.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

Member

@dongjoon-hyun dongjoon-hyun left a comment


Please move sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/DaysWritable.scala to sql/core/v2.3/src/main/scala/org/apache/spark/sql/execution/datasources/orc/DaysWritable.scala.

* This is a clone of `org.apache.spark.sql.execution.datasources.DaysWritable`.
* The class is cloned because Hive ORC v1.2 uses different `DateWritable`:
* - v1.2: `org.apache.orc.storage.serde2.io.DateWritable`
* - v2.3 and `HiveInspectors`: `org.apache.hadoop.hive.serde2.io.DateWritable`
Member


We don't need the above 4 comment lines because the v1.2 and v2.3 folder structure is already designed for that.

Member Author


If I move it to v2.3 and enable hive-1.2, HiveInspectors will use DaysWritable from v1.2, which is wrong. Or am I missing something in your proposal?

Member


Got it. Let me check again.

Member


? Sorry, why do you move this to 2.3?

Member


What I'm asking is to keep this AS-IS and move the other DaysWritable.

Member


I'm checking again from my side.

Member Author


This is what I tried in MaxGekk#26, and it fails when I enable hive-1.2:

build/sbt -Phive-1.2 clean package

Feel free to open a PR for this PR if you mean something different.

import org.apache.hadoop.hive.serde2.io.{DateWritable, HiveDecimalWritable}

import org.apache.spark.sql.catalyst.expressions.SpecializedGetters
import org.apache.spark.sql.execution.datasources.DaysWritable
Member


After moving DaysWritable.scala here, I guess we don't need this line. Please try to remove it.

@MaxGekk
Member Author

MaxGekk commented Mar 26, 2020

Please move ... to v2.3

Why? DaysWritable is used in sql/hive, in HiveInspectors, just in case.

@dongjoon-hyun
Member

dongjoon-hyun commented Mar 26, 2020

Ur, are you asking why we keep v1.2 and v2.3 structure?
In this PR, you are proposing the following weird structure. And I'm not sure if you checked what happens to DaysWritable when you use the hive-1.2 profile.

sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/DaysWritable.scala
sql/core/v1.2/src/main/scala/org/apache/spark/sql/execution/datasources/orc/DaysWritable.scala

@dongjoon-hyun
Member

My bad. I overlooked the difference in HiveInspectors in the hive-1.2 profile.

@dongjoon-hyun dongjoon-hyun changed the title [SPARK-31238][SQL][test-hive1.2] Rebase dates to/from Julian calendar in write/read for ORC datasource [SPARK-31238][SQL] Rebase dates to/from Julian calendar in write/read for ORC datasource Mar 26, 2020
Member

@dongjoon-hyun dongjoon-hyun left a comment


+1, LGTM. Thank you, @MaxGekk and @cloud-fan .
Merged to master/3.0.

dongjoon-hyun pushed a commit that referenced this pull request Mar 26, 2020
… for ORC datasource

### What changes were proposed in this pull request?

This PR (SPARK-31238) aims to do the following.
1. Modified the ORC vectorized reader, in particular OrcColumnVector v1.2 and v2.3. After the changes, it uses `DateTimeUtils.rebaseJulianToGregorianDays()` added by #27915. The method rebases days from the hybrid calendar (Julian + Gregorian) to the Proleptic Gregorian calendar: it builds a local date in the original calendar, extracts the date fields `year`, `month` and `day` from that local date, and builds another local date in the target calendar. After that, it calculates the number of days since the epoch `1970-01-01` for the resulting local date.
2. Introduced rebasing of dates while saving ORC files; in particular, I modified `OrcShimUtils.getDateWritable` v1.2 and v2.3 to return `DaysWritable` instead of Hive's `DateWritable`. The `DaysWritable` class was added by the PR #27890 (and fixed by #27962). I moved `DaysWritable` from `sql/hive` to `sql/core` to reuse it in the ORC datasource.

### Why are the changes needed?
For backward compatibility with Spark 2.4 and earlier versions. The changes allow users to read dates/timestamps saved by previous versions and get the same results.

### Does this PR introduce any user-facing change?
Yes. Before the changes, loading the date `1200-01-01` saved by Spark 2.4.5 returns the following:
```scala
scala> spark.read.orc("/Users/maxim/tmp/before_1582/2_4_5_date_orc").show(false)
+----------+
|dt        |
+----------+
|1200-01-08|
+----------+
```
After the changes
```scala
scala> spark.read.orc("/Users/maxim/tmp/before_1582/2_4_5_date_orc").show(false)
+----------+
|dt        |
+----------+
|1200-01-01|
+----------+
```

### How was this patch tested?
- By running `OrcSourceSuite` and `HiveOrcSourceSuite`.
- Added a new test `SPARK-31238: compatibility with Spark 2.4 in reading dates` to `OrcSuite` which reads an ORC file saved by Spark 2.4.5 via the commands:
```shell
$ export TZ="America/Los_Angeles"
```
```scala
scala> sql("select cast('1200-01-01' as date) dt").write.mode("overwrite").orc("/Users/maxim/tmp/before_1582/2_4_5_date_orc")
scala> spark.read.orc("/Users/maxim/tmp/before_1582/2_4_5_date_orc").show(false)
+----------+
|dt        |
+----------+
|1200-01-01|
+----------+
```
- Added a round trip test `SPARK-31238: rebasing dates in write`. The test `SPARK-31238: compatibility with Spark 2.4 in reading dates` confirms rebasing on read, so the round trip checks rebasing on write.

Closes #28016 from MaxGekk/rebase-date-orc.

Authored-by: Maxim Gekk <[email protected]>
Signed-off-by: Dongjoon Hyun <[email protected]>
(cherry picked from commit d72ec85)
Signed-off-by: Dongjoon Hyun <[email protected]>
@dongjoon-hyun
Member

dongjoon-hyun commented Mar 27, 2020

Hi, @MaxGekk .
Unfortunately, this seems to cause a consistent failure in all Maven Jenkins jobs on both master and branch-3.0. Could you take a look at this?

- SPARK-31238: compatibility with Spark 2.4 in reading dates *** FAILED ***
  java.lang.IllegalArgumentException: java.net.URISyntaxException: Relative path in absolute URI: jar:file:/home/jenkins/workspace/spark-branch-3.0-test-maven-hadoop-2.7-hive-1.2/sql/core/target/spark-sql_2.12-3.0.0-SNAPSHOT-tests.jar!/test-data/before_1582_date_v2_4.snappy.orc
  at org.apache.hadoop.fs.Path.initialize(Path.java:205)
  at org.apache.hadoop.fs.Path.<init>(Path.java:171)
  at org.apache.spark.sql.execution.streaming.FileStreamSink$.hasMetadata(FileStreamSink.scala:45)


sjincho pushed a commit to sjincho/spark that referenced this pull request Apr 15, 2020
@MaxGekk MaxGekk deleted the rebase-date-orc branch June 5, 2020 19:46