[SPARK-31238][SQL] Rebase dates to/from Julian calendar in write/read for ORC datasource #28016
Conversation
Test build #120324 has finished for PR 28016 at commit
So, does this happen only in the vectorized reader, @MaxGekk ?
Test build #120345 has finished for PR 28016 at commit
@dongjoon-hyun Correct, the regular reader uses `OrcDeserializer` (sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/orc/OrcDeserializer.scala, line 111 in 300ec1a), and the PRs #27807 and #27980 introduced date rebasing there. Also, I fixed the ORC writer and added a round-trip test for both the vectorized and non-vectorized readers.
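For readers following along, here is a minimal Scala sketch of the rebasing idea behind `DateTimeUtils.rebaseJulianToGregorianDays` (a simplification, not Spark's implementation: it leans on `GregorianCalendar`'s default hybrid cut-over and ignores BCE eras):

```scala
import java.time.LocalDate
import java.util.{Calendar, GregorianCalendar, TimeZone}

// Interpret `julianDays` (days since 1970-01-01) in the hybrid
// Julian/Gregorian calendar, extract the date fields, and rebuild
// the same fields as a Proleptic Gregorian date.
def rebaseJulianToGregorianDays(julianDays: Int): Int = {
  val hybrid = new GregorianCalendar(TimeZone.getTimeZone("UTC"))
  hybrid.clear()
  hybrid.setTimeInMillis(julianDays * 86400000L)
  val rebased = LocalDate.of(          // java.time is Proleptic Gregorian
    hybrid.get(Calendar.YEAR),
    hybrid.get(Calendar.MONTH) + 1,    // Calendar months are 0-based
    hybrid.get(Calendar.DAY_OF_MONTH))
  Math.toIntExact(rebased.toEpochDay)  // rebased days since the epoch
}
```

Around the year 1200 the two calendars disagree by 7 days, which is exactly the `1200-01-08` vs `1200-01-01` difference shown in the PR description below.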
I think the failure isn't related to my changes: #28016 (comment). I will update the PR description shortly, so the PR will be ready for review. /cc @cloud-fan
jenkins, retest this, please
@cloud-fan @dongjoon-hyun @bersprockets Please review this PR.
```scala
    var julianDays: Int)
  extends DateWritable {

  def this() = this(0, 0)
```
I assume that `gregorianDays` and `julianDays` will be set later via the `set` method.
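A rough Scala sketch of the class shape under review; the `set` helper and the `getDays` override here are assumptions added to illustrate the comment, not the exact Spark code:

```scala
import org.apache.hadoop.hive.serde2.io.DateWritable

// Carries days in both calendars: `gregorianDays` for Catalyst, and
// `julianDays` for Hive/ORC, which expect the hybrid calendar.
class DaysWritable(
    var gregorianDays: Int,
    var julianDays: Int)
  extends DateWritable {

  // Hadoop instantiates Writables reflectively through the no-arg
  // constructor and populates them afterwards, hence the zeros.
  def this() = this(0, 0)

  // Hypothetical setter filling both fields after construction.
  def set(gregorian: Int, julian: Int): Unit = {
    gregorianDays = gregorian
    julianDays = julian
  }

  // Hive serializes whatever getDays() returns, so expose Julian days.
  override def getDays(): Int = julianDays
}
```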
Test build #120367 has finished for PR 28016 at commit
Retest this please.
dongjoon-hyun left a comment
+1, LGTM (pending Jenkins with hive-1.2 to make sure).
Thank you for the swift fix, @MaxGekk !
Test build #120379 has finished for PR 28016 at commit
Hi, @MaxGekk .
sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/DaysWritable.scala
```diff
 public int getInt(int rowId) {
-  return (int) longData.vector[getRowIndex(rowId)];
+  int index = getRowIndex(rowId);
+  int value = (int) longData.vector[index];
```
nit: `int value = (int) longData.vector[getRowIndex(rowId)];`
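To see the intent in one place, a Scala sketch of the logic the vectorized reader ends up with (a simplified stand-in: the real `OrcColumnVector` is Java, the `isDate` flag and flat `vector` array are assumptions, and `rebaseJulianToGregorianDays` is the sketch from earlier in this thread):

```scala
// Simplified column wrapper: reads the raw ORC value once, then
// rebases it only when the column is a DATE column.
class DateColumnSketch(vector: Array[Long], isDate: Boolean) {
  def getInt(rowId: Int): Int = {
    val value = vector(rowId).toInt
    if (isDate) rebaseJulianToGregorianDays(value) else value
  }
}
```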
sql/core/v1.2/src/main/java/org/apache/spark/sql/execution/datasources/orc/OrcColumnVector.java
```scala
import org.apache.spark.sql.catalyst.InternalRow
import org.apache.spark.sql.catalyst.expressions._
import org.apache.spark.sql.catalyst.util._
import org.apache.spark.sql.execution.datasources.DaysWritable
```
unnecessary change
If I remove it, the build fails with:
```
Error:(1012, 44) not found: type DaysWritable
private def getDateWritable(value: Any): DaysWritable =
```
I moved `DaysWritable` from `sql/hive` to `sql/core` to reuse it in ORC.
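For the write path, a hedged sketch of what a `getDateWritable` returning `DaysWritable` could look like; the body is an assumption, and `rebaseGregorianToJulianDays` stands for the inverse of the read-side rebasing sketched earlier, not a confirmed `OrcShimUtils` API:

```scala
// Catalyst passes DATE values as Int days in the Proleptic Gregorian
// calendar; DaysWritable lets ORC serialize hybrid-calendar days.
private def getDateWritable(value: Any): DaysWritable =
  if (value == null) {
    null
  } else {
    val gregorianDays = value.asInstanceOf[Int]
    new DaysWritable(gregorianDays, rebaseGregorianToJulianDays(gregorianDays))
  }
```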
Test build #120403 has finished for PR 28016 at commit
jenkins, retest this, please
Test build #120408 has finished for PR 28016 at commit
Test build #120416 has finished for PR 28016 at commit
dongjoon-hyun left a comment
Please move `sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/DaysWritable.scala` to `sql/core/v2.3/src/main/scala/org/apache/spark/sql/execution/datasources/orc/DaysWritable.scala`.
```scala
 * This is a clone of `org.apache.spark.sql.execution.datasources.DaysWritable`.
 * The class is cloned because Hive ORC v1.2 uses different `DateWritable`:
 * - v1.2: `org.apache.orc.storage.serde2.io.DateWritable`
 * - v2.3 and `HiveInspectors`: `org.apache.hadoop.hive.serde2.io.DateWritable`
```
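To make the clone's motivation concrete: the two copies declare the same class name against different `DateWritable` imports, so neither can be shared across the shims (the shortened paths below follow the layout described in this thread):

```scala
// sql/core/v1.2/.../datasources/orc/DaysWritable.scala (the clone)
import org.apache.orc.storage.serde2.io.DateWritable

// sql/core/src/.../datasources/DaysWritable.scala, used by the v2.3
// shim and by HiveInspectors
import org.apache.hadoop.hive.serde2.io.DateWritable
```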
We don't need the above 4-line comment because the v1.2 and v2.3 folder structure is already designed for that.
If I move it to 2.3 and enable hive-1.2, `HiveInspectors` will use `DaysWritable` from v1.2, which is wrong. Or am I missing something in your proposal?
Got it. Let me check again.
Sorry, why do you want to move this to 2.3?
What I'm asking is to keep this AS-IS and move the other `DaysWritable`.
I'm checking again from my side.
This is what I tried in MaxGekk#26, and it fails when I enable hive-1.2:
```shell
build/sbt -Phive-1.2 clean package
```
Feel free to open a PR against this PR if you mean something different.
```scala
import org.apache.hadoop.hive.serde2.io.{DateWritable, HiveDecimalWritable}

import org.apache.spark.sql.catalyst.expressions.SpecializedGetters
import org.apache.spark.sql.execution.datasources.DaysWritable
```
After moving `DaysWritable.scala` here, I guess we don't need this line. Please try to remove it.
Why?
Ur, are you asking why we keep …
My bad. I overlooked the difference in …
dongjoon-hyun left a comment
+1, LGTM. Thank you, @MaxGekk and @cloud-fan .
Merged to master/3.0.
[SPARK-31238][SQL] Rebase dates to/from Julian calendar in write/read for ORC datasource

### What changes were proposed in this pull request?

This PR (SPARK-31238) aims at the following:

1. Modified the ORC vectorized reader, in particular, `OrcColumnVector` v1.2 and v2.3. After the changes, it uses `DateTimeUtils.rebaseJulianToGregorianDays()` added by #27915. The method rebases days from the hybrid calendar (Julian + Gregorian) to the Proleptic Gregorian calendar: it builds a local date in the original calendar, extracts the date fields `year`, `month` and `day` from the local date, builds another local date in the target calendar, and then calculates the days from the epoch `1970-01-01` for the resulting local date.
2. Introduced rebasing of dates while saving ORC files; in particular, I modified `OrcShimUtils.getDateWritable` v1.2 and v2.3 to return `DaysWritable` instead of Hive's `DateWritable`. The `DaysWritable` class was added by PR #27890 (and fixed by #27962). I moved `DaysWritable` from `sql/hive` to `sql/core` to reuse it in the ORC datasource.

### Why are the changes needed?

For backward compatibility with Spark 2.4 and earlier versions. The changes allow users to read dates/timestamps saved by previous versions and get the same result.

### Does this PR introduce any user-facing change?

Yes. Before the changes, loading the date `1200-01-01` saved by Spark 2.4.5 returns the following:
```scala
scala> spark.read.orc("/Users/maxim/tmp/before_1582/2_4_5_date_orc").show(false)
+----------+
|dt        |
+----------+
|1200-01-08|
+----------+
```
After the changes:
```scala
scala> spark.read.orc("/Users/maxim/tmp/before_1582/2_4_5_date_orc").show(false)
+----------+
|dt        |
+----------+
|1200-01-01|
+----------+
```

### How was this patch tested?

- By running `OrcSourceSuite` and `HiveOrcSourceSuite`.
- Added the new test `SPARK-31238: compatibility with Spark 2.4 in reading dates` to `OrcSuite`, which reads an ORC file saved by Spark 2.4.5 via the commands:
```shell
$ export TZ="America/Los_Angeles"
```
```scala
scala> sql("select cast('1200-01-01' as date) dt").write.mode("overwrite").orc("/Users/maxim/tmp/before_1582/2_4_5_date_orc")
scala> spark.read.orc("/Users/maxim/tmp/before_1582/2_4_5_date_orc").show(false)
+----------+
|dt        |
+----------+
|1200-01-01|
+----------+
```
- Added the round-trip test `SPARK-31238: rebasing dates in write`. The test `SPARK-31238: compatibility with Spark 2.4 in reading dates` confirms rebasing in read, so we can check rebasing in write.

Closes #28016 from MaxGekk/rebase-date-orc.

Authored-by: Maxim Gekk <[email protected]>
Signed-off-by: Dongjoon Hyun <[email protected]>
(cherry picked from commit d72ec85)
Signed-off-by: Dongjoon Hyun <[email protected]>
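As a closing illustration, a hedged sketch of what the round-trip test named above might look like inside `OrcSuite` (the test name comes from the commit message; the body is an assumption built on Spark's standard test helpers `withTempPath`, `checkAnswer`, and `toDF`, not the committed test):

```scala
test("SPARK-31238: rebasing dates in write") {
  withTempPath { dir =>
    val path = dir.getCanonicalPath
    // Write a pre-Gregorian-cutover date: the writer must rebase it
    // from Proleptic Gregorian to the hybrid calendar on disk.
    Seq("1001-01-01").toDF("str")
      .select($"str".cast("date").as("dt"))
      .write.orc(path)
    // Reading it back must rebase in the opposite direction and
    // return exactly the original date.
    checkAnswer(
      spark.read.orc(path),
      Row(java.sql.Date.valueOf("1001-01-01")))
  }
}
```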
Hi, @MaxGekk .