[SPARK-34424][SQL][TESTS] Fix failures of HiveOrcHadoopFsRelationSuite #31552
Conversation
@dongjoon-hyun Could you take a look at this PR, please?

Kubernetes integration test starting

Thank you for pinging me, @MaxGekk!

Kubernetes integration test status failure
dongjoon-hyun
left a comment
Thank you so much for working on these. I have two suggestions.
- Actually, this seems to reduce the existing test coverage for all data sources (especially Parquet/Avro) by reducing the input scope. We had better keep the test coverage for Parquet/Avro if they work. Do you think we can narrow this down to ORC only?
- For ORC, it would be great if we could have a test case for the example random seed. We can keep the test case in the `ignore` state with a JIRA ID (see the sketch below).
What do you think about the above, @MaxGekk?
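A minimal ScalaTest sketch of what keeping such a test in the `ignore` state could look like (the suite name, test title, and body are illustrative, not taken from this PR):

```scala
import org.scalatest.funsuite.AnyFunSuite

// Illustrative only: `ignore` keeps the test in the code base without running it,
// and the JIRA ID in the title records why it is disabled and when to re-enable it.
class OrcRandomSeedSuite extends AnyFunSuite {
  ignore("SPARK-34440: test all data types with the failing random seed 610710213676") {
    // Reproduce the ORC write/read round trip with the problematic seed here.
  }
}
```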
Test build #135126 has finished for PR 31552 at commit
@dongjoon-hyun Thank you for your quick response.
Yep, it does. I should highlight that the "test all data types" test checks an end-to-end scenario, and if there are any conversions between the Julian and Proleptic Gregorian calendars, the test fails on some seeds. That's why we forcibly set the rebasing mode (see spark/sql/hive/src/test/scala/org/apache/spark/sql/sources/HadoopFsRelationTest.scala, lines 157 to 159 in ba13b94) to avoid the dates that don't exist in one of the calendars. At the same time, ORC is in fact tested in the "LEGACY" mode, where we perform datetime rebasing between calendars. So, if we enabled "LEGACY" for Avro or Parquet, they would fail as well.

We can exclude some date ranges like 1582-10-05 .. 1582-10-15, plus 29 Feb in some leap years. In that case, we could test Avro/Parquet in the "LEGACY" mode too (and remove the SQL config settings shown above). To me, the case of ORC's dates (and timestamps too) seems similar to Parquet's INT96 timestamps. The ORC spec says nothing about calendar systems (https://orc.apache.org/specification/ORCv2/); it just mentions the offset in days from the epoch.
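A minimal sketch of such a validity filter, assuming we accept a generated date only if Java's hybrid Julian/Gregorian calendar resolves exactly the same year/month/day fields (this is an illustration of the idea, not the exact code added by this PR):

```scala
import java.time.LocalDate
import java.util.{GregorianCalendar, TimeZone}
import java.util.Calendar.{DAY_OF_MONTH, MONTH, YEAR}

// A date is treated as valid in both calendars if the hybrid calendar
// (Julian before 1582-10-15, Gregorian after) reproduces the same fields.
// This rejects the non-existent days 1582-10-05 .. 1582-10-14.
def validInBothCalendars(date: LocalDate): Boolean = {
  val hybrid = new GregorianCalendar(TimeZone.getTimeZone("UTC"))
  hybrid.clear()
  hybrid.set(date.getYear, date.getMonthValue - 1, date.getDayOfMonth)
  hybrid.get(YEAR) == date.getYear &&
    hybrid.get(MONTH) == date.getMonthValue - 1 &&
    hybrid.get(DAY_OF_MONTH) == date.getDayOfMonth
}

// validInBothCalendars(LocalDate.of(1582, 10, 6))   // false: shifted on the ORC/Julian side
// validInBothCalendars(LocalDate.of(1582, 10, 15))  // true
```

A generator can simply retry with new random values until this predicate holds, which keeps the write/read round-trip comparison stable.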
That makes sense. At least then we can store the data that gets generated internally and read it back. It would take some work for backward compatibility, just like for Parquet -- e.g., we'd have to add metadata to the ORC files, and if that's not present, we'd need to detect which system wrote the file and base the read-side rebasing decision on that. FWIW, I think the data generator limitations should be explicitly tweaked to match the expectations of the test. I.e., if we expect the test won't handle some kind of date correctly, then and only then do we turn those dates off.
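If ORC files were ever to carry calendar information, a hedged sketch of the idea (assuming ORC's user-metadata API, `Writer.addUserMetadata` and `Reader.hasMetadataValue`/`getMetadataValue`; the metadata key and values here are hypothetical, not something Spark or ORC defines) might look like this:

```scala
import java.nio.charset.StandardCharsets.UTF_8
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.Path
import org.apache.orc.{OrcFile, TypeDescription}

val conf = new Configuration()
val path = new Path("/tmp/example.orc")

// Write side: tag the file with the calendar the writer used.
val writer = OrcFile.createWriter(path,
  OrcFile.writerOptions(conf).setSchema(TypeDescription.fromString("struct<d:date>")))
writer.addUserMetadata("writer.calendar", UTF_8.encode("proleptic-gregorian")) // hypothetical key
writer.close()

// Read side: fall back to "unknown" for files written before the tag existed,
// and decide the rebasing behaviour from other hints (e.g. the writer version).
val reader = OrcFile.createReader(path, OrcFile.readerOptions(conf))
val calendar =
  if (reader.hasMetadataValue("writer.calendar")) {
    UTF_8.decode(reader.getMetadataValue("writer.calendar")).toString
  } else {
    "unknown"
  }
```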
Kubernetes integration test starting

Kubernetes integration test status success

Test build #135155 has finished for PR 31552 at commit
HyukjinKwon
left a comment
LGTM
I agree with @dongjoon-hyun's and @bart-samwel's points. Thanks for addressing it, @MaxGekk.

Merged to master. @MaxGekk, can we port this back to branch-3.1 and branch-3.0?
What changes were proposed in this pull request?
Modify `RandomDataGenerator.forType()` to allow generation of dates/timestamps that are valid in both Julian and Proleptic Gregorian calendars. Currently, the function can produce a date (for example `1582-10-06`) which is valid in the Proleptic Gregorian calendar but cannot be saved to ORC files as is, since the ORC format (the ORC libs, in fact) assumes the Julian calendar. So Spark shifts `1582-10-06` to the next valid date, `1582-10-15`, while saving it to ORC files. As a consequence, the test fails because it compares the original date `1582-10-06` with the date `1582-10-15` loaded back from the ORC files.
In this PR, I propose to generate dates/timestamps that are valid in both calendars for the ORC datasource until SPARK-34440 is resolved.

Why are the changes needed?
The changes fix failures of `HiveOrcHadoopFsRelationSuite`. For instance, the test "test all data types" fails with the seed **610710213676**:
```
== Results ==
!== Correct Answer - 20 ==     == Spark Answer - 20 ==
 struct<index:int,col:date>    struct<index:int,col:date>
...
![9,1582-10-06]                [9,1582-10-15]
```

Does this PR introduce any user-facing change?
No

How was this patch tested?
By running the modified test suite:
```
$ build/sbt -Phive -Phive-thriftserver "test:testOnly *HiveOrcHadoopFsRelationSuite"
```

Closes apache#31552 from MaxGekk/fix-HiveOrcHadoopFsRelationSuite.

Authored-by: Max Gekk <[email protected]>
Signed-off-by: HyukjinKwon <[email protected]>
(cherry picked from commit 0316105)
Signed-off-by: Max Gekk <[email protected]>
Here is the backport to |
What changes were proposed in this pull request?
Modify `RandomDataGenerator.forType()` to allow generation of dates/timestamps that are valid in both Julian and Proleptic Gregorian calendars. Currently, the function can produce a date (for example `1582-10-06`) which is valid in the Proleptic Gregorian calendar but cannot be saved to ORC files as is, since the ORC format (the ORC libs, in fact) assumes the Julian calendar. So Spark shifts `1582-10-06` to the next valid date, `1582-10-15`, while saving it to ORC files. As a consequence, the test fails because it compares the original date `1582-10-06` with the date `1582-10-15` loaded back from the ORC files.
In this PR, I propose to generate dates/timestamps that are valid in both calendars for the ORC datasource until SPARK-34440 is resolved.
Why are the changes needed?
The changes fix failures of `HiveOrcHadoopFsRelationSuite`. For instance, the test "test all data types" fails with the seed 610710213676:
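```
== Results ==
!== Correct Answer - 20 ==     == Spark Answer - 20 ==
 struct<index:int,col:date>    struct<index:int,col:date>
...
![9,1582-10-06]                [9,1582-10-15]
```

Does this PR introduce any user-facing change?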
No
How was this patch tested?
By running the modified test suite:
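```
$ build/sbt -Phive -Phive-thriftserver "test:testOnly *HiveOrcHadoopFsRelationSuite"
```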