Skip to content

Conversation

@MaxGekk
Copy link
Member

@MaxGekk MaxGekk commented Feb 12, 2021

What changes were proposed in this pull request?

Modify RandomDataGenerator.forType() to allow generation of dates/timestamps that are valid in both Julian and Proleptic Gregorian calendars. Currently, the function can produce a date (for example 1582-10-06) which is valid in the Proleptic Gregorian calendar. Though it cannot be saved to ORC files AS IS since ORC format (ORC libs in fact) assumes Julian calendar. So, Spark shifts 1582-10-06 to the next valid date 1582-10-15 while saving it to ORC files. And as a consequence of that, the test fails because it compares original date 1582-10-06 and the date 1582-10-15 loaded back from the ORC files.

In this PR, I propose to generate valid dates/timestamps in both calendars for ORC datasource till SPARK-34440 is resolved.

Why are the changes needed?

The changes fix failures of HiveOrcHadoopFsRelationSuite. For instance, the test "test all data types" fails with the seed 610710213676:

== Results ==
!== Correct Answer - 20 ==    == Spark Answer - 20 ==
 struct<index:int,col:date>   struct<index:int,col:date>
...
![9,1582-10-06]               [9,1582-10-15]

Does this PR introduce any user-facing change?

No

How was this patch tested?

By running the modified test suite:

$ build/sbt -Phive -Phive-thriftserver "test:testOnly *HiveOrcHadoopFsRelationSuite"

@github-actions github-actions bot added the SQL label Feb 12, 2021
@MaxGekk
Copy link
Member Author

MaxGekk commented Feb 12, 2021

@dongjoon-hyun Could you take a look at this PR, please.

@SparkQA
Copy link

SparkQA commented Feb 12, 2021

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/39707/

@dongjoon-hyun
Copy link
Member

Thank you for pinging me, @MaxGekk !

@SparkQA
Copy link

SparkQA commented Feb 12, 2021

Kubernetes integration test status failure
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/39707/

Copy link
Member

@dongjoon-hyun dongjoon-hyun left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you so much for working on these. I have two suggestions.

  1. Actually, this seems to reduce an existing test coverage for all data sources (especially Parquet/Avro) by reducing the input scope. We had better keep the test coverage for Parquet/Avro if they works. Do you think we can narrow down to ORC only?

  2. For ORC, it would be great if we can have a test case for the example random seed. We can keep the test case as ignore state with a JIRA ID.

How do you think about the above, @MaxGekk ?

@SparkQA
Copy link

SparkQA commented Feb 13, 2021

Test build #135126 has finished for PR 31552 at commit 0222def.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@MaxGekk
Copy link
Member Author

MaxGekk commented Feb 13, 2021

@dongjoon-hyun Thank you for your quick response.

Actually, this seems to reduce an existing test coverage for all data sources ...

Yep, it does. I should highlight that the "test all data types" test checks end-to-end scenario, and if there are any conversions between calendars Julian <-> Gregorian, the test fails on some seeds. That's why we forcibly set the rebasing mode to CORRECTED for Avro and Parquet in the test, see

SQLConf.LEGACY_PARQUET_REBASE_MODE_IN_WRITE.key -> CORRECTED.toString,
SQLConf.LEGACY_PARQUET_INT96_REBASE_MODE_IN_WRITE.key -> CORRECTED.toString,
SQLConf.LEGACY_AVRO_REBASE_MODE_IN_WRITE.key -> CORRECTED.toString) {

to avoid the dates that don't exist in one of the calendars.

At the same time, ORC is tested in the "LEGACY" mode in fact, where we perform datetime rebasing between calendars. So, if we would enable "LEGACY" for Avro or Parquet, they will fail as well.

Do you think we can narrow down to ORC only?

We can exclude some date ranges like 1582-10-05 .. 1582-10-15 + 29 Feb in some leap years. In that case, we can test Avro/Parquet in the "LEGACY" mode too (and remove the SQL config settings showed above).

For me, the case of ORC's date (and timestamps too) seems similar to Parquet's INT96 timestamps. The ORC spec says nothing about the calendar systems (https://orc.apache.org/specification/ORCv2/), and it just mentions the offset in days from the epoch:
"
Date data is encoded with a PRESENT stream, a DATA stream that records the number of days after January 1, 1970 in UTC
"

Since DATE just stores as number of days, the calendar system is not "hard coded" in the spec. I think we should support the "CORRECTED" mode (via a SQL config or/and a DS option) in the ORC datasource too as we did that recently for Parquet INT96 in #30056. @cloud-fan @bart-samwel WDYT?

@bart-samwel
Copy link

For me, the case of ORC's date (and timestamps too) seems similar to Parquet's INT96 timestamps. The ORC spec says nothing about the calendar systems (https://orc.apache.org/specification/ORCv2/), and it just mentions the offset in days from the epoch:
" Date data is encoded with a PRESENT stream, a DATA stream that records the number of days after January 1, 1970 in UTC "
Since DATE just stores as number of days, the calendar system is not "hard coded" in the spec. I think we should support the "CORRECTED" mode (via a SQL config or/and a DS option) in the ORC datasource too as we did that recently for Parquet INT96 in #30056. @cloud-fan @bart-samwel WDYT?

That makes sense. At least then we can store the data that gets generated internally and read it back. It would take some work for backward compatibility just like for Parquet -- e.g. we'd have to add metadata to the ORC files, and if that's not present, we'd need to detect which system wrote the file and base the read rebasing decision on that.

FWIW, I think the data generator limitations should be explicitly tweaked for the tests to match the expectations of the test. I.e., if we expect the test won't handle some kind of date correctly, then and only then do we turn those off.

@SparkQA
Copy link

SparkQA commented Feb 15, 2021

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/39736/

@SparkQA
Copy link

SparkQA commented Feb 15, 2021

Kubernetes integration test status success
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/39736/

@SparkQA
Copy link

SparkQA commented Feb 15, 2021

Test build #135155 has finished for PR 31552 at commit 2b24fc4.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

Copy link
Member

@HyukjinKwon HyukjinKwon left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@HyukjinKwon
Copy link
Member

I agree with @dongjoon-hyun and @bart-samwel points. Thanks for addressing it @MaxGekk.

@HyukjinKwon
Copy link
Member

Merged to master.

@MaxGekk can we port this back to branch-3.1 and branch-3.0?

MaxGekk added a commit to MaxGekk/spark that referenced this pull request Feb 19, 2021
Modify `RandomDataGenerator.forType()` to allow generation of dates/timestamps that are valid in both Julian and Proleptic Gregorian calendars. Currently, the function can produce a date (for example `1582-10-06`) which is valid in the Proleptic Gregorian calendar. Though it cannot be saved to ORC files AS IS since ORC format (ORC libs in fact) assumes Julian calendar. So, Spark shifts `1582-10-06` to the next valid date `1582-10-15` while saving it to ORC files. And as a consequence of that, the test fails because it compares original date `1582-10-06` and the date `1582-10-15` loaded back from the ORC files.

In this PR, I propose to generate valid dates/timestamps in both calendars for ORC datasource till SPARK-34440 is resolved.

The changes fix failures of `HiveOrcHadoopFsRelationSuite`. For instance, the test "test all data types" fails with the seed **610710213676**:
```
== Results ==
!== Correct Answer - 20 ==    == Spark Answer - 20 ==
 struct<index:int,col:date>   struct<index:int,col:date>
...
![9,1582-10-06]               [9,1582-10-15]
```

No

By running the modified test suite:
```
$ build/sbt -Phive -Phive-thriftserver "test:testOnly *HiveOrcHadoopFsRelationSuite"
```

Closes apache#31552 from MaxGekk/fix-HiveOrcHadoopFsRelationSuite.

Authored-by: Max Gekk <[email protected]>
Signed-off-by: HyukjinKwon <[email protected]>
(cherry picked from commit 0316105)
Signed-off-by: Max Gekk <[email protected]>
@MaxGekk
Copy link
Member Author

MaxGekk commented Feb 19, 2021

Here is the backport to branch-3.1 and branch-3.0: #31589

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants