
@MaxGekk
Member

@MaxGekk MaxGekk commented May 24, 2020

What changes were proposed in this pull request?

  1. Add the following parquet files to the resource folder sql/core/src/test/resources/test-data:
    • Files saved by Spark 2.4.5 (cee4ecb) without meta info org.apache.spark.version
      • before_1582_date_v2_4_5.snappy.parquet with 2 date columns of the type INT32 L:DATE - PLAIN (8 date values of 1001-01-01) and PLAIN_DICTIONARY (1001-01-01..1001-01-08).
      • before_1582_timestamp_micros_v2_4_5.snappy.parquet with 2 timestamp columns of the type INT64 L:TIMESTAMP(MICROS,true) - PLAIN (8 timestamp values of 1001-01-01 01:02:03.123456) and PLAIN_DICTIONARY (1001-01-01 01:02:03.123456..1001-01-08 01:02:03.123456).
      • before_1582_timestamp_millis_v2_4_5.snappy.parquet with 2 timestamp columns of the type INT64 L:TIMESTAMP(MILLIS,true) - PLAIN (8 timestamp values of 1001-01-01 01:02:03.123) and PLAIN_DICTIONARY (1001-01-01 01:02:03.123..1001-01-08 01:02:03.123).
      • before_1582_timestamp_int96_plain_v2_4_5.snappy.parquet with 2 timestamp columns of the type INT96 - PLAIN (8 timestamp values of 1001-01-01 01:02:03.123456) and PLAIN (1001-01-01 01:02:03.123456..1001-01-08 01:02:03.123456).
      • before_1582_timestamp_int96_dict_v2_4_5.snappy.parquet with 2 timestamp columns of the type INT96 - PLAIN_DICTIONARY (8 timestamp values of 1001-01-01 01:02:03.123456) and PLAIN_DICTIONARY (1001-01-01 01:02:03.123456..1001-01-08 01:02:03.123456).
    • Files saved by Spark 2.4.6-rc3 (570848d) with the meta info org.apache.spark.version = 2.4.6:
      • before_1582_date_v2_4_6.snappy.parquet replaces before_1582_date_v2_4.snappy.parquet. It is similar to before_1582_date_v2_4_5.snappy.parquet except for the Spark version in the parquet meta info.
      • before_1582_timestamp_micros_v2_4_6.snappy.parquet replaces before_1582_timestamp_micros_v2_4.snappy.parquet. It is similar to before_1582_timestamp_micros_v2_4_5.snappy.parquet except for the meta info.
      • before_1582_timestamp_millis_v2_4_6.snappy.parquet replaces before_1582_timestamp_millis_v2_4.snappy.parquet. It is similar to before_1582_timestamp_millis_v2_4_5.snappy.parquet except for the meta info.
      • before_1582_timestamp_int96_plain_v2_4_6.snappy.parquet is similar to before_1582_timestamp_int96_plain_v2_4_5.snappy.parquet except for the meta info.
      • before_1582_timestamp_int96_dict_v2_4_6.snappy.parquet replaces before_1582_timestamp_int96_v2_4.snappy.parquet. It is similar to before_1582_timestamp_int96_dict_v2_4_5.snappy.parquet except for the meta info.
  2. Add a new test, "generate test files for checking compatibility with Spark 2.4", to ParquetIOSuite (marked as ignored). The parquet files above were generated by this test.
  3. Modify "SPARK-31159: compatibility with Spark 2.4 in reading dates/timestamps" in ParquetIOSuite to use the new parquet files.
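For context, the calendar shift these files exercise can be reproduced with the JDK alone. The sketch below is not part of the PR; it only illustrates the hybrid Julian/Gregorian vs. proleptic Gregorian mismatch (using UTC for determinism) that makes the epoch day Spark 2.4 stored for 1001-01-01 read back as 1001-01-07 without rebasing:

```java
import java.time.LocalDate;
import java.util.GregorianCalendar;
import java.util.TimeZone;

public class RebaseDemo {
    public static void main(String[] args) {
        // Spark 2.4 resolved dates with java.util's hybrid Julian/Gregorian calendar.
        GregorianCalendar hybrid = new GregorianCalendar(TimeZone.getTimeZone("UTC"));
        hybrid.clear();
        hybrid.set(1001, 0, 1); // 1001-01-01 in the hybrid calendar
        long hybridEpochDay = hybrid.getTimeInMillis() / 86_400_000L;

        // Spark 3.0 interprets the stored epoch day with the proleptic Gregorian
        // calendar (java.time), so the same number maps to a shifted local date.
        System.out.println(LocalDate.ofEpochDay(hybridEpochDay)); // 1001-01-07
    }
}
```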

Why are the changes needed?

To improve test coverage.

Does this PR introduce any user-facing change?

No

How was this patch tested?

By running ParquetIOSuite.

@MaxGekk MaxGekk changed the title [WIP][SPARK-31806][SQL][TESTS] Check reading date/timestamp from Parquet: dictionary encoding, w/ Spark version [WIP][SPARK-31806][SQL][TESTS] Check reading date/timestamp from legacy parquet: dictionary encoding, w/ Spark version May 24, 2020
@MaxGekk MaxGekk changed the title [WIP][SPARK-31806][SQL][TESTS] Check reading date/timestamp from legacy parquet: dictionary encoding, w/ Spark version [WIP][SPARK-31806][SQL][TESTS] Check reading date/timestamp from legacy parquet: dictionary encoding, w/o Spark version May 24, 2020
@SparkQA

SparkQA commented May 24, 2020

Test build #123062 has finished for PR 28630 at commit b0e4a32.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@MaxGekk
Member Author

MaxGekk commented May 25, 2020

@cloud-fan @mswit-databricks @HyukjinKwon Please, review this PR.

@MaxGekk MaxGekk changed the title [WIP][SPARK-31806][SQL][TESTS] Check reading date/timestamp from legacy parquet: dictionary encoding, w/o Spark version [SPARK-31806][SQL][TESTS] Check reading date/timestamp from legacy parquet: dictionary encoding, w/o Spark version May 25, 2020
@SparkQA

SparkQA commented May 25, 2020

Test build #123075 has finished for PR 28630 at commit 0add1a2.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

```scala
DateTimeTestUtils.withDefaultTimeZone(DateTimeTestUtils.LA) {
  withSQLConf(
    SQLConf.SESSION_LOCAL_TIMEZONE.key -> DateTimeTestUtils.LA.getId,
    SQLConf.LEGACY_PARQUET_REBASE_MODE_IN_WRITE.key -> "CORRECTED") {
```
Contributor


Why do we need this? We only run this test in 2.x to generate the files.

Member Author


It doesn't affect 2.4.5/2.4.6, but the default value causes an exception in 3.0/master. I added it to be able to debug the code in master; otherwise I got an exception.

Member Author


I removed the config setting

```scala
save(
  usTs,
  "timestamp",
  s"before_1582_timestamp_int96_plain_v$version.snappy.parquet",
```
Contributor

@cloud-fan cloud-fan May 25, 2020


Why do we have 2 files for plain and dictionary encoding for int96? Other types just have one file with 2 columns.

If it's caused by some parquet limitation, let's write a comment to explain it.

Member Author


Because INT96 always uses dictionary encoding, independent of the number of values and their uniqueness, I have to explicitly turn off dictionary encoding while saving to parquet files; see the test above.

Other types don't have such a "problem": for one column the parquet lib uses dictionary encoding because all values are unique, and for the other one it applies plain encoding because all values in the date/timestamp columns are the same.
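For context on why INT96 is special here: Spark writes INT96 timestamps as a fixed 12-byte value, nanoseconds of day followed by a Julian day number, in little-endian order. The sketch below is illustrative only (not code from this PR; the class and method names are made up) and packs a timestamp into that representation:

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

public class Int96Layout {
    // Pack a timestamp into Parquet's INT96 representation:
    // 8 bytes nanoseconds-of-day followed by 4 bytes Julian day, little-endian.
    static byte[] toInt96(int julianDay, long nanosOfDay) {
        ByteBuffer buf = ByteBuffer.allocate(12).order(ByteOrder.LITTLE_ENDIAN);
        buf.putLong(nanosOfDay);
        buf.putInt(julianDay);
        return buf.array();
    }

    public static void main(String[] args) {
        // 2451545 is the Julian day number of 2000-01-01;
        // 3_723_123_456_000 ns is 01:02:03.123456 as nanoseconds of day.
        byte[] bytes = toInt96(2451545, 3_723_123_456_000L);
        ByteBuffer buf = ByteBuffer.wrap(bytes).order(ByteOrder.LITTLE_ENDIAN);
        System.out.println(buf.getLong() + " ns, Julian day " + buf.getInt());
    }
}
```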

Member Author


I added a comment

@SparkQA

SparkQA commented May 25, 2020

Test build #123092 has finished for PR 28630 at commit 3f2b474.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@MaxGekk
Member Author

MaxGekk commented May 25, 2020

jenkins, retest this, please

@SparkQA

SparkQA commented May 26, 2020

Test build #123097 has finished for PR 28630 at commit 3f2b474.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@cloud-fan
Contributor

thanks, merging to master/3.0!

@cloud-fan cloud-fan closed this in 7e4f5bb May 26, 2020
cloud-fan pushed a commit that referenced this pull request May 26, 2020
…rquet: dictionary encoding, w/o Spark version


Closes #28630 from MaxGekk/parquet-files-update.

Authored-by: Max Gekk <[email protected]>
Signed-off-by: Wenchen Fan <[email protected]>
(cherry picked from commit 7e4f5bb)
Signed-off-by: Wenchen Fan <[email protected]>
@HyukjinKwon
Member

+1

@MaxGekk
Member Author

MaxGekk commented May 26, 2020

@HyukjinKwon @cloud-fan This build failure #28630 (comment) has been fixed by #28639

cloud-fan pushed a commit that referenced this pull request May 26, 2020
### What changes were proposed in this pull request?
Modified formatting of expected timestamp strings in the test `JavaBeanDeserializationSuite`.`testSpark22000` to correctly format timestamps with a **zero** seconds fraction. The current implementation outputs `.0`, but it must be an empty string. From the SPARK-31820 failure:
- should be `2020-05-25 12:39:17`
- but the incorrect expected string is `2020-05-25 12:39:17.0`

### Why are the changes needed?
To make `JavaBeanDeserializationSuite` stable, and avoid test failures like #28630 (comment)

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
I changed https://github.com/apache/spark/blob/7dff3b125de23a4d6ce834217ee08973b259414c/sql/core/src/test/java/test/org/apache/spark/sql/JavaBeanDeserializationSuite.java#L207 to
```java
new java.sql.Timestamp((System.currentTimeMillis() / 1000) * 1000),
```
to force zero seconds fraction.

Closes #28639 from MaxGekk/fix-JavaBeanDeserializationSuite.

Authored-by: Max Gekk <[email protected]>
Signed-off-by: Wenchen Fan <[email protected]>
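The `.0` suffix described in the commit message comes from `java.sql.Timestamp.toString()`, which always prints a fractional part even when it is zero. A minimal standalone reproduction (illustrative only; the class name is made up, not from the patch):

```java
import java.sql.Timestamp;

public class FractionDemo {
    public static void main(String[] args) {
        // Timestamp.toString() always emits a fractional part, even when it
        // is zero, so a formatter that drops ".0" disagrees with it.
        Timestamp ts = Timestamp.valueOf("2020-05-25 12:39:17");
        System.out.println(ts); // 2020-05-25 12:39:17.0

        // Truncating to whole seconds reproduces the flaky case from SPARK-31820.
        Timestamp truncated = new Timestamp((ts.getTime() / 1000) * 1000);
        System.out.println(truncated.equals(ts)); // true
    }
}
```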
cloud-fan pushed a commit that referenced this pull request May 26, 2020

Authored-by: Max Gekk <[email protected]>
Signed-off-by: Wenchen Fan <[email protected]>
(cherry picked from commit 87d34e6)
Signed-off-by: Wenchen Fan <[email protected]>
@MaxGekk MaxGekk deleted the parquet-files-update branch June 5, 2020 19:44