-
Notifications
You must be signed in to change notification settings - Fork 29k
[SPARK-31806][SQL][TESTS] Check reading date/timestamp from legacy parquet: dictionary encoding, w/o Spark version #28630
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
…es-update # Conflicts: # sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetIOSuite.scala
|
Test build #123062 has finished for PR 28630 at commit
|
|
@cloud-fan @mswit-databricks @HyukjinKwon Please, review this PR. |
|
Test build #123075 has finished for PR 28630 at commit
|
| DateTimeTestUtils.withDefaultTimeZone(DateTimeTestUtils.LA) { | ||
| withSQLConf( | ||
| SQLConf.SESSION_LOCAL_TIMEZONE.key -> DateTimeTestUtils.LA.getId, | ||
| SQLConf.LEGACY_PARQUET_REBASE_MODE_IN_WRITE.key -> "CORRECTED") { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why do we need this? we only run this test in 2.x to generate files.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It doesn't affect on 2.4.5/2.4.6 but the default value throws exception in 3.0/master. I added it to debug the code in master otherwise I got an exception.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I removed the config setting
| save( | ||
| usTs, | ||
| "timestamp", | ||
| s"before_1582_timestamp_int96_plain_v$version.snappy.parquet", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why we have 2 files for plain and dictionary-encoding for int96? other types just have one file and 2 columns.
if it's caused by some parquet limitation, let's write a comment to explain it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
because INT96 always use dictionary encoding independent from number of values and theirs uniqueness. I have to explicitly turn off dictionary encoding while saving to parquet files, see the test above.
Other types don't have such "problem" - for one column parquet lib uses dict encoding because all values are unique, for another one it applies plain enc because all values in date/timestamp columns are the same.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I added a comment
|
Test build #123092 has finished for PR 28630 at commit
|
|
jenkins, retest this, please |
|
Test build #123097 has finished for PR 28630 at commit
|
|
thanks, merging to master/3.0! |
…rquet: dictionary encoding, w/o Spark version ### What changes were proposed in this pull request? 1. Add the following parquet files to the resource folder `sql/core/src/test/resources/test-data`: - Files saved by Spark 2.4.5 (cee4ecb) without meta info `org.apache.spark.version` - `before_1582_date_v2_4_5.snappy.parquet` with 2 date columns of the type **INT32 L:DATE** - `PLAIN` (8 date values of `1001-01-01`) and `PLAIN_DICTIONARY` (`1001-01-01`..`1001-01-08`). - `before_1582_timestamp_micros_v2_4_5.snappy.parquet` with 2 timestamp columns of the type **INT64 L:TIMESTAMP(MICROS,true)** - `PLAIN` (8 date values of `1001-01-01 01:02:03.123456`) and `PLAIN_DICTIONARY` (`1001-01-01 01:02:03.123456`..`1001-01-08 01:02:03.123456`). - `before_1582_timestamp_millis_v2_4_5.snappy.parquet` with 2 timestamp columns of the type **INT64 L:TIMESTAMP(MILLIS,true)** - `PLAIN` (8 date values of `1001-01-01 01:02:03.123`) and `PLAIN_DICTIONARY` (`1001-01-01 01:02:03.123`..`1001-01-08 01:02:03.123`). - `before_1582_timestamp_int96_plain_v2_4_5.snappy.parquet` with 2 timestamp columns of the type **INT96** - `PLAIN` (8 date values of `1001-01-01 01:02:03.123456`) and `PLAIN` (`1001-01-01 01:02:03.123456`..`1001-01-08 01:02:03.123456`). - `before_1582_timestamp_int96_dict_v2_4_5.snappy.parquet` with 2 timestamp columns of the type **INT96** - `PLAIN_DICTIONARY` (8 date values of `1001-01-01 01:02:03.123456`) and `PLAIN_DICTIONARY` (`1001-01-01 01:02:03.123456`..`1001-01-08 01:02:03.123456`). - Files saved by Spark 2.4.6-rc3 (570848d) with the meta info `org.apache.spark.version = 2.4.6`: - `before_1582_date_v2_4_6.snappy.parquet` replaces `before_1582_date_v2_4.snappy.parquet`. And it is similar to `before_1582_date_v2_4_5.snappy.parquet` except Spark version in parquet meta info. - `before_1582_timestamp_micros_v2_4_6.snappy.parquet` replaces `before_1582_timestamp_micros_v2_4.snappy.parquet`. And it is similar to `before_1582_timestamp_micros_v2_4_5.snappy.parquet` except meta info. - `before_1582_timestamp_millis_v2_4_6.snappy.parquet` replaces `before_1582_timestamp_millis_v2_4.snappy.parquet`. And it is similar to `before_1582_timestamp_millis_v2_4_5.snappy.parquet` except meta info. - `before_1582_timestamp_int96_plain_v2_4_6.snappy.parquet` is similar to `before_1582_timestamp_int96_dict_v2_4_5.snappy.parquet` except meta info. - `before_1582_timestamp_int96_dict_v2_4_6.snappy.parquet` replaces `before_1582_timestamp_int96_v2_4.snappy.parquet`. And it is similar to `before_1582_timestamp_int96_dict_v2_4_5.snappy.parquet` except meta info. 2. Add new test "generate test files for checking compatibility with Spark 2.4" to `ParquetIOSuite` (marked as ignored). The parquet files above were generated by this test. 3. Modified "SPARK-31159: compatibility with Spark 2.4 in reading dates/timestamps" in `ParquetIOSuite` to use new parquet files. ### Why are the changes needed? To improve test coverage. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? By running `ParquetIOSuite`. Closes #28630 from MaxGekk/parquet-files-update. Authored-by: Max Gekk <[email protected]> Signed-off-by: Wenchen Fan <[email protected]> (cherry picked from commit 7e4f5bb) Signed-off-by: Wenchen Fan <[email protected]>
|
+1 |
|
@HyukjinKwon @cloud-fan This build failure #28630 (comment) has been fixed by #28639 |
### What changes were proposed in this pull request? Modified formatting of expected timestamp strings in the test `JavaBeanDeserializationSuite`.`testSpark22000` to correctly format timestamps with **zero** seconds fraction. Current implementation outputs `.0` but must be empty string. From SPARK-31820 failure: - should be `2020-05-25 12:39:17` - but incorrect expected string is `2020-05-25 12:39:17.0` ### Why are the changes needed? To make `JavaBeanDeserializationSuite` stable, and avoid test failures like #28630 (comment) ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? I changed https://github.com/apache/spark/blob/7dff3b125de23a4d6ce834217ee08973b259414c/sql/core/src/test/java/test/org/apache/spark/sql/JavaBeanDeserializationSuite.java#L207 to ```java new java.sql.Timestamp((System.currentTimeMillis() / 1000) * 1000), ``` to force zero seconds fraction. Closes #28639 from MaxGekk/fix-JavaBeanDeserializationSuite. Authored-by: Max Gekk <[email protected]> Signed-off-by: Wenchen Fan <[email protected]>
### What changes were proposed in this pull request? Modified formatting of expected timestamp strings in the test `JavaBeanDeserializationSuite`.`testSpark22000` to correctly format timestamps with **zero** seconds fraction. Current implementation outputs `.0` but must be empty string. From SPARK-31820 failure: - should be `2020-05-25 12:39:17` - but incorrect expected string is `2020-05-25 12:39:17.0` ### Why are the changes needed? To make `JavaBeanDeserializationSuite` stable, and avoid test failures like #28630 (comment) ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? I changed https://github.com/apache/spark/blob/7dff3b125de23a4d6ce834217ee08973b259414c/sql/core/src/test/java/test/org/apache/spark/sql/JavaBeanDeserializationSuite.java#L207 to ```java new java.sql.Timestamp((System.currentTimeMillis() / 1000) * 1000), ``` to force zero seconds fraction. Closes #28639 from MaxGekk/fix-JavaBeanDeserializationSuite. Authored-by: Max Gekk <[email protected]> Signed-off-by: Wenchen Fan <[email protected]> (cherry picked from commit 87d34e6) Signed-off-by: Wenchen Fan <[email protected]>
What changes were proposed in this pull request?
sql/core/src/test/resources/test-data:org.apache.spark.versionbefore_1582_date_v2_4_5.snappy.parquetwith 2 date columns of the type INT32 L:DATE -PLAIN(8 date values of1001-01-01) andPLAIN_DICTIONARY(1001-01-01..1001-01-08).before_1582_timestamp_micros_v2_4_5.snappy.parquetwith 2 timestamp columns of the type INT64 L:TIMESTAMP(MICROS,true) -PLAIN(8 date values of1001-01-01 01:02:03.123456) andPLAIN_DICTIONARY(1001-01-01 01:02:03.123456..1001-01-08 01:02:03.123456).before_1582_timestamp_millis_v2_4_5.snappy.parquetwith 2 timestamp columns of the type INT64 L:TIMESTAMP(MILLIS,true) -PLAIN(8 date values of1001-01-01 01:02:03.123) andPLAIN_DICTIONARY(1001-01-01 01:02:03.123..1001-01-08 01:02:03.123).before_1582_timestamp_int96_plain_v2_4_5.snappy.parquetwith 2 timestamp columns of the type INT96 -PLAIN(8 date values of1001-01-01 01:02:03.123456) andPLAIN(1001-01-01 01:02:03.123456..1001-01-08 01:02:03.123456).before_1582_timestamp_int96_dict_v2_4_5.snappy.parquetwith 2 timestamp columns of the type INT96 -PLAIN_DICTIONARY(8 date values of1001-01-01 01:02:03.123456) andPLAIN_DICTIONARY(1001-01-01 01:02:03.123456..1001-01-08 01:02:03.123456).org.apache.spark.version = 2.4.6:before_1582_date_v2_4_6.snappy.parquetreplacesbefore_1582_date_v2_4.snappy.parquet. And it is similar tobefore_1582_date_v2_4_5.snappy.parquetexcept Spark version in parquet meta info.before_1582_timestamp_micros_v2_4_6.snappy.parquetreplacesbefore_1582_timestamp_micros_v2_4.snappy.parquet. And it is similar tobefore_1582_timestamp_micros_v2_4_5.snappy.parquetexcept meta info.before_1582_timestamp_millis_v2_4_6.snappy.parquetreplacesbefore_1582_timestamp_millis_v2_4.snappy.parquet. And it is similar tobefore_1582_timestamp_millis_v2_4_5.snappy.parquetexcept meta info.before_1582_timestamp_int96_plain_v2_4_6.snappy.parquetis similar tobefore_1582_timestamp_int96_dict_v2_4_5.snappy.parquetexcept meta info.before_1582_timestamp_int96_dict_v2_4_6.snappy.parquetreplacesbefore_1582_timestamp_int96_v2_4.snappy.parquet. And it is similar tobefore_1582_timestamp_int96_dict_v2_4_5.snappy.parquetexcept meta info.ParquetIOSuite(marked as ignored). The parquet files above were generated by this test.ParquetIOSuiteto use new parquet files.Why are the changes needed?
To improve test coverage.
Does this PR introduce any user-facing change?
No
How was this patch tested?
By running
ParquetIOSuite.