
@MaxGekk
Member

@MaxGekk MaxGekk commented May 24, 2020

What changes were proposed in this pull request?

  1. Add the following parquet files to the resource folder sql/core/src/test/resources/test-data:
    • Files saved by Spark 2.4.5 (cee4ecb) without meta info org.apache.spark.version
      • before_1582_date_v2_4_5.snappy.parquet with 2 date columns of the type INT32 L:DATE - PLAIN (8 date values of 1001-01-01) and PLAIN_DICTIONARY (1001-01-01..1001-01-08).
      • before_1582_timestamp_micros_v2_4_5.snappy.parquet with 2 timestamp columns of the type INT64 L:TIMESTAMP(MICROS,true) - PLAIN (8 timestamp values of 1001-01-01 01:02:03.123456) and PLAIN_DICTIONARY (1001-01-01 01:02:03.123456..1001-01-08 01:02:03.123456).
      • before_1582_timestamp_millis_v2_4_5.snappy.parquet with 2 timestamp columns of the type INT64 L:TIMESTAMP(MILLIS,true) - PLAIN (8 timestamp values of 1001-01-01 01:02:03.123) and PLAIN_DICTIONARY (1001-01-01 01:02:03.123..1001-01-08 01:02:03.123).
      • before_1582_timestamp_int96_plain_v2_4_5.snappy.parquet with 2 timestamp columns of the type INT96 - PLAIN (8 timestamp values of 1001-01-01 01:02:03.123456) and PLAIN (1001-01-01 01:02:03.123456..1001-01-08 01:02:03.123456).
      • before_1582_timestamp_int96_dict_v2_4_5.snappy.parquet with 2 timestamp columns of the type INT96 - PLAIN_DICTIONARY (8 timestamp values of 1001-01-01 01:02:03.123456) and PLAIN_DICTIONARY (1001-01-01 01:02:03.123456..1001-01-08 01:02:03.123456).
    • Files saved by Spark 2.4.6-rc3 (570848d) with the meta info org.apache.spark.version = 2.4.6:
      • before_1582_date_v2_4_6.snappy.parquet replaces before_1582_date_v2_4.snappy.parquet. It is similar to before_1582_date_v2_4_5.snappy.parquet except for the Spark version in the parquet meta info.
      • before_1582_timestamp_micros_v2_4_6.snappy.parquet replaces before_1582_timestamp_micros_v2_4.snappy.parquet. It is similar to before_1582_timestamp_micros_v2_4_5.snappy.parquet except for the meta info.
      • before_1582_timestamp_millis_v2_4_6.snappy.parquet replaces before_1582_timestamp_millis_v2_4.snappy.parquet. It is similar to before_1582_timestamp_millis_v2_4_5.snappy.parquet except for the meta info.
      • before_1582_timestamp_int96_plain_v2_4_6.snappy.parquet is similar to before_1582_timestamp_int96_plain_v2_4_5.snappy.parquet except for the meta info.
      • before_1582_timestamp_int96_dict_v2_4_6.snappy.parquet replaces before_1582_timestamp_int96_v2_4.snappy.parquet. It is similar to before_1582_timestamp_int96_dict_v2_4_5.snappy.parquet except for the meta info.
  2. Add a new test, "generate test files for checking compatibility with Spark 2.4", to ParquetIOSuite (marked as ignored). The parquet files above were generated by this test.
  3. Modify "SPARK-31159: compatibility with Spark 2.4 in reading dates/timestamps" in ParquetIOSuite to use the new parquet files.
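For context, the calendar shift these files exercise can be reproduced with the JDK alone. The sketch below is not part of the PR; it only illustrates the hybrid Julian/Gregorian vs. proleptic Gregorian mismatch (using UTC for determinism) that makes the epoch day Spark 2.4 stored for 1001-01-01 read back as 1001-01-07 without rebasing:

```java
import java.time.LocalDate;
import java.util.GregorianCalendar;
import java.util.TimeZone;

public class RebaseDemo {
    public static void main(String[] args) {
        // Spark 2.4 resolved dates with java.util's hybrid Julian/Gregorian calendar.
        GregorianCalendar hybrid = new GregorianCalendar(TimeZone.getTimeZone("UTC"));
        hybrid.clear();
        hybrid.set(1001, 0, 1); // 1001-01-01 in the hybrid calendar
        long hybridEpochDay = hybrid.getTimeInMillis() / 86_400_000L;

        // Spark 3.0 interprets the stored epoch day with the proleptic Gregorian
        // calendar (java.time), so the same number maps to a shifted local date.
        System.out.println(LocalDate.ofEpochDay(hybridEpochDay)); // 1001-01-07
    }
}
```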

Why are the changes needed?

To improve test coverage.

Does this PR introduce any user-facing change?

No

How was this patch tested?

By running ParquetIOSuite.

@MaxGekk MaxGekk changed the title [WIP][SPARK-31806][SQL][TESTS] Check reading date/timestamp from Parquet: dictionary encoding, w/ Spark version [WIP][SPARK-31806][SQL][TESTS] Check reading date/timestamp from legacy parquet: dictionary encoding, w/ Spark version May 24, 2020
@MaxGekk MaxGekk changed the title [WIP][SPARK-31806][SQL][TESTS] Check reading date/timestamp from legacy parquet: dictionary encoding, w/ Spark version [WIP][SPARK-31806][SQL][TESTS] Check reading date/timestamp from legacy parquet: dictionary encoding, w/o Spark version May 24, 2020
@SparkQA

SparkQA commented May 24, 2020

Test build #123062 has finished for PR 28630 at commit b0e4a32.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@MaxGekk
Member Author

MaxGekk commented May 25, 2020

@cloud-fan @mswit-databricks @HyukjinKwon Please, review this PR.

@MaxGekk MaxGekk changed the title [WIP][SPARK-31806][SQL][TESTS] Check reading date/timestamp from legacy parquet: dictionary encoding, w/o Spark version [SPARK-31806][SQL][TESTS] Check reading date/timestamp from legacy parquet: dictionary encoding, w/o Spark version May 25, 2020
@SparkQA

SparkQA commented May 25, 2020

Test build #123075 has finished for PR 28630 at commit 0add1a2.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

```scala
DateTimeTestUtils.withDefaultTimeZone(DateTimeTestUtils.LA) {
  withSQLConf(
    SQLConf.SESSION_LOCAL_TIMEZONE.key -> DateTimeTestUtils.LA.getId,
    SQLConf.LEGACY_PARQUET_REBASE_MODE_IN_WRITE.key -> "CORRECTED") {
```
Contributor


Why do we need this? We only run this test in 2.x to generate the files.

Member Author


It doesn't affect 2.4.5/2.4.6, but the default value causes an exception in 3.0/master. I added it to be able to debug the code in master; otherwise I got an exception.

Member Author


I removed the config setting

```scala
save(
  usTs,
  "timestamp",
  s"before_1582_timestamp_int96_plain_v$version.snappy.parquet",
```
Contributor

@cloud-fan cloud-fan May 25, 2020


Why do we have 2 files for plain and dictionary encoding for int96? Other types just have one file with 2 columns.

If it's caused by some parquet limitation, let's write a comment to explain it.

Member Author


Because INT96 always uses dictionary encoding, independent of the number of values and their uniqueness, I have to explicitly turn off dictionary encoding while saving to parquet files; see the test above.

Other types don't have such a "problem": for one column the parquet lib uses dictionary encoding because all values are unique, and for the other one it applies plain encoding because all values in the date/timestamp columns are the same.
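For context on why INT96 is special here: Spark writes INT96 timestamps as a fixed 12-byte value, nanoseconds of day followed by a Julian day number, in little-endian order. The sketch below is illustrative only (not code from this PR; the class and method names are made up) and packs a timestamp into that representation:

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

public class Int96Layout {
    // Pack a timestamp into Parquet's INT96 representation:
    // 8 bytes nanoseconds-of-day followed by 4 bytes Julian day, little-endian.
    static byte[] toInt96(int julianDay, long nanosOfDay) {
        ByteBuffer buf = ByteBuffer.allocate(12).order(ByteOrder.LITTLE_ENDIAN);
        buf.putLong(nanosOfDay);
        buf.putInt(julianDay);
        return buf.array();
    }

    public static void main(String[] args) {
        // 2451545 is the Julian day number of 2000-01-01;
        // 3_723_123_456_000 ns is 01:02:03.123456 as nanoseconds of day.
        byte[] bytes = toInt96(2451545, 3_723_123_456_000L);
        ByteBuffer buf = ByteBuffer.wrap(bytes).order(ByteOrder.LITTLE_ENDIAN);
        System.out.println(buf.getLong() + " ns, Julian day " + buf.getInt());
    }
}
```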

Member Author


I added a comment

@SparkQA

SparkQA commented May 25, 2020

Test build #123092 has finished for PR 28630 at commit 3f2b474.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@MaxGekk
Member Author

MaxGekk commented May 25, 2020

jenkins, retest this, please

@SparkQA

SparkQA commented May 26, 2020

Test build #123097 has finished for PR 28630 at commit 3f2b474.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@cloud-fan
Contributor

thanks, merging to master/3.0!

@cloud-fan cloud-fan closed this in 7e4f5bb May 26, 2020
cloud-fan pushed a commit that referenced this pull request May 26, 2020
…rquet: dictionary encoding, w/o Spark version


Closes #28630 from MaxGekk/parquet-files-update.

Authored-by: Max Gekk <[email protected]>
Signed-off-by: Wenchen Fan <[email protected]>
(cherry picked from commit 7e4f5bb)
Signed-off-by: Wenchen Fan <[email protected]>
@HyukjinKwon
Member

+1

@MaxGekk
Member Author

MaxGekk commented May 26, 2020

@HyukjinKwon @cloud-fan This build failure #28630 (comment) has been fixed by #28639

cloud-fan pushed a commit that referenced this pull request May 26, 2020
### What changes were proposed in this pull request?
Modified formatting of expected timestamp strings in the test `JavaBeanDeserializationSuite`.`testSpark22000` to correctly format timestamps with a **zero** seconds fraction. The current implementation outputs `.0`, but it must be an empty string. From the SPARK-31820 failure:
- should be `2020-05-25 12:39:17`
- but the incorrect expected string is `2020-05-25 12:39:17.0`

### Why are the changes needed?
To make `JavaBeanDeserializationSuite` stable, and avoid test failures like #28630 (comment)

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
I changed https://github.com/apache/spark/blob/7dff3b125de23a4d6ce834217ee08973b259414c/sql/core/src/test/java/test/org/apache/spark/sql/JavaBeanDeserializationSuite.java#L207 to
```java
new java.sql.Timestamp((System.currentTimeMillis() / 1000) * 1000),
```
to force zero seconds fraction.

Closes #28639 from MaxGekk/fix-JavaBeanDeserializationSuite.

Authored-by: Max Gekk <[email protected]>
Signed-off-by: Wenchen Fan <[email protected]>
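The `.0` suffix described in the commit message comes from `java.sql.Timestamp.toString()`, which always prints a fractional part even when it is zero. A minimal standalone reproduction (illustrative only; the class name is made up, not from the patch):

```java
import java.sql.Timestamp;

public class FractionDemo {
    public static void main(String[] args) {
        // Timestamp.toString() always emits a fractional part, even when it
        // is zero, so a formatter that drops ".0" disagrees with it.
        Timestamp ts = Timestamp.valueOf("2020-05-25 12:39:17");
        System.out.println(ts); // 2020-05-25 12:39:17.0

        // Truncating to whole seconds reproduces the flaky case from SPARK-31820.
        Timestamp truncated = new Timestamp((ts.getTime() / 1000) * 1000);
        System.out.println(truncated.equals(ts)); // true
    }
}
```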
cloud-fan pushed a commit that referenced this pull request May 26, 2020

Authored-by: Max Gekk <[email protected]>
Signed-off-by: Wenchen Fan <[email protected]>
(cherry picked from commit 87d34e6)
Signed-off-by: Wenchen Fan <[email protected]>
@MaxGekk MaxGekk deleted the parquet-files-update branch June 5, 2020 19:44