[SPARK-26797][SQL][WIP][test-maven] Start using the new logical types API of Parquet 1.11.0 instead of the deprecated one #23721
Conversation
Thanks @attilapiros for the review, I'll address the issues you found soon.
Thanks @attilapiros for the review, I've fixed the issues you found.
Can you briefly mention how backwards compatibility will work after these changes? I assume this is somehow handled within Parquet itself -- just a pointer to the relevant info would help. Also, I assume I shouldn't bother triggering tests yet, as automated builds will fail without a published version of Parquet?
@dongjoon-hyun @squito could you please review this PR?
@squito parquet-mr 1.11.0 writes both the old and the new logical type annotations (converted_type and logicalType) in the Thrift schema, so old readers (which only know about converted_type) are able to read the annotation as long as there is a converted_type corresponding to the logicalType. Parquet-mr handles this conversion internally. Every legacy converted_type has a corresponding logicalType, but since converted_types are deprecated, newly introduced logicalTypes might not have a corresponding converted_type (for example, a timestamp with nanosecond precision has none). In that case old readers just see the physical type. As for reading old files where the new logical types are not present in the schema, only converted_type is taken into account, and parquet-mr takes care of the conversion to the logical type representation internally. The conversion rules between converted_types and logicalTypes are documented in parquet-format. Does this answer your question?
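The fallback behaviour described above can be sketched as a toy model. This is plain Python, not the actual parquet-mr API; all names are illustrative. It captures the two rules: every legacy converted_type has a logicalType counterpart, while some new logicalTypes (e.g. nanosecond timestamps) have no converted_type, so an old reader falls back to the physical type for those columns.

```python
# Toy model of the converted_type / logicalType compatibility rules.
# Names are illustrative, not the real parquet-mr API.

# A parquet-mr 1.11.0 writer emits both annotations whenever a legacy
# converted_type exists for the logical type being written.
LOGICAL_TO_CONVERTED = {
    "TIMESTAMP(MILLIS,utc=true)": "TIMESTAMP_MILLIS",
    "TIMESTAMP(MICROS,utc=true)": "TIMESTAMP_MICROS",
    "STRING": "UTF8",
    # Newly introduced logical types may have no legacy counterpart:
    "TIMESTAMP(NANOS,utc=true)": None,
}

def annotation_seen_by_old_reader(logical_type: str, physical_type: str) -> str:
    """An old reader only understands converted_type; when the written
    logicalType has no converted_type counterpart, it just sees the
    underlying physical type."""
    converted = LOGICAL_TO_CONVERTED.get(logical_type)
    return converted if converted is not None else physical_type

print(annotation_seen_by_old_reader("TIMESTAMP(MICROS,utc=true)", "INT64"))  # TIMESTAMP_MICROS
print(annotation_seen_by_old_reader("TIMESTAMP(NANOS,utc=true)", "INT64"))   # INT64
```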
ok to test.
Thank you for pinging me, @nandorKollar. Let's ping @rdblue since this is Parquet 1.11.0.
Test build #102094 has finished for PR 23721 at commit
@nandorKollar I would try to rebase this PR on top of master (as you have not touched the above classes). So update your fork with the official Spark repo, then execute this on your branch:
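The exact snippet from this comment was not captured in the excerpt. A self-contained sketch of the suggested workflow follows, with throwaway local repositories standing in for apache/spark and the contributor's fork; all paths, branch names, and identities are hypothetical.

```shell
# Hypothetical rebase workflow using scratch repos (not the real remotes).
set -e
work=$(mktemp -d)

# Stand-in for the official apache/spark repository.
git init -q "$work/upstream"
cd "$work/upstream"
git symbolic-ref HEAD refs/heads/master   # pin the branch name
git config user.email you@example.com
git config user.name you
echo v1 > README && git add README && git commit -qm 'upstream v1'

# The contributor's fork, with a PR branch that has fallen behind.
git clone -q "$work/upstream" "$work/fork"
cd "$work/fork"
git config user.email you@example.com
git config user.name you
git checkout -qb SPARK-26797              # the PR branch
echo patch > change.txt && git add change.txt && git commit -qm 'PR change'

# Upstream moves on in the meantime.
cd "$work/upstream"
echo v2 >> README && git add README && git commit -qm 'upstream v2'

# Update the fork's view of upstream, then replay the PR branch on top.
cd "$work/fork"
git fetch -q origin
git rebase -q origin/master
```

After the rebase, the PR branch contains both upstream commits plus the replayed PR commit; in the real workflow the branch would then be force-pushed to the fork.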
Test build #102097 has finished for PR 23721 at commit
Test build #102098 has finished for PR 23721 at commit
Retest this please.
retest this please
Test build #102102 has finished for PR 23721 at commit
Thanks for pinging me, @dongjoon-hyun. I'd definitely like to review this but probably not before 1.11.0 is released. Until then, I think time is better spent validating the release. |
dev/deps/spark-deps-hadoop-2.6
Hi, @nandorKollar. Please remove this file because this is for Spark 3.0.
Test build #103917 has finished for PR 23721 at commit
Test build #103918 has finished for PR 23721 at commit
Sorry @nandorKollar, these are real Python style failures. You can run
Thanks @squito, I hope I fixed every Python error.
Test build #103919 has finished for PR 23721 at commit
Tests passed with the parquet-mr 1.11.0 release candidate; pinging @squito @rdblue @dongjoon-hyun @HyukjinKwon for review.
That's great, but honestly we can't merge this until Parquet 1.11 is released officially.
@felixcheung yes, I know, that's why it is still tagged as WIP. I opened the PR before the official release because I thought I could get some useful feedback until it gets released.
HyukjinKwon left a comment:
Not a big deal but sql/core/src/test/resources/test-data/timestamp_dictionary.parq - should it be timestamp_dictionary.parquet?
cc @cloud-fan, @wangyum, and @liancheng FYI
Is anybody using Parquet 1.11 in their production systems? What is the major motivation to upgrade the Parquet version just after a new version of Parquet is released?
Parquet 1.11 is not released yet, but the release candidate provides several new features, improvements, and bugfixes, most notably:
I recently found another reason for upgrading the Parquet version to 1.11.0 (once released): due to PARQUET-1472, when reading decimals whose underlying physical type is a fixed-size byte array, the dictionary filter could incorrectly drop row groups, silently giving back wrong results. The fix for this problem is only present in 1.11.0 as of now.
What's the status of Parquet 1.11?
retest this please
Test build #113713 has finished for PR 23721 at commit
Parquet 1.11.0 is officially released, so there is no need to use a snapshot.
Test build #116098 has finished for PR 23721 at commit
We're closing this PR because it hasn't been updated in a while. This isn't a judgement on the merit of the PR in any way; it's just a way of keeping the PR queue manageable.
@nandorKollar
+1. The Parquet community (format/mr/cpp etc.) has moved from the original enum-based converted type to the new union-based logical type with richer metadata, so it'd be great to see Spark adopt that too.
An attempt to carry this PR forward is in #31685.
What changes were proposed in this pull request?
A new, more flexible logical type API was introduced in parquet-mr 1.11.0 (based on the Thrift field that has been available in parquet-format for a while). This change migrates from the old (now deprecated) enum-based OriginalType API to the new logical type API.
In addition to replacing the deprecated API calls, this PR also introduces support for reading the new subtypes for different timestamp semantics.
Since parquet-mr 1.11.0 is not yet released, this is tested against a release candidate. Before merging, the additional repository should be deleted from pom.xml, which can only be done once parquet-mr 1.11.0 is released.
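Unlike the old converted_type, which only distinguished TIMESTAMP_MILLIS from TIMESTAMP_MICROS, the new timestamp annotation carries an explicit time unit (MILLIS/MICROS/NANOS) plus an isAdjustedToUTC flag. A minimal sketch of normalizing the different units to microsecond precision, which is what Spark's TimestampType uses internally; this is plain illustrative Python, not Spark's actual implementation, and the function name is hypothetical:

```python
# Illustrative sketch: normalize raw INT64 timestamp values carrying
# different logical-type time units down to microseconds since the epoch.

def to_spark_micros(raw_value: int, unit: str) -> int:
    """Convert a raw timestamp value in the given unit to microseconds."""
    if unit == "MILLIS":
        return raw_value * 1_000
    if unit == "MICROS":
        return raw_value
    if unit == "NANOS":
        return raw_value // 1_000  # truncates sub-microsecond precision
    raise ValueError(f"unknown time unit: {unit}")

print(to_spark_micros(1_550_000_000_123, "MILLIS"))  # 1550000000123000
```

Note that the nanosecond case loses precision on conversion, which is one reason nanosecond timestamps need explicit handling rather than reuse of the old two-valued converted_type scheme.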
How was this patch tested?
Unit tests were added to the PR.