[SPARK-26797][SQL][WIP][test-maven] Start using the new logical types API of Parquet 1.11.1 instead of the deprecated one #31685

h-vetinari · 2021-02-28T17:07:19Z

Trying to revive #23721 from @nandorKollar:

What changes were proposed in this pull request?

A new, more flexible logical type API was introduced in parquet-mr 1.11.0 (based on the Thrift field that has been available in parquet-format for a while). This change migrates from the old (now deprecated) enum-based OriginalType API to this new logical type API.

In addition to replacing the deprecated API calls, this PR also introduces support for reading the new subtypes for different timestamp semantics.
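For readers unfamiliar with the "different timestamp semantics": in the Parquet format a timestamp is flagged either as adjusted to UTC (an instant) or as not adjusted (a local wall-clock value). The following is only a minimal sketch of that distinction using the JDK's java.time classes. The class `TimestampSemantics`, its methods, and the `sessionZone` parameter are hypothetical names chosen for illustration; they are not taken from this PR or from parquet-mr.

```java
import java.time.Instant;
import java.time.LocalDateTime;
import java.time.ZoneId;
import java.time.ZoneOffset;
import java.time.temporal.ChronoUnit;

// Hypothetical illustration only; not code from this PR or from parquet-mr.
final class TimestampSemantics {

    // Decode a raw INT64 value holding microseconds since the Unix epoch.
    static Instant asInstant(long micros) {
        return Instant.EPOCH.plus(micros, ChronoUnit.MICROS);
    }

    // Render the raw value as a wall-clock date-time.
    // isAdjustedToUtc == true:  the value is an instant, so the wall clock
    //                           depends on the (assumed) session time zone.
    // isAdjustedToUtc == false: the value already is a wall-clock reading,
    //                           so the session time zone is ignored.
    static LocalDateTime render(long micros, boolean isAdjustedToUtc, ZoneId sessionZone) {
        ZoneId zone = isAdjustedToUtc ? sessionZone : ZoneOffset.UTC;
        return LocalDateTime.ofInstant(asInstant(micros), zone);
    }
}
```

Under these assumptions, the same stored value yields different wall-clock readings per session zone when it is UTC-adjusted, and the same reading everywhere when it is not.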

Since parquet-mr 1.11.0 is not yet released, this is tested against a release candidate. Before merging, the additional repository should be deleted from pom.xml, which can only be done once parquet-mr 1.11.0 is released.

How was this patch tested?

Unit tests were added to the PR.

I intentionally left the conflicts in the merge commit, so that it becomes clear how I've chosen (on a best effort basis...) to resolve them - this is obviously WIP.

Also, please note that this is my first PR for Spark, so I'm probably in over my head, and happy to close this PR if desired (or take any advice).

ParquetPartitionDiscoverySuite failed when executed after ParquetInteroperabilitySuite using Maven. The reason is that ParquetInteroperabilitySuite changes the timezone in one test case but doesn't restore the original. This can be fixed easily by restoring the original timezone in a finally block.
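The try/finally pattern described in this commit message can be sketched as follows. This is a stand-alone illustration, not the change made in ParquetInteroperabilitySuite; the class `TimeZoneRestore` and the helper `withDefaultTimeZone` are hypothetical names used only here.

```java
import java.util.TimeZone;

// Hypothetical illustration only; not the actual change in the test suite.
final class TimeZoneRestore {

    // Run `body` with a temporary JVM default time zone and restore the
    // previous default afterwards, even if `body` throws.
    static void withDefaultTimeZone(TimeZone temporary, Runnable body) {
        TimeZone original = TimeZone.getDefault();
        TimeZone.setDefault(temporary);
        try {
            body.run();
        } finally {
            // Restoring here keeps later suites from inheriting the change.
            TimeZone.setDefault(original);
        }
    }
}
```

Because the restore sits in the finally block, a failing assertion inside the body cannot leak the modified default into suites that run later in the same JVM.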

Parquet 1.11.0 is officially released, no need to use snapshot.

Conflicts:
  dev/deps/spark-deps-hadoop-2.7
  dev/deps/spark-deps-hadoop-3.2
  dev/run-tests.py
  pom.xml
  sql/core/src/main/java/org/apache/spark/sql/execution/datasources/parquet/VectorizedColumnReader.java
  sql/core/src/main/java/org/apache/spark/sql/execution/datasources/parquet/VectorizedParquetRecordReader.java
  sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetFileFormat.scala
  sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetFilters.scala
  sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetReadSupport.scala
  sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetRecordMaterializer.scala
  sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetRowConverter.scala
  sql/core/src/test/scala/org/apache/spark/sql/execution/benchmark/DataSourceReadBenchmark.scala
  sql/core/src/test/scala/org/apache/spark/sql/execution/columnar/InMemoryColumnarQuerySuite.scala
  sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/HadoopFsRelationSuite.scala
  sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetEncodingSuite.scala
  sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetFilterSuite.scala
  sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetIOSuite.scala
  sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetInteroperabilitySuite.scala

sessionLocalTz and convertTz were doing the same thing; keep the version from master.

srowen · 2021-03-02T14:56:32Z

Jenkins test this please

SparkQA · 2021-03-02T15:34:42Z

Kubernetes integration test unable to build dist.

exiting with code: 1
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/40249/

h-vetinari · 2021-03-02T16:35:27Z

Hi @srowen, thanks for stopping by. :)

I think this'll need more work - there's a bunch of things that happened in between, not least the switch to Java 8+ datetime APIs, the addition of the timestamp rebasing, as well as convertTz, which seems to do something very similar to sessionLocalTz.

Happy to take any pointers you'd have.

srowen · 2021-03-02T17:38:50Z

Oh OK not sure myself. I see some tests to update to Parquet 1.12 already too

github-actions · 2021-06-11T00:08:56Z

We're closing this PR because it hasn't been updated in a while. This isn't a judgement on the merit of the PR in any way. It's just a way of keeping the PR queue manageable.
If you'd like to revive this PR, please reopen it and ask a committer to remove the Stale tag!

nkollar and others added 26 commits March 25, 2019 10:25

Timezone adjustment

fb5df20

Generate file with epoch (called rawValue) value as id

f90d63e

Fix predicate pushdown - equals

82a06bf

refactor

1b67c87

Parquet pushdown - without lt and lteq predicates

dc8c3bb

Fix lt and lteq predicates with timestamps

73e431b

Fix style issues, remove whitespaces

b6fe694

Fix failing tests

5fa5769

Upgrade to Parquet 1.11.0 release candidate

cb6f06c

Fix name

4514386

Address code review comments

d93ecb8

Add staging repository to use parquet-mr 1.11.0 RC3

21af765

Update manifest

8d4b06c

Fix failing tests, address code review comments

fbe3039

Use direct snapshot url

4deff14

Change updatePolicy to always

afc7564

remove duplicated repository

2998df2

Use 1.12.0-SNAPSHOT

440a9b3

update deps file

6e95803

fix manifest file

9a39876

purge Parquet artifacts from local repository

ca9bdc3

line too long

f8ecac1

fix python style error

c9a50e4

Remove snapshot repo

9dbf152

Parquet 1.11.0 is officially released, no need to use snapshot.

h-vetinari marked this pull request as draft February 28, 2021 17:07

github-actions bot added BUILD SQL labels Feb 28, 2021

h-vetinari mentioned this pull request Feb 28, 2021

[SPARK-26797][SQL][WIP][test-maven] Start using the new logical types API of Parquet 1.11.0 instead of the deprecated one #23721

Closed

h-vetinari force-pushed the parquet_logical branch from 695bb97 to 687e44b Compare March 1, 2021 17:52

resolve merge conflicts, enough to compile at least

06aaa64

sessionLocalTz and convertTz were doing the same thing; keep the version from master.

h-vetinari force-pushed the parquet_logical branch from e42ade0 to 06aaa64 Compare March 1, 2021 22:15

sunchao mentioned this pull request Mar 29, 2021

[SPARK-34661][SQL] Clean up OriginalType and DecimalMetadata usage in Parquet related code #31776

Closed

github-actions bot added the Stale label Jun 11, 2021

github-actions bot closed this Jun 12, 2021

h-vetinari deleted the parquet_logical branch July 17, 2025 05:55
