Skip to content

Conversation

@h-vetinari
Copy link
Contributor

Trying to revive #23721 from @nandorKollar:

What changes were proposed in this pull request?

A new, more flexible logical type API was introduced in parquet-mr 1.11.0 (based on the the Thrift field in parquet-format available for a while). This change migrates from the old (now deprecated) enum-based OriginalType API to this new logical type API.

In addition to replacing the deprecated API calls, this PR also introduces support for reading the new subtypes for different timestamp semantics.

Since parquet-mr 1.11.0 is not yet released, this is tested against a release candidate. Before merging, the additional repository should be deleted from pom.xml, which can only be done once parquet-mr 1.11.0 is released.

How was this patch tested?

Unit tests were added to the PR.

I intentionally left the conflicts in the merge commit, so that it becomes clear how I've chosen (on a best effort basis...) to resolve them - this is obviously WIP.

Also, please note that this is my first PR for spark, so I'm probably in above my head, and happy to close this PR if desired (or take any advice).

nkollar and others added 26 commits March 25, 2019 10:25
ParquetPartitionDiscoverySuite failed when executed after ParquetInteroperabilitySuite using Maven
The reasion for that is, that ParquetInteroperabilitySuite changes the timezone in one test case,
but doesn't restore the original. This could be easily fixed by restoring the original timezone in
a finally block.
Parquet 1.11.0 is officially released, no need to use snapshot.
Conflicts:
    dev/deps/spark-deps-hadoop-2.7
    dev/deps/spark-deps-hadoop-3.2
    dev/run-tests.py
    pom.xml
    sql/core/src/main/java/org/apache/spark/sql/execution/datasources/parquet/VectorizedColumnReader.java
    sql/core/src/main/java/org/apache/spark/sql/execution/datasources/parquet/VectorizedParquetRecordReader.java
    sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetFileFormat.scala
    sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetFilters.scala
    sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetReadSupport.scala
    sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetRecordMaterializer.scala
    sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetRowConverter.scala
    sql/core/src/test/scala/org/apache/spark/sql/execution/benchmark/DataSourceReadBenchmark.scala
    sql/core/src/test/scala/org/apache/spark/sql/execution/columnar/InMemoryColumnarQuerySuite.scala
    sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/HadoopFsRelationSuite.scala
    sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetEncodingSuite.scala
    sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetFilterSuite.scala
    sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetIOSuite.scala
    sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetInteroperabilitySuite.scala
sessionLocalTz and convertTz were doing the same thing;
keep the version from master.
@srowen
Copy link
Member

srowen commented Mar 2, 2021

Jenkins test this please

@SparkQA
Copy link

SparkQA commented Mar 2, 2021

Kubernetes integration test unable to build dist.

exiting with code: 1
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/40249/

@h-vetinari
Copy link
Contributor Author

Hi @srowen, thanks for stopping by. :)

I think this'll need more work - there's a bunch of things that happened in between, not least the switch to Java 8+ datetime APIs, the addition of the timestamp rebasing, as well as convertTz, which seems to do something very similar to sessionLocalTz.

Happy to take any pointers you'd have.

@srowen
Copy link
Member

srowen commented Mar 2, 2021

Oh OK not sure myself. I see some tests to update to Parquet 1.12 already too

@github-actions
Copy link

We're closing this PR because it hasn't been updated in a while. This isn't a judgement on the merit of the PR in any way. It's just a way of keeping the PR queue manageable.
If you'd like to revive this PR, please reopen it and ask a committer to remove the Stale tag!

@github-actions github-actions bot added the Stale label Jun 11, 2021
@github-actions github-actions bot closed this Jun 12, 2021
@h-vetinari h-vetinari deleted the parquet_logical branch July 17, 2025 05:55
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants