
Conversation

@nandorKollar
Contributor

@nandorKollar nandorKollar commented Feb 1, 2019

What changes were proposed in this pull request?

A new, more flexible logical type API was introduced in parquet-mr 1.11.0 (based on the Thrift field that has been available in parquet-format for a while). This change migrates from the old (now deprecated) enum-based OriginalType API to the new logical type API.

In addition to replacing the deprecated API calls, this PR also introduces support for reading the new subtypes for different timestamp semantics.
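
To make the shift concrete, here is a rough, stdlib-only Python model of the difference in shape (the names are illustrative, not the parquet-mr API): the old OriginalType is a flat enum with one value per annotation, while the new logical types are parameterized, which is what makes the different timestamp semantics representable.

```python
from dataclasses import dataclass
from enum import Enum

# Old style: a flat enum, one value per annotation, no parameters.
# There is no way to express e.g. nanosecond precision or local
# (not-UTC-adjusted) semantics as a new enum value without breaking readers.
class OriginalType(Enum):
    UTF8 = "UTF8"
    TIMESTAMP_MILLIS = "TIMESTAMP_MILLIS"
    TIMESTAMP_MICROS = "TIMESTAMP_MICROS"

# New style: parameterized logical types. A single timestamp type
# carries a unit and a UTC-adjustment flag, so instant semantics
# (adjusted to UTC) and local semantics are distinct values.
class TimeUnit(Enum):
    MILLIS = "MILLIS"
    MICROS = "MICROS"
    NANOS = "NANOS"

@dataclass(frozen=True)
class TimestampLogicalType:
    is_adjusted_to_utc: bool
    unit: TimeUnit

instant_micros = TimestampLogicalType(True, TimeUnit.MICROS)
local_micros = TimestampLogicalType(False, TimeUnit.MICROS)
```

The two timestamp values above differ only in semantics, which the old enum could not distinguish.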

Since parquet-mr 1.11.0 is not yet released, this is tested against a release candidate. Before merging, the additional repository should be deleted from pom.xml, which can only be done once parquet-mr 1.11.0 is released.

How was this patch tested?

Unit tests were added to the PR.

@nandorKollar
Contributor Author

Thanks @attilapiros for the review, I'll address the issues you found soon.

@nandorKollar
Contributor Author

Thanks @attilapiros for the review, fixed your findings.

@squito
Contributor

squito commented Feb 5, 2019

Can you briefly mention how backwards compatibility will work after these changes? I assume this is somehow handled within Parquet itself -- just a pointer to the relevant info would help.

Also, I assume I shouldn't bother triggering tests yet, since automated builds will fail without a published version of Parquet?

@nandorKollar
Contributor Author

@dongjoon-hyun @squito could you please review this PR?

@nandorKollar
Contributor Author

@squito parquet-mr 1.11.0 writes both the old and the new logical type (converted_type and logicalType) in the Thrift schema, so old readers (which know only about converted_type) are able to read the annotation as long as there's a corresponding converted_type for the logicalType. Parquet-mr handles this conversion internally. Every legacy converted_type has a corresponding logicalType, but since converted_type is deprecated, newly introduced logicalTypes might not have a corresponding converted_type (for example, a timestamp with nanosecond precision has none). In that case old readers will just see the physical type.

As for reading old files where the new logical types are not present in the schema, only converted_type is taken into account, and parquet-mr takes care of the conversion to the logical type representation internally. The conversion rules between original_types and logicalTypes are documented in parquet-format. Does this answer your question?
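
The forward-compatibility rule described above can be sketched in a few lines of self-contained Python (the names and string forms here are made up for illustration; the real mapping lives in parquet-mr and parquet-format):

```python
# New writers emit both annotations when a legacy equivalent exists.
# Old readers understand only converted_type; where the map holds None,
# they fall back to the raw physical type.
CONVERTED_TYPE_FOR = {
    "STRING": "UTF8",
    "TIMESTAMP(MILLIS, utc=true)": "TIMESTAMP_MILLIS",
    "TIMESTAMP(MICROS, utc=true)": "TIMESTAMP_MICROS",
    "TIMESTAMP(NANOS, utc=true)": None,  # no legacy converted_type exists
}

def annotations_written(logical_type: str):
    """The (logicalType, converted_type) pair a 1.11.0-style writer emits."""
    return logical_type, CONVERTED_TYPE_FOR.get(logical_type)

def seen_by_old_reader(logical_type: str):
    """What a converted_type-only reader can recover; None means it only
    sees the physical type."""
    return CONVERTED_TYPE_FOR.get(logical_type)
```

For example, `seen_by_old_reader("TIMESTAMP(NANOS, utc=true)")` is `None`, matching the point above that nanosecond timestamps have no converted_type and are visible to legacy readers only as the physical type.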

@dongjoon-hyun
Member

ok to test.

@dongjoon-hyun
Member

Thank you for pinging me, @nandorKollar . Let's ping @rdblue since this is Parquet 1.11.0.

@SparkQA

SparkQA commented Feb 8, 2019

Test build #102094 has finished for PR 23721 at commit b9ced55.

  • This patch fails build dependency tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
  • sealed trait MLEvent extends SparkListenerEvent
  • case class TransformStart() extends MLEvent
  • case class TransformEnd() extends MLEvent
  • case class FitStart[M <: Model[M]]() extends MLEvent
  • case class FitEnd[M <: Model[M]]() extends MLEvent
  • case class LoadInstanceStart[T](path: String) extends MLEvent
  • case class LoadInstanceEnd[T]() extends MLEvent
  • case class SaveInstanceStart(path: String) extends MLEvent
  • case class SaveInstanceEnd(path: String) extends MLEvent
  • class _DynamicModuleFuncGlobals(dict):
  • class LogisticRegressionModel(JavaModel, JavaClassificationModel, JavaMLWritable, JavaMLReadable,
  • class GaussianMixtureModel(JavaModel, JavaMLWritable, JavaMLReadable, HasTrainingSummary):
  • class KMeansModel(JavaModel, GeneralJavaMLWritable, JavaMLReadable, HasTrainingSummary):
  • class BisectingKMeansModel(JavaModel, JavaMLWritable, JavaMLReadable, HasTrainingSummary):
  • class LinearRegressionModel(JavaModel, JavaPredictionModel, GeneralJavaMLWritable, JavaMLReadable,
  • class HasTrainingSummary(object):

@attilapiros
Contributor

@nandorKollar I would try to rebase this PR on top of master (as you have not touched the above classes).

So update your fork from the official Spark repository, then execute this on your branch:

$ git rebase origin/master

@SparkQA

SparkQA commented Feb 8, 2019

Test build #102097 has finished for PR 23721 at commit 4d8fc37.

  • This patch fails build dependency tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Feb 8, 2019

Test build #102098 has finished for PR 23721 at commit 5a3df02.

  • This patch fails to build.
  • This patch merges cleanly.
  • This patch adds no public classes.

@nandorKollar nandorKollar changed the title [SPARK-26797][SQL][WIP] Start using the new logical types API of Parquet 1.11.0 instead of the deprecated one [SPARK-26797][SQL][WIP][test-maven] Start using the new logical types API of Parquet 1.11.0 instead of the deprecated one Feb 8, 2019
@nandorKollar
Contributor Author

Retest this please.

@attilapiros
Contributor

retest this please

@SparkQA

SparkQA commented Feb 8, 2019

Test build #102102 has finished for PR 23721 at commit 5a3df02.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@rdblue
Contributor

rdblue commented Feb 8, 2019

Thanks for pinging me, @dongjoon-hyun. I'd definitely like to review this but probably not before 1.11.0 is released. Until then, I think time is better spent validating the release.


Hi, @nandorKollar . Please remove this file because this is for Spark 3.0.

@SparkQA

SparkQA commented Mar 25, 2019

Test build #103917 has finished for PR 23721 at commit ca9bdc3.

  • This patch fails Python style tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Mar 25, 2019

Test build #103918 has finished for PR 23721 at commit f8ecac1.

  • This patch fails Python style tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@squito
Contributor

squito commented Mar 25, 2019

Sorry @nandorKollar, these are real Python style failures. You can run dev/lint-python locally:

pycodestyle checks failed:
./dev/run-tests.py:298:1: E101 indentation contains mixed spaces and tabs
./dev/run-tests.py:298:1: W191 indentation contains tabs
./dev/run-tests.py:298:2: E128 continuation line under-indented for visual indent
./dev/run-tests.py:299:1: E101 indentation contains mixed spaces and tabs

@nandorKollar
Contributor Author

Thanks @squito, I hope I fixed every Python error.

@SparkQA

SparkQA commented Mar 25, 2019

Test build #103919 has finished for PR 23721 at commit c9a50e4.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@nandorKollar
Contributor Author

Tests passed with parquet-mr 1.11.0 release candidate, pinging @squito @rdblue @dongjoon-hyun, @HyukjinKwon for review.

@felixcheung
Member

That's great, but honestly we can't merge this until Parquet 1.11 is officially released.

@nandorKollar
Contributor Author

@felixcheung yes, I know, that's why it is still tagged WIP. I opened the PR before the official release because I thought I could get some useful feedback in the meantime.

@HyukjinKwon HyukjinKwon left a comment

Not a big deal but sql/core/src/test/resources/test-data/timestamp_dictionary.parq - should it be timestamp_dictionary.parquet?

@HyukjinKwon
Member

cc @cloud-fan, @wangyum, and @liancheng FYI

@gatorsmile
Member

Is anybody using Parquet 1.11 in their production systems? What is the major motivation for upgrading the Parquet version right after a new version is released?

@zivanfi

zivanfi commented Apr 18, 2019

Parquet 1.11 is not released yet, but the release candidate provides several new features, improvements and bugfixes, most notably:

  • A new (but backwards- and forwards-compatible) logical type system that allows representing timestamps with different semantics.
  • Column indexes, which allow pinpointing the pages matching query conditions, leading to significant performance improvements for highly selective queries.
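
The column-index idea can be sketched in a few lines of Python: each page of a column chunk stores min/max statistics, and a reader skips any page whose range cannot intersect the predicate. This is a simplified model, not Parquet's actual index format.

```python
# Each page of a column chunk keeps min/max statistics.
pages = [
    {"min": 0,   "max": 99},
    {"min": 100, "max": 199},
    {"min": 200, "max": 299},
]

def pages_to_read(pages, lo, hi):
    """Indexes of pages whose [min, max] range intersects [lo, hi]."""
    return [i for i, p in enumerate(pages)
            if p["max"] >= lo and p["min"] <= hi]
```

For a highly selective predicate such as `value == 150` (so `lo == hi == 150`), only the middle page needs to be read; without the page-level statistics the whole row group would be decoded.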

@nandorKollar
Contributor Author

I recently found another reason for upgrading the Parquet version to 1.11.0 (once released): due to PARQUET-1472, when reading decimals backed by a fixed-size byte array physical type, the dictionary filter could incorrectly drop row groups, silently returning wrong results. As of now, the fix for this problem is only present in 1.11.0.
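
To illustrate the failure mode with a simplified Python model (not the actual parquet-mr code or the exact PARQUET-1472 mechanics): dictionary filtering keeps a row group only if some dictionary entry can satisfy the predicate, so decoding the fixed-size binary values incorrectly can make every entry look non-matching and silently drop a row group that actually contains matches.

```python
from decimal import Decimal

def row_group_can_match(dictionary, decode, predicate):
    """Keep the row group iff some dictionary entry satisfies the predicate."""
    return any(predicate(decode(raw)) for raw in dictionary)

# A decimal with scale 2, stored as big-endian unscaled bytes.
def decode_ok(raw: bytes) -> Decimal:
    return Decimal(int.from_bytes(raw, "big", signed=True)) / 100

def decode_buggy(raw: bytes) -> Decimal:
    # Hypothetical bug: forgets to apply the scale.
    return Decimal(int.from_bytes(raw, "big", signed=True))

dictionary = [bytes([0, 123])]  # unscaled 123 -> decimal value 1.23

def wants_1_23(v):
    return v == Decimal("1.23")
```

With `decode_ok` the row group is kept; with `decode_buggy` it is wrongly dropped even though it contains a matching value, which is how such a bug can silently return wrong results.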

@felixcheung
Member

what's the status of parquet 1.11?

@wangyum
Member

wangyum commented Nov 13, 2019

retest this please

@SparkQA

SparkQA commented Nov 13, 2019

Test build #113713 has finished for PR 23721 at commit c9a50e4.

  • This patch fails build dependency tests.
  • This patch does not merge cleanly.
  • This patch adds no public classes.

Parquet 1.11.0 is officially released, no need to use the snapshot.

@SparkQA

SparkQA commented Jan 3, 2020

Test build #116098 has finished for PR 23721 at commit 9dbf152.

  • This patch fails build dependency tests.
  • This patch does not merge cleanly.
  • This patch adds no public classes.

@h-vetinari
Contributor

What's the current status of this PR?

Parquet 1.11 has been released, but Spark is still waiting for 1.11.1. Incidentally, the Parquet 1.11.1 release is waiting for Spark feedback on the snapshot. Maybe that could be done here?

@github-actions

We're closing this PR because it hasn't been updated in a while. This isn't a judgement on the merit of the PR in any way. It's just a way of keeping the PR queue manageable.
If you'd like to revive this PR, please reopen it and ask a committer to remove the Stale tag!

@github-actions github-actions bot added the Stale label May 31, 2020
@github-actions github-actions bot closed this Jun 1, 2020
@h-vetinari
Contributor

@nandorKollar
Should this PR be revived? Would you mind if I open a new one based on your changes? I tried rebasing your branch on master, but there's a whole lot of conflicts, so before I try to resolve them, I wanted to ask what the status is here.

CC @dongjoon-hyun @wangyum

@sunchao
Member

sunchao commented Feb 28, 2021

+1. The Parquet community (format/mr/cpp, etc.) has moved from the original enum-based converted type to the new union-based logical type with richer metadata, so it'd be great to see Spark adopt it too.

@h-vetinari
Contributor

Attempt to carry this PR in #31685.
