
Conversation

@nandorKollar
Contributor

@nandorKollar nandorKollar commented Feb 1, 2019

What changes were proposed in this pull request?

A new, more flexible logical type API was introduced in parquet-mr 1.11.0 (based on the Thrift field that has been available in parquet-format for a while). This change migrates from the old (now deprecated) enum-based OriginalType API to the new logical type API.

In addition to replacing the deprecated API calls, this PR also introduces support for reading the new subtypes for different timestamp semantics.
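
To make the shift concrete, here is a rough, stdlib-only Python model of the difference in shape (the names are illustrative, not the parquet-mr API): the old OriginalType is a flat enum with one value per annotation, while the new logical types are parameterized, which is what makes the different timestamp semantics representable.

```python
from dataclasses import dataclass
from enum import Enum

# Old style: a flat enum, one value per annotation, no parameters.
# There is no way to express e.g. nanosecond precision or local
# (not-UTC-adjusted) semantics as a new enum value without breaking readers.
class OriginalType(Enum):
    UTF8 = "UTF8"
    TIMESTAMP_MILLIS = "TIMESTAMP_MILLIS"
    TIMESTAMP_MICROS = "TIMESTAMP_MICROS"

# New style: parameterized logical types. A single timestamp type
# carries a unit and a UTC-adjustment flag, so instant semantics
# (adjusted to UTC) and local semantics are distinct values.
class TimeUnit(Enum):
    MILLIS = "MILLIS"
    MICROS = "MICROS"
    NANOS = "NANOS"

@dataclass(frozen=True)
class TimestampLogicalType:
    is_adjusted_to_utc: bool
    unit: TimeUnit

instant_micros = TimestampLogicalType(True, TimeUnit.MICROS)
local_micros = TimestampLogicalType(False, TimeUnit.MICROS)
```

The two timestamp values above differ only in semantics, which the old enum could not distinguish.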

Since parquet-mr 1.11.0 is not yet released, this is tested against a release candidate. Before merging, the additional repository should be deleted from pom.xml, which can only be done once parquet-mr 1.11.0 is released.

How was this patch tested?

Unit tests were added to the PR.

@nandorKollar
Contributor Author

Thanks @attilapiros for the review, I'll address the issues you found soon.

@nandorKollar
Contributor Author

Thanks @attilapiros for the review, fixed your findings.

@squito
Contributor

squito commented Feb 5, 2019

Can you briefly mention how backwards compatibility will work after these changes? I assume this is somehow handled within Parquet itself -- just a pointer to the relevant info would help.

Also, I assume I shouldn't bother triggering tests yet, since automated builds will fail without a published version of Parquet?

@nandorKollar
Contributor Author

@dongjoon-hyun @squito could you please review this PR?

@nandorKollar
Contributor Author

@squito parquet-mr 1.11.0 writes both the old and the new logical type (converted_type and logicalType) in the Thrift schema, so old readers (which know only about converted_type) are able to read the annotation as long as there's a corresponding converted_type for the logicalType. Parquet-mr handles this conversion internally. Every legacy converted_type has a corresponding logicalType, but since converted_type is deprecated, newly introduced logicalTypes might not have a corresponding converted_type (for example, a timestamp with nanosecond precision has none). In that case old readers will just see the physical type.

As for reading old files where the new logical types are not present in the schema, only converted_type is taken into account, and parquet-mr takes care of the conversion to the logical type representation internally. The conversion rules between original_types and logicalTypes are documented in parquet-format. Does this answer your question?
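
The forward-compatibility rule described above can be sketched in a few lines of self-contained Python (the names and string forms here are made up for illustration; the real mapping lives in parquet-mr and parquet-format):

```python
# New writers emit both annotations when a legacy equivalent exists.
# Old readers understand only converted_type; where the map holds None,
# they fall back to the raw physical type.
CONVERTED_TYPE_FOR = {
    "STRING": "UTF8",
    "TIMESTAMP(MILLIS, utc=true)": "TIMESTAMP_MILLIS",
    "TIMESTAMP(MICROS, utc=true)": "TIMESTAMP_MICROS",
    "TIMESTAMP(NANOS, utc=true)": None,  # no legacy converted_type exists
}

def annotations_written(logical_type: str):
    """The (logicalType, converted_type) pair a 1.11.0-style writer emits."""
    return logical_type, CONVERTED_TYPE_FOR.get(logical_type)

def seen_by_old_reader(logical_type: str):
    """What a converted_type-only reader can recover; None means it only
    sees the physical type."""
    return CONVERTED_TYPE_FOR.get(logical_type)
```

For example, `seen_by_old_reader("TIMESTAMP(NANOS, utc=true)")` is `None`, matching the point above that nanosecond timestamps have no converted_type and are visible to legacy readers only as the physical type.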

@dongjoon-hyun
Member

ok to test.

@dongjoon-hyun
Member

Thank you for pinging me, @nandorKollar . Let's ping @rdblue since this is Parquet 1.11.0.

@SparkQA

SparkQA commented Feb 8, 2019

Test build #102094 has finished for PR 23721 at commit b9ced55.

  • This patch fails build dependency tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
  • sealed trait MLEvent extends SparkListenerEvent
  • case class TransformStart() extends MLEvent
  • case class TransformEnd() extends MLEvent
  • case class FitStart[M <: Model[M]]() extends MLEvent
  • case class FitEnd[M <: Model[M]]() extends MLEvent
  • case class LoadInstanceStart[T](path: String) extends MLEvent
  • case class LoadInstanceEnd[T]() extends MLEvent
  • case class SaveInstanceStart(path: String) extends MLEvent
  • case class SaveInstanceEnd(path: String) extends MLEvent
  • class _DynamicModuleFuncGlobals(dict):
  • class LogisticRegressionModel(JavaModel, JavaClassificationModel, JavaMLWritable, JavaMLReadable,
  • class GaussianMixtureModel(JavaModel, JavaMLWritable, JavaMLReadable, HasTrainingSummary):
  • class KMeansModel(JavaModel, GeneralJavaMLWritable, JavaMLReadable, HasTrainingSummary):
  • class BisectingKMeansModel(JavaModel, JavaMLWritable, JavaMLReadable, HasTrainingSummary):
  • class LinearRegressionModel(JavaModel, JavaPredictionModel, GeneralJavaMLWritable, JavaMLReadable,
  • class HasTrainingSummary(object):

@attilapiros
Contributor

@nandorKollar I would try to rebase this PR on top of master (as you have not touched the above classes).

So update your fork from the official Spark repository, then execute this on your branch:

$ git rebase origin/master

@SparkQA

SparkQA commented Feb 8, 2019

Test build #102097 has finished for PR 23721 at commit 4d8fc37.

  • This patch fails build dependency tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Feb 8, 2019

Test build #102098 has finished for PR 23721 at commit 5a3df02.

  • This patch fails to build.
  • This patch merges cleanly.
  • This patch adds no public classes.

@nandorKollar nandorKollar changed the title [SPARK-26797][SQL][WIP] Start using the new logical types API of Parquet 1.11.0 instead of the deprecated one [SPARK-26797][SQL][WIP][test-maven] Start using the new logical types API of Parquet 1.11.0 instead of the deprecated one Feb 8, 2019
@nandorKollar
Contributor Author

Retest this please.

@attilapiros
Contributor

retest this please

@SparkQA

SparkQA commented Feb 8, 2019

Test build #102102 has finished for PR 23721 at commit 5a3df02.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@rdblue
Contributor

rdblue commented Feb 8, 2019

Thanks for pinging me, @dongjoon-hyun. I'd definitely like to review this but probably not before 1.11.0 is released. Until then, I think time is better spent validating the release.


Hi, @nandorKollar . Please remove this file because this is for Spark 3.0.

@SparkQA

SparkQA commented Mar 25, 2019

Test build #103917 has finished for PR 23721 at commit ca9bdc3.

  • This patch fails Python style tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Mar 25, 2019

Test build #103918 has finished for PR 23721 at commit f8ecac1.

  • This patch fails Python style tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@squito
Contributor

squito commented Mar 25, 2019

Sorry @nandorKollar, these are real Python style failures. You can run dev/lint-python locally:

pycodestyle checks failed:
./dev/run-tests.py:298:1: E101 indentation contains mixed spaces and tabs
./dev/run-tests.py:298:1: W191 indentation contains tabs
./dev/run-tests.py:298:2: E128 continuation line under-indented for visual indent
./dev/run-tests.py:299:1: E101 indentation contains mixed spaces and tabs

@nandorKollar
Contributor Author

Thanks @squito, I hope I fixed every Python error.

@SparkQA

SparkQA commented Mar 25, 2019

Test build #103919 has finished for PR 23721 at commit c9a50e4.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@nandorKollar
Contributor Author

Tests passed with parquet-mr 1.11.0 release candidate, pinging @squito @rdblue @dongjoon-hyun, @HyukjinKwon for review.

@felixcheung
Member

That's great, but honestly we can't merge this until Parquet 1.11 is officially released.

@nandorKollar
Contributor Author

@felixcheung yes, I know, that's why it is still tagged WIP. I opened the PR before the official release because I thought I could get some useful feedback in the meantime.

@HyukjinKwon HyukjinKwon left a comment

Not a big deal but sql/core/src/test/resources/test-data/timestamp_dictionary.parq - should it be timestamp_dictionary.parquet?

@HyukjinKwon
Member

cc @cloud-fan, @wangyum, and @liancheng FYI

@gatorsmile
Member

Is anybody using Parquet 1.11 in their production systems? What is the major motivation for upgrading the Parquet version right after a new version is released?

@zivanfi

zivanfi commented Apr 18, 2019

Parquet 1.11 is not released yet, but the release candidate provides several new features, improvements and bugfixes, most notably:

  • A new (but backwards- and forwards-compatible) logical type system that allows representing timestamps with different semantics.
  • Column indexes, which allow pinpointing the pages matching query conditions, leading to significant performance improvements for highly selective queries.
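
The column-index idea can be sketched in a few lines of Python: each page of a column chunk stores min/max statistics, and a reader skips any page whose range cannot intersect the predicate. This is a simplified model, not Parquet's actual index format.

```python
# Each page of a column chunk keeps min/max statistics.
pages = [
    {"min": 0,   "max": 99},
    {"min": 100, "max": 199},
    {"min": 200, "max": 299},
]

def pages_to_read(pages, lo, hi):
    """Indexes of pages whose [min, max] range intersects [lo, hi]."""
    return [i for i, p in enumerate(pages)
            if p["max"] >= lo and p["min"] <= hi]
```

For a highly selective predicate such as `value == 150` (so `lo == hi == 150`), only the middle page needs to be read; without the page-level statistics the whole row group would be decoded.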

@nandorKollar
Contributor Author

I recently found another reason for upgrading the Parquet version to 1.11.0 (once released): due to PARQUET-1472, when reading decimals backed by a fixed-size byte array physical type, the dictionary filter could incorrectly drop row groups, silently returning wrong results. As of now, the fix for this problem is only present in 1.11.0.
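
To illustrate the failure mode with a simplified Python model (not the actual parquet-mr code or the exact PARQUET-1472 mechanics): dictionary filtering keeps a row group only if some dictionary entry can satisfy the predicate, so decoding the fixed-size binary values incorrectly can make every entry look non-matching and silently drop a row group that actually contains matches.

```python
from decimal import Decimal

def row_group_can_match(dictionary, decode, predicate):
    """Keep the row group iff some dictionary entry satisfies the predicate."""
    return any(predicate(decode(raw)) for raw in dictionary)

# A decimal with scale 2, stored as big-endian unscaled bytes.
def decode_ok(raw: bytes) -> Decimal:
    return Decimal(int.from_bytes(raw, "big", signed=True)) / 100

def decode_buggy(raw: bytes) -> Decimal:
    # Hypothetical bug: forgets to apply the scale.
    return Decimal(int.from_bytes(raw, "big", signed=True))

dictionary = [bytes([0, 123])]  # unscaled 123 -> decimal value 1.23

def wants_1_23(v):
    return v == Decimal("1.23")
```

With `decode_ok` the row group is kept; with `decode_buggy` it is wrongly dropped even though it contains a matching value, which is how such a bug can silently return wrong results.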

@felixcheung
Member

what's the status of parquet 1.11?

@wangyum
Member

wangyum commented Nov 13, 2019

retest this please

@SparkQA

SparkQA commented Nov 13, 2019

Test build #113713 has finished for PR 23721 at commit c9a50e4.

  • This patch fails build dependency tests.
  • This patch does not merge cleanly.
  • This patch adds no public classes.

Parquet 1.11.0 is officially released, no need to use the snapshot.

@SparkQA

SparkQA commented Jan 3, 2020

Test build #116098 has finished for PR 23721 at commit 9dbf152.

  • This patch fails build dependency tests.
  • This patch does not merge cleanly.
  • This patch adds no public classes.

@h-vetinari
Contributor

What's the current status of this PR?

Parquet 1.11 has been released, but Spark is still waiting for 1.11.1. Incidentally, the Parquet 1.11.1 release is waiting for Spark feedback on the snapshot. Maybe that could be done here?

@github-actions

We're closing this PR because it hasn't been updated in a while. This isn't a judgement on the merit of the PR in any way. It's just a way of keeping the PR queue manageable.
If you'd like to revive this PR, please reopen it and ask a committer to remove the Stale tag!

@github-actions github-actions bot added the Stale label May 31, 2020
@github-actions github-actions bot closed this Jun 1, 2020
@h-vetinari
Contributor

@nandorKollar
Should this PR be revived? Would you mind if I open a new one based on your changes? I tried rebasing your branch on master, but there's a whole lot of conflicts, so before I try to resolve them, I wanted to ask what the status is here.

CC @dongjoon-hyun @wangyum

@sunchao
Member

sunchao commented Feb 28, 2021

+1. The Parquet community (format/mr/cpp, etc.) has moved from the original enum-based converted type to the new union-based logical type with richer metadata, so it'd be great to see Spark adopt it too.

@h-vetinari
Contributor

Attempt to carry this PR in #31685.
