Skip to content

Use REQUIRED repetition level for MAP keys in parquet writer#12808

Merged
findepi merged 2 commits intotrinodb:masterfrom
raunaqmorarka:map-key-required
Jun 14, 2022
Merged

Use REQUIRED repetition level for MAP keys in parquet writer#12808
findepi merged 2 commits intotrinodb:masterfrom
raunaqmorarka:map-key-required

Conversation

@raunaqmorarka
Copy link
Copy Markdown
Member

As per parquet spec, MAP key should be REQUIRED

Is this change a fix, improvement, new feature, refactoring, or other?

Fix

Is this a change to the core query engine, a connector, client library, or the SPI interfaces? (be specific)

Optimized parquet writer

How would you describe this change to a non-technical end user or system administrator?

Parquet files produced by optimized writer will comply with spec at https://github.com/apache/parquet-format/blob/master/LogicalTypes.md#maps

Documentation

(x) No documentation is needed.
( ) Sufficient documentation is included in this PR.
( ) Documentation PR is available with #prnumber.
( ) Documentation issue #issuenumber is filed, and can be handled later.

Release notes

(x) No release notes entries required.
( ) Release notes entries required with the following suggested text:

Copy link
Copy Markdown
Member

@skrzypo987 skrzypo987 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@findepi
Copy link
Copy Markdown
Member

findepi commented Jun 13, 2022

Does this affect any connector? how?

@raunaqmorarka
Copy link
Copy Markdown
Member Author

raunaqmorarka commented Jun 13, 2022

Does this affect any connector? how?

It affects the parquet schema produced by hive and delta lake connectors, makes it compliant with the parquet spec defined for MAP type at https://github.com/apache/parquet-format/blob/master/LogicalTypes.md#maps.
Parquet reader and writer could in theory use this information to optimise reading and writing of definition levels in case a type is REQUIRED, but that's an implementation detail.
It could also be used by the writer to enforce that REQUIRED map key is actually non-null while writing the data.

@findepi
Copy link
Copy Markdown
Member

findepi commented Jun 13, 2022

makes it compliant with the parquet spec defined for MAP type at https://github.com/apache/parquet-format/blob/master/LogicalTypes.md#maps.

I agree it is better to adhere to the Parquet spec rather than not to.

OTOH, in #12658 @raunaqmorarka explicitly departed from Parquet spec to fix compatibility bug that manifested in real-life scenarios. This suggests the real-life Parquet readers or writers not always are 100% spec compliant.

So, what are the real-world consequences of this change?

@skrzypo987
Copy link
Copy Markdown
Member

skrzypo987 commented Jun 13, 2022

real-life Parquet readers or writers not always are 100% spec compliant.

They are always not compliant.

BTW, what does OTOH stands for?

@alexjo2144
Copy link
Copy Markdown
Member

what does OTOH stands for?

"On the other hand"

@findepi
Copy link
Copy Markdown
Member

findepi commented Jun 13, 2022

So, what are the real-world consequences of this change?

I need to know this before i can reasonably approve or disapprove the change.

@raunaqmorarka
Copy link
Copy Markdown
Member Author

So, what are the real-world consequences of this change?

I need to know this before i can reasonably approve or disapprove the change.

I didn't see an issue with hive or spark compatibility when testing it.
Hive's schema convertor also follows similar convention https://github.com/apache/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/io/parquet/convert/HiveSchemaConverter.java#L174
Same with Spark's schema convertor https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetSchemaConverter.scala#L675
It doesn't look like there was any particular reason behind why Trino code was deviating from the norm here. We should be consistent with spec and other engines until there is reason to do something differently.

@findepi
Copy link
Copy Markdown
Member

findepi commented Jun 14, 2022

Okay, so we want to do this to adhere better to the spec, which is a good thing.
We also believe it has no impact on end users, since in no real world circumstances they will notice.

Thanks for clarifying. I will tag as no-rn then.

@findepi findepi added the no-release-notes This pull request does not require release notes entry label Jun 14, 2022
@findepi findepi merged commit 3247bd2 into trinodb:master Jun 14, 2022
@raunaqmorarka raunaqmorarka deleted the map-key-required branch June 14, 2022 07:53
@github-actions github-actions bot added this to the 386 milestone Jun 14, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

cla-signed no-release-notes This pull request does not require release notes entry

Development

Successfully merging this pull request may close these issues.

4 participants