Conversation

@maccamlc
Contributor

@maccamlc maccamlc commented Jun 23, 2020

  • Writing the UNKNOWN logical type into the schema breaks parsing
    the file with Apache Arrow
  • Instead, use the default of falling back to null when that
    backwards-compatibility-only logical type is present, but still
    write the original converted type

Is this something that could be included in a 1.11.1 release?

Make sure you have checked all steps below.

Jira

Tests

  • My PR adds the following unit tests OR does not need testing for this extremely good reason:

Added a new test, TestParquetMetadataConverter#testMapLogicalType, that verifies that when a Map schema is created, the logical type is left null when writing out the MAP_KEY_VALUE converted type.

This is because the LogicalType is required elsewhere for backwards compatibility, but when writing the schema to Thrift it needs to be left out: writing UNKNOWN, as happens now, breaks reading the file later with Apache Arrow in the Snowflake Cloud Database.
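As a rough illustration of the fix described above, here is a hedged, plain-Java sketch of the converted-type-to-logical-type mapping. The class, enums, and method names are invented for this example and are not the parquet-mr API; the point is only that the backwards-compatibility-only MAP_KEY_VALUE converted type maps to no logical type at all (null) rather than a placeholder such as UNKNOWN, which some readers reject.

```java
// Hypothetical, simplified model of the converted-type -> logical-type
// mapping; all names here are illustrative, not the parquet-mr API.
public class ConvertedTypeFallback {

    enum ConvertedType { MAP, LIST, MAP_KEY_VALUE, UTF8 }
    enum LogicalType { MAP, LIST, STRING }

    // MAP_KEY_VALUE exists only for backward compatibility, so it maps
    // to no logical type at all (null) rather than UNKNOWN; the caller
    // still writes the original converted type.
    static LogicalType toLogicalType(ConvertedType ct) {
        switch (ct) {
            case MAP:  return LogicalType.MAP;
            case LIST: return LogicalType.LIST;
            case UTF8: return LogicalType.STRING;
            case MAP_KEY_VALUE: return null; // fall back to null
            default:   return null;
        }
    }

    public static void main(String[] args) {
        System.out.println(toLogicalType(ConvertedType.MAP));           // MAP
        System.out.println(toLogicalType(ConvertedType.MAP_KEY_VALUE)); // null
    }
}
```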

Commits

  • My commits all reference Jira issues in their subject lines. In addition, my commits follow the guidelines from "How to write a good git commit message":
    1. Subject is separated from body by a blank line
    2. Subject is limited to 50 characters (not including Jira issue reference)
    3. Subject does not end with a period
    4. Subject uses the imperative mood ("add", not "adding")
    5. Body wraps at 72 characters
    6. Body explains "what" and "why", not "how"

Documentation

  • In case of new functionality, my PR adds documentation that describes how to use it.
    • All the public functions and the classes in the PR contain Javadoc that explain what it does

N/A

@maccamlc maccamlc force-pushed the PARQUET-1879_fix_mapkeyvalue_logical_type branch from 6a0bcf2 to e3d075a Compare June 23, 2020 12:21
Contributor

@gszadovszky gszadovszky left a comment

Thanks for working on this.

You have changed all the naming from "map" to "key_value" in the tests. This is good for the expected data, but we should keep testing "map" on the read path as well; based on the spec it is still acceptable.

I am not an expert in this topic, so I would be happy if someone else could also review this.

@maccamlc maccamlc force-pushed the PARQUET-1879_fix_mapkeyvalue_logical_type branch from e3d075a to cc9cc5f Compare June 25, 2020 03:59
@maccamlc
Contributor Author

@gszadovszky no problem. I have tried to add a test to verify the backwards-compatible reading. Added TestReadWriteMapKeyValue to the commit.

I am not sure if this is the correct way, but I parse one schema that goes down the logical-type path, with key_value and no MAP_KEY_VALUE type, and another with map and with the MAP_KEY_VALUE type.

From what I can tell, the name is not actually verified anywhere (I tried with a random name value too :) ), but both test paths pass.

Hopefully it's ok, but let me know if it might need to go a bit deeper somewhere else.

@maccamlc maccamlc requested a review from gszadovszky June 25, 2020 04:11
Contributor

@gszadovszky gszadovszky left a comment

Thank you for creating the backward compatibility test for Map. It should have existed already.
Unfortunately, this does not properly test backward compatibility. The problem is that you cannot generate an "old" file with the "new" library. To be more precise, the message parser is more of a convenience and is not used while reading/writing a parquet file. When you say you are testing the converted type, that is not really true, because the parser tries to read logical types in the first place. Also, the parquet writer writes both logical types and converted types, so you cannot validate old files that have only converted types.
I would suggest adding tests that cover the examples in the spec by creating the thrift-generated format objects and converting them with ParquetMetadataConverter, just like you did in TestParquetMetadataConverter.testMapLogicalType. Maybe these tests would fit better in that class as well.
I should have described this before you implemented this test. I am sorry about that.

Please don't force-push your changes, because it makes it harder to track the review. The committer will squash the PR before merging it anyway.
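The read-path behavior the reviewer describes (the parser prefers the new logical type and falls back to the legacy converted type, which is why only thrift-level objects can simulate an "old" file) can be sketched in plain Java. The SchemaElement stand-in and method names below are invented for illustration and are not the real thrift-generated classes.

```java
// Illustrative-only sketch (not the real thrift classes) of why reading
// must fall back from the new logical-type field to the old converted type.
import java.util.Optional;

public class SchemaElementFallback {

    // Stand-in for a thrift SchemaElement: an "old" (pre-1.11) file sets
    // only convertedType; a "new" file sets logicalType as well.
    static class SchemaElement {
        String convertedType; // e.g. "MAP", may be null
        String logicalType;   // e.g. "MAP", null in pre-1.11 files
    }

    // Readers prefer the logical type and fall back to the converted
    // type, so files written by old libraries remain readable.
    static Optional<String> resolveType(SchemaElement e) {
        if (e.logicalType != null) return Optional.of(e.logicalType);
        return Optional.ofNullable(e.convertedType);
    }

    public static void main(String[] args) {
        SchemaElement oldFile = new SchemaElement();
        oldFile.convertedType = "MAP";            // only the legacy field set

        SchemaElement newFile = new SchemaElement();
        newFile.convertedType = "MAP";
        newFile.logicalType = "MAP";

        System.out.println(resolveType(oldFile).get()); // MAP (via fallback)
        System.out.println(resolveType(newFile).get()); // MAP (via logical type)
    }
}
```

Since the normal writer populates both fields, a backward-compatibility test has to construct the "old-file" shape directly, which is what converting hand-built format objects through ParquetMetadataConverter achieves.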

@maccamlc maccamlc force-pushed the PARQUET-1879_fix_mapkeyvalue_logical_type branch 2 times, most recently from da1d657 to bbd7d65 Compare June 25, 2020 11:25
@maccamlc
Contributor Author

Apologies for the force push. Good to know that it gets squashed on merge.

And thanks for the detailed reply.

I think I got it this time :)

Tests were moved into TestParquetMetadataConverter, and for the old-format test the metadata is built through Thrift SchemaElements.

Regards,
Matt

@maccamlc maccamlc requested a review from gszadovszky June 25, 2020 11:28
@maccamlc maccamlc force-pushed the PARQUET-1879_fix_mapkeyvalue_logical_type branch from bbd7d65 to 7eddaec Compare June 25, 2020 11:30
Contributor

@gszadovszky gszadovszky left a comment

Thanks a lot. It looks good to me.

Let me wait a couple of days in case anyone else would like to review. (I hope so.)

@maccamlc
Contributor Author

maccamlc commented Jun 27, 2020

@gszadovszky before this gets merged, I just wanted to clarify something for myself after looking more into the format spec, which might tidy this issue up further.

  • Is MAP_KEY_VALUE required to still be written as the Converted Type when creating new files?

From what I could see in some older issues, such as PARQUET-335, and the backwards-compatibility rules, it seems to have always been an optional type, and also one used incorrectly in the past.

It appears that older versions of Parquet would be able to read the Map type in the schema without MAP_KEY_VALUE.

If that is true, I would probably suggest pushing this additional commit that I tested onto this PR.

It would mean that any unexpected uses of LogicalType.MAP_KEY_VALUE would result in UNKNOWN being written to the file. But it is removed from the ConversionPatterns path, meaning that my case of this occurring when converting an Avro schema is still fixed, and tested.

Let me know if you believe this might be the preferred fix, or if what I have already done is better.

From what I can see, it all depends on whether the MAP_KEY_VALUE type is required as an Original Type, or is ok being null for older readers.

Thanks
Matt

@gszadovszky
Contributor

@maccamlc,

The main problem, I think, is that the spec does not say anything about how the thrift objects shall be used. The specification is about the semantics of the schema, and it is described using the parquet schema language. But in the file there is no such language; we only have thrift objects.
When the specification says something about the logical types (e.g. MAP), it does not say which thrift structure should be used (the converted type MAP or the logical type MAP).
We added the new logical type structures in thrift to support enhanced ways of specifying logical types (e.g. TimeStampType). The idea for backward compatibility was to write the old converted types wherever it makes sense (where the semantics of the actual logical type are the same as before) along with the new logical type structures. So, related to MAP_KEY_VALUE, I think we shall write it in the correct place if it was written before (prior to 1.11.0), since it helps other readers, but not expect it to be there.

Cheers,
Gabor
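The writer-side policy Gabor describes can be sketched as follows. This is a hedged, illustrative model only (the method and key names are invented, not parquet-mr code): emit the legacy converted type for old readers alongside the new logical type where the semantics match, except where, as with MAP_KEY_VALUE, no matching logical type exists and that slot stays unset.

```java
// Hedged sketch of a dual-annotation writing policy; names are
// illustrative, not the parquet-mr API.
import java.util.LinkedHashMap;
import java.util.Map;

public class DualAnnotationWriter {

    // Returns the pair of annotations to write for a field; a null
    // value means "leave that field unset in the thrift SchemaElement".
    static Map<String, String> annotationsFor(String semanticType) {
        Map<String, String> out = new LinkedHashMap<>();
        switch (semanticType) {
            case "map":
                // Semantics unchanged since old versions: write both.
                out.put("convertedType", "MAP");
                out.put("logicalType", "MAP");
                break;
            case "map_key_value":
                // Written for pre-1.11 readers only; there is no
                // corresponding logical type, so that slot stays null.
                out.put("convertedType", "MAP_KEY_VALUE");
                out.put("logicalType", null);
                break;
            default:
                out.put("convertedType", null);
                out.put("logicalType", null);
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(annotationsFor("map"));
        System.out.println(annotationsFor("map_key_value"));
    }
}
```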

@maccamlc
Contributor Author

Sounds good @gszadovszky . Thanks for some clarification.

Therefore, depending on any other comments from other reviewers, it seems this PR is still ready to merge as-is :)

@gszadovszky gszadovszky merged commit 2589cc8 into apache:master Jul 6, 2020
gszadovszky pushed a commit that referenced this pull request Jul 6, 2020

(cherry picked from commit 2589cc8)
panthony pushed a commit to cogniteev/parquet-mr that referenced this pull request Sep 9, 2020

(cherry picked from commit 2589cc8)