rdblue (Contributor) commented Oct 18, 2021

This fixes the Parquet map projection bug introduced by apache/parquet-java#798.

The projection code in Iceberg created map projections using the Parquet Types.map builder. However, that change altered the type the builder produces: the repeated key-value group was renamed from map to key_value, so the projection was no longer valid for Parquet files that use the old name. As a result, Parquet would not project the map column, and loading it would fail with an error like this:

Caused by: java.lang.IllegalArgumentException: [mapCol, map, key] required binary key (STRING) = 2 is not in the store: [] 1000
        at org.apache.iceberg.shaded.org.apache.parquet.hadoop.ColumnChunkPageReadStore.getPageReader(ColumnChunkPageReadStore.java:272)
        at org.apache.iceberg.parquet.ParquetValueReaders$PrimitiveReader.setPageSource(ParquetValueReaders.java:185)
        at org.apache.iceberg.parquet.ParquetValueReaders$RepeatedKeyValueReader.setPageSource(ParquetValueReaders.java:529)
        at org.apache.iceberg.parquet.ParquetValueReaders$StructReader.setPageSource(ParquetValueReaders.java:685)

The solution is to copy the map structure and ensure that the names are preserved rather than generated.
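
To make the failure mode concrete, here is a minimal, dependency-free sketch. It does not use the Parquet API or Iceberg's actual PruneColumns code; `Node`, `rebuildWithGeneratedName`, and `copyPreservingNames` are hypothetical stand-ins invented for illustration. It only shows why regenerating the repeated group's name produces column paths that are absent from the file, while copying the file's structure keeps them aligned.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class MapProjectionSketch {
  // Hypothetical stand-in for a Parquet type: a name plus child fields.
  public static final class Node {
    public final String name;
    public final List<Node> children;

    public Node(String name, Node... children) {
      this.name = name;
      this.children = Arrays.asList(children);
    }

    // Collects dotted leaf paths, e.g. "mapCol.map.key".
    public List<String> leafPaths() {
      List<String> paths = new ArrayList<>();
      collect("", paths);
      return paths;
    }

    private void collect(String prefix, List<String> paths) {
      String path = prefix.isEmpty() ? name : prefix + "." + name;
      if (children.isEmpty()) {
        paths.add(path);
      } else {
        for (Node child : children) {
          child.collect(path, paths);
        }
      }
    }
  }

  // Rebuilds the map with a builder-chosen name for the repeated group,
  // ignoring whatever name the file actually uses.
  public static Node rebuildWithGeneratedName(Node fileMap, String generatedName) {
    Node repeated = fileMap.children.get(0);
    Node[] keyValue = repeated.children.toArray(new Node[0]);
    return new Node(fileMap.name, new Node(generatedName, keyValue));
  }

  // Copies the map and keeps the repeated group's name as found in the file.
  public static Node copyPreservingNames(Node fileMap) {
    Node repeated = fileMap.children.get(0);
    Node[] keyValue = repeated.children.toArray(new Node[0]);
    return new Node(fileMap.name, new Node(repeated.name, keyValue));
  }

  public static void main(String[] args) {
    // A file whose repeated key-value group is named "map".
    Node fileMap = new Node("mapCol", new Node("map", new Node("key"), new Node("value")));
    List<String> filePaths = fileMap.leafPaths();

    Node regenerated = rebuildWithGeneratedName(fileMap, "key_value");
    Node copied = copyPreservingNames(fileMap);

    // The regenerated projection asks for paths the file does not contain.
    System.out.println(filePaths.containsAll(regenerated.leafPaths())); // false
    // The copied projection matches the file's paths.
    System.out.println(filePaths.containsAll(copied.leafPaths())); // true
  }
}
```

In this toy model, the regenerated projection requests `mapCol.key_value.key`, which the file does not have, mirroring the "is not in the store" error above.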

Closes #2962.

RussellSpitzer (Member) left a comment

LGTM. Is there a non-painful way we can add a test for this? It seems like testing it manually requires loading a Parquet version that defines map types with slightly different names.

rdblue (Contributor, Author) commented Oct 18, 2021

We can add a test for PruneColumns directly that uses an alternative map structure. The only issue is that I had some trouble building the map type without going through the Types API. But there's probably a different way to do it that I was missing.

rdblue added this to the Java 0.12.1 Release milestone Oct 18, 2021

rdblue (Contributor, Author) commented Oct 19, 2021

Added the missing tests. I'll merge this when tests are passing.

.addField(Types.primitive(PrimitiveTypeName.DOUBLE, Type.Repetition.REQUIRED).id(5).named("y"))
.addField(Types.primitive(PrimitiveTypeName.DOUBLE, Type.Repetition.REQUIRED).id(6).named("z"))
.id(3)
.named("value"))
Member

👍

Contributor

This is quite the type declaration. Love it.

kbendick (Contributor) left a comment

Great work @rdblue.

}

@Test
public void testListElementName() {
Contributor

Maybe testListElementDoesNotAssumeName?

Contributor Author

If I were making another change I'd probably update this, but I think it's pretty minor and I'd like to get this in to make it possible to release 0.12.1.

kbendick (Contributor) commented:

> Added the missing tests. I'll merge this when tests are passing.

These are great tests and this is great work. Thank you!

rdblue merged commit edc6985 into apache:master Oct 19, 2021
kbendick pushed a commit to kbendick/iceberg that referenced this pull request Oct 27, 2021
kbendick pushed a commit to kbendick/iceberg that referenced this pull request Oct 28, 2021
izchen pushed a commit to izchen/iceberg that referenced this pull request Dec 7, 2021
Initial-neko pushed a commit to Initial-neko/iceberg that referenced this pull request Dec 15, 2021
Initial-neko pushed a commit to Initial-neko/iceberg that referenced this pull request Dec 17, 2021
Successfully merging this pull request may close these issues.

Parquet 1.11.1 update causes regressions while reading iceberg data written with v1.11.0