
Conversation

@RussellSpitzer
Member

Previously the Iceberg conversion functions for Parquet would throw an exception if
they encountered a binary list field represented internally as a repeated
primitive field that is not nested in another group type. This violated some expectations
within our schema conversion code.

We encountered this with a user who was using Parquet's AvroParquetWriter class to write Parquet files. The files, while readable by Hive and Spark, were not readable by Iceberg.

Investigating this, I found that the following Avro schema element caused the problem:

  String schema = "{\n" +
        "   \"type\":\"record\",\n" +
        "   \"name\":\"DbRecord\",\n" +
        "   \"namespace\":\"com.russ\",\n" +
        "   \"fields\":[\n" +
        "      {\n" +
        "         \"name\":\"foo\",\n" +
        "         \"type\":[\n" +
        "            \"null\",\n" +
        "            {\n" +
        "               \"type\":\"array\",\n" +
        "               \"items\":\"bytes\"\n" +
        "            }\n" +
        "         ],\n" +
        "         \"default\":null\n" +
        "      }\n" +
        "   ]\n" +
        "}";

Parquet would convert this element into

foo:
OPTIONAL F:1
.array: REPEATED BINARY R:1 D:2

This violates an assumption in Iceberg's reader, which expects list elements to be nested inside a group.
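For comparison, the standard 3-level list layout from the Parquet format spec, which Iceberg's converter expects, nests the repeated field inside an extra group. A sketch of the two layouts (field names illustrative, matching the dump above):

```
// 3-level layout expected by Iceberg
optional group foo (LIST) {
  repeated group list {
    optional binary element;
  }
}

// 2-level layout produced by parquet-avro's old list structure
optional group foo (LIST) {
  repeated binary array;
}
```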

Doing a quick test with

org.apache.avro.Schema.Parser parser = new org.apache.avro.Schema.Parser();
org.apache.avro.Schema avroSchema = parser.parse(schema);
AvroSchemaConverter converter = new AvroSchemaConverter();
MessageType parquetSchema = converter.convert(avroSchema);

I saw that this was reproducible with the current version of Parquet, and not just in our user's code.

To fix this, I added tests for this particular datatype and loosened some of the restrictions
in our Parquet schema parsing code.

@RussellSpitzer
Member Author

@aokolnychyi + @rdblue This is the issue I was talking about before

@aokolnychyi
Contributor

@rdblue, could you help on this one?

@RussellSpitzer RussellSpitzer force-pushed the FixParquetRepeatedBytes branch from bc87c79 to 40b55c5 Compare January 28, 2021 02:14
@rdblue
Contributor

rdblue commented Jan 28, 2021

This doesn't seem that bad. Can you add a test that writes a Parquet file with that schema and validates that Spark can read it? Maybe a TestMalformedParquet suite?

@RussellSpitzer
Member Author

@rdblue Just bad enough, I hope. I really have no idea what's going on here at a spec level; I'm just trying to match what I see as valid from the internal APIs :)

I'll add in another test

@RussellSpitzer
Member Author

RussellSpitzer commented Jan 29, 2021

@rdblue I have an additional issue :/

In SparkParquetReaders we call

      ColumnDescriptor desc = type.getColumnDescription(currentPath());

Which fails when traversing with

Arrived at primitive node, path invalid
org.apache.parquet.io.InvalidRecordException: Arrived at primitive node, path invalid
	at org.apache.parquet.schema.PrimitiveType.getMaxRepetitionLevel(PrimitiveType.java:665)
	at org.apache.parquet.schema.GroupType.getMaxRepetitionLevel(GroupType.java:294)
	at org.apache.parquet.schema.GroupType.getMaxRepetitionLevel(GroupType.java:294)
	at org.apache.parquet.schema.MessageType.getMaxRepetitionLevel(MessageType.java:77)
	at org.apache.parquet.schema.MessageType.getColumnDescription(MessageType.java:94)
	at org.apache.iceberg.spark.data.SparkParquetReaders$ReadBuilder.primitive(SparkParquetReaders.java:222)

I'm not sure what the "desc" is supposed to be here. There is no issue with writing the file or reading it back using the generic Parquet read path, but once I use the SparkParquetReader I hit this exception.

Do you have any ideas how this can be fixed?

@github-actions github-actions bot added the spark label Jan 29, 2021
@RussellSpitzer
Member Author

Figured out Array. I may have to write a test for a top-level repeated binary array as well :/

@RussellSpitzer
Member Author

RussellSpitzer commented Jan 29, 2021

malformed_parquet_not_txt.txt

This is the Parquet file generated by the test, in case anyone wants to take a look at what we are dealing with here using another framework.

Contributor

@rdblue rdblue left a comment


Looks good overall.

@RussellSpitzer
Member Author

Formatting fixes are in. @rdblue, thanks for looking over this, I really appreciate it.

@RussellSpitzer RussellSpitzer force-pushed the FixParquetRepeatedBytes branch from a832004 to d7e8373 Compare February 8, 2021 20:13
@rdblue
Contributor

rdblue commented Feb 8, 2021

I've been thinking about this more and I'm leaning toward working around it instead. I think the problem is that the Parquet/Avro writer uses the old list format by default to avoid breaking existing pipelines. But there should be an easy way to update the behavior to produce records that Iceberg accepts by setting parquet.avro.write-old-list-structure=false (the property defaults to true).

If we can fix it that way, then I think we should go with that. Otherwise, we're implementing only part of the backward-compatibility rules from Parquet. I'm not sure what the impact on compatibility would be if we implement the rules only partially, so the safer thing is to implement all of the backward-compatibility rules. But that's a bigger change and more to maintain (which is why we don't support 2-level lists in the first place). So I think the preferred solution is to avoid the problem instead.
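If we go the workaround route, writers built on parquet-avro can opt into the modern list layout by setting that property on the Hadoop configuration before building the writer. A minimal sketch, assuming parquet-avro and hadoop-common are on the classpath; the class name, output location, and schema variable are placeholders:

```java
import org.apache.avro.Schema;
import org.apache.avro.generic.GenericRecord;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.parquet.avro.AvroParquetWriter;
import org.apache.parquet.hadoop.ParquetWriter;

public class ModernListWriter {
  public static ParquetWriter<GenericRecord> open(Schema avroSchema, String location)
      throws java.io.IOException {
    Configuration conf = new Configuration();
    // Emit 3-level lists instead of the legacy 2-level repeated primitive,
    // so the resulting files are readable by Iceberg without any converter changes.
    conf.setBoolean("parquet.avro.write-old-list-structure", false);
    return AvroParquetWriter.<GenericRecord>builder(new Path(location))
        .withSchema(avroSchema)
        .withConf(conf)
        .build();
  }
}
```

The caveat rdblue notes still applies: flipping this property changes the on-disk layout, so existing pipelines that assume the old 2-level structure would need to be checked before enabling it.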
