
Conversation

@RussellSpitzer
Member

Previously the Iceberg conversion functions for Parquet would throw an exception if
they encountered a binary list field represented internally as a repeated
primitive field that is not nested in another group type. This violated some expectations
within our schema conversion code.

We encountered this with a user who was using Parquet's AvroParquetWriter class to write Parquet files. The files, while readable by Hive and Spark, were not readable by Iceberg.

Investigating this, I found that the following Avro schema element caused the problem:

  String schema = "{\n" +
        "   \"type\":\"record\",\n" +
        "   \"name\":\"DbRecord\",\n" +
        "   \"namespace\":\"com.russ\",\n" +
        "   \"fields\":[\n" +
        "      {\n" +
        "         \"name\":\"foo\",\n" +
        "         \"type\":[\n" +
        "            \"null\",\n" +
        "            {\n" +
        "               \"type\":\"array\",\n" +
        "               \"items\":\"bytes\"\n" +
        "            }\n" +
        "         ],\n" +
        "         \"default\":null\n" +
        "      }\n" +
        "   ]\n" +
        "}";

Parquet would convert this element into

foo:
OPTIONAL F:1
.array: REPEATED BINARY R:1 D:2

This violates an assumption in Iceberg's reader, which expects list elements to be nested inside a group.
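For comparison, the standard 3-level list layout from the Parquet format spec, which Iceberg's converter expects, nests the repeated field inside an extra group. A sketch of the two layouts (field names illustrative, matching the dump above):

```
// 3-level layout expected by Iceberg
optional group foo (LIST) {
  repeated group list {
    optional binary element;
  }
}

// 2-level layout produced by parquet-avro's old list structure
optional group foo (LIST) {
  repeated binary array;
}
```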

Doing a quick test with

org.apache.avro.Schema.Parser parser = new org.apache.avro.Schema.Parser();
org.apache.avro.Schema avroSchema = parser.parse(schema);
AvroSchemaConverter converter = new AvroSchemaConverter();
MessageType parquetSchema = converter.convert(avroSchema);

I saw that this was reproducible with the current version of Parquet, and not just in our user's code.

To fix this, I added tests for this particular datatype and loosened some of the restrictions
in our Parquet schema parsing code.

@RussellSpitzer
Member Author

@aokolnychyi + @rdblue This is the issue I was talking about before

@aokolnychyi
Contributor

@rdblue, could you help on this one?

@RussellSpitzer RussellSpitzer force-pushed the FixParquetRepeatedBytes branch from bc87c79 to 40b55c5 Compare January 28, 2021 02:14
@rdblue
Contributor

rdblue commented Jan 28, 2021

This doesn't seem that bad. Can you add a test that writes a Parquet file with that schema and validates that Spark can read it? Maybe a TestMalformedParquet suite?

@RussellSpitzer
Member Author

@rdblue Just bad enough, I hope. I really have no idea what's going on here at a spec level; I'm just trying to match what I see as valid from the internal APIs :)

I'll add in another test

@RussellSpitzer
Member Author

RussellSpitzer commented Jan 29, 2021

@rdblue I have an additional issue :/

In SparkParquetReaders we call

      ColumnDescriptor desc = type.getColumnDescription(currentPath());

Which fails when traversing with

Arrived at primitive node, path invalid
org.apache.parquet.io.InvalidRecordException: Arrived at primitive node, path invalid
	at org.apache.parquet.schema.PrimitiveType.getMaxRepetitionLevel(PrimitiveType.java:665)
	at org.apache.parquet.schema.GroupType.getMaxRepetitionLevel(GroupType.java:294)
	at org.apache.parquet.schema.GroupType.getMaxRepetitionLevel(GroupType.java:294)
	at org.apache.parquet.schema.MessageType.getMaxRepetitionLevel(MessageType.java:77)
	at org.apache.parquet.schema.MessageType.getColumnDescription(MessageType.java:94)
	at org.apache.iceberg.spark.data.SparkParquetReaders$ReadBuilder.primitive(SparkParquetReaders.java:222)

I'm not sure what the "desc" is supposed to be here. There is no issue with writing the file or reading it back using the generic Parquet read path, but once I use the SparkParquetReader I hit this exception.

Do you have any ideas how this can be fixed?

@github-actions github-actions bot added the spark label Jan 29, 2021
@RussellSpitzer
Member Author

Figured out Array. I may have to write a test for a top-level repeated binary array as well :/

@RussellSpitzer
Member Author

RussellSpitzer commented Jan 29, 2021

malformed_parquet_not_txt.txt

This is the Parquet file generated by the test, in case anyone wants to take a look at what we are dealing with here using another framework.

Contributor

@rdblue rdblue left a comment


Looks good overall.

@RussellSpitzer
Member Author

Formatting fixes are in. @rdblue, thanks for looking over this, I really appreciate it.

@RussellSpitzer RussellSpitzer force-pushed the FixParquetRepeatedBytes branch from a832004 to d7e8373 Compare February 8, 2021 20:13
@rdblue
Contributor

rdblue commented Feb 8, 2021

I've been thinking about this more and I'm leaning toward working around it instead. I think the problem is that the Parquet/Avro writer uses the old list format by default to avoid breaking existing pipelines. But there should be an easy way to update the behavior to produce records that Iceberg accepts by setting parquet.avro.write-old-list-structure=false (the property defaults to true).

If we can fix it that way, then I think we should go with that. Otherwise, we're implementing only part of the backward-compatibility rules from Parquet. I'm not sure what the impact on compatibility would be if we implement the rules only partially, so the safer thing is to implement all of the backward-compatibility rules. But that's a bigger change and more to maintain (which is why we don't support 2-level lists in the first place). So I think the preferred solution is to avoid the problem instead.
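If we go the workaround route, writers built on parquet-avro can opt into the modern list layout by setting that property on the Hadoop configuration before building the writer. A minimal sketch, assuming parquet-avro and hadoop-common are on the classpath; the class name, output location, and schema variable are placeholders:

```java
import org.apache.avro.Schema;
import org.apache.avro.generic.GenericRecord;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.parquet.avro.AvroParquetWriter;
import org.apache.parquet.hadoop.ParquetWriter;

public class ModernListWriter {
  public static ParquetWriter<GenericRecord> open(Schema avroSchema, String location)
      throws java.io.IOException {
    Configuration conf = new Configuration();
    // Emit 3-level lists instead of the legacy 2-level repeated primitive,
    // so the resulting files are readable by Iceberg without any converter changes.
    conf.setBoolean("parquet.avro.write-old-list-structure", false);
    return AvroParquetWriter.<GenericRecord>builder(new Path(location))
        .withSchema(avroSchema)
        .withConf(conf)
        .build();
  }
}
```

The caveat rdblue notes still applies: flipping this property changes the on-disk layout, so existing pipelines that assume the old 2-level structure would need to be checked before enabling it.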
