Fix for Conversion of Parquet ByteArray to Iceberg Schema #2167
Conversation
@aokolnychyi + @rdblue This is the issue I was talking about before.
Previously the Iceberg conversion functions for Parquet would throw an exception if they encountered a Binary type field. This was internally represented as a repeated primitive field that is not nested in another group type. This violated some expectations within our schema conversion code.
@rdblue, could you help on this one?
Force-pushed from bc87c79 to 40b55c5 (Compare)
This doesn't seem that bad. Can you add a test that writes a Parquet file with that schema and validates that Spark can read it? Maybe a …
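A sketch of what such a test might look like, assuming the array-of-bytes schema discussed in this PR. Names, paths, and the plain-Spark read here are illustrative; the actual test added in this PR (TestMalformedParquetFromAvro) reads through Iceberg's Spark reader instead.

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.util.Collections;

import org.apache.avro.Schema;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericRecord;
import org.apache.hadoop.fs.Path;
import org.apache.parquet.avro.AvroParquetWriter;
import org.apache.parquet.hadoop.ParquetWriter;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class TestTwoLevelListRead {

  public static void main(String[] args) throws IOException {
    Schema avroSchema = new Schema.Parser().parse(
        "{\"type\": \"record\", \"name\": \"test\", \"fields\": ["
            + "{\"name\": \"byte_list\", \"type\": "
            + "{\"type\": \"array\", \"items\": \"bytes\"}}]}");

    GenericRecord record = new GenericData.Record(avroSchema);
    record.put("byte_list",
        Collections.singletonList(ByteBuffer.wrap(new byte[] {1, 2, 3})));

    String path = "/tmp/two_level_list.parquet";
    // Default parquet-avro settings write the legacy two-level list structure.
    try (ParquetWriter<GenericRecord> writer =
        AvroParquetWriter.<GenericRecord>builder(new Path(path))
            .withSchema(avroSchema)
            .build()) {
      writer.write(record);
    }

    // Validate that Spark can read the file back.
    SparkSession spark = SparkSession.builder().master("local[2]").getOrCreate();
    Dataset<Row> df = spark.read().parquet(path);
    if (df.count() != 1) {
      throw new AssertionError("Expected one row");
    }
    spark.stop();
  }
}
```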
@rdblue Just bad enough, I hope. I really have no idea what's going on here at a spec level; I'm just trying to match what I see as valid from the internal APIs :) I'll add in another test.
@rdblue I have an additional issue :/ In SparkParquetReaders we call `ColumnDescriptor desc = type.getColumnDescription(currentPath());`, which fails when traversing. I'm not sure what the "desc" is supposed to be here. There is no issue with writing the file or reading it back with the generic Parquet.read() path, but once I use the SparkParquetReader I hit the issue. Do you have any ideas how this can be fixed?
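For context, a runnable sketch of the lookup in question, using the legacy two-level schema this PR deals with. The schema string and path values are illustrative, and the explanation of the failure mode is an assumption, not taken from the PR:

```java
import org.apache.parquet.column.ColumnDescriptor;
import org.apache.parquet.schema.MessageType;
import org.apache.parquet.schema.MessageTypeParser;

// getColumnDescription() resolves a leaf column by its full path. For the
// legacy two-level structure the leaf path is {"byte_list", "array"}; if the
// visitor's currentPath() accumulates anything else while traversing the
// repeated field (e.g. the three-level {"byte_list", "list", "element"}),
// the lookup fails.
MessageType fileSchema = MessageTypeParser.parseMessageType(
    "message avro_written {"
        + "  required group byte_list (LIST) {"
        + "    repeated binary array;"
        + "  }"
        + "}");
String[] path = {"byte_list", "array"};
ColumnDescriptor desc = fileSchema.getColumnDescription(path);
System.out.println(desc);
```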
Figured out the Array issue. I may have to write a test for a top-level repeated binary array as well :/
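Presumably that case is a bare repeated primitive at the message root, along these lines (illustrative, not taken from the PR):

```
message top_level {
  repeated binary data;
}
```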
This is the Parquet file generated by the test, in case anyone wants to take a look at what we are dealing with here using another framework.
Resolved review threads:
parquet/src/main/java/org/apache/iceberg/parquet/TypeWithSchemaVisitor.java
spark/src/test/java/org/apache/iceberg/spark/data/TestMalformedParquetFromAvro.java
parquet/src/main/java/org/apache/iceberg/parquet/ParquetTypeVisitor.java
rdblue left a comment:
Looks good overall.
Formatting fixes are in. @rdblue, thanks for looking over this, I really appreciate it.
Force-pushed from a832004 to d7e8373 (Compare)
I've been thinking about this more and I'm leaning toward trying to work around it. I think the problem is that the Parquet/Avro writer uses the old list format by default to avoid breaking existing pipelines. But there should be an easy way to update the behavior to produce records that Iceberg accepts by setting …

If we can fix it that way, then I think we should go with that. Otherwise, we're implementing only part of the backward-compatibility rules from Parquet. I'm not sure what the impact would be on compatibility if we only partially implement the rules, so I think the safer thing is to just implement all of the backward-compatibility rules. But that's a bigger change and more to maintain (which is why we don't support the 2-level lists in the first place). So I think the preferred solution is to avoid this instead.
Previously the Iceberg conversion functions for Parquet would throw an exception if they encountered a Binary type field. This was internally represented as a repeated primitive field that is not nested in another group type. This violated some expectations within our schema conversion code.
We encountered this with a user who was using Parquet's AvroParquetWriter class to write Parquet files. The files, while readable by Hive and Spark, were not readable by Iceberg.
Investigating this, I found that the following Avro schema element caused the problem:
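The schema snippet itself was not captured on this page; based on the PR title and description, it was presumably an array-of-bytes field along these lines:

```json
{"name": "byte_list", "type": {"type": "array", "items": "bytes"}}
```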
Parquet would convert this element into:
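The converted schema also did not survive the page capture; the legacy two-level list structure that matches the description (a repeated primitive not nested in another group) looks like this:

```
message avro_written {
  required group byte_list (LIST) {
    repeated binary array;
  }
}
```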
This violates Iceberg's reader, which assumes the list will use the nested (three-level) structure.
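For comparison, the standard three-level LIST structure that the conversion code expects nests the repeated group inside the list group:

```
message expected {
  required group byte_list (LIST) {
    repeated group list {
      required binary element;
    }
  }
}
```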
Doing a quick test, I saw that this was reproducible in the current version of Parquet and not just in our user's code.
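The quick-test snippet did not survive the page capture; one way to reproduce the check, assuming the file written in the test sketch above, is to print the schema parquet-avro actually wrote:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.parquet.hadoop.ParquetFileReader;
import org.apache.parquet.hadoop.util.HadoopInputFile;

public class PrintParquetSchema {
  public static void main(String[] args) throws Exception {
    // Path is illustrative: the file written by the test sketched earlier.
    try (ParquetFileReader reader = ParquetFileReader.open(
        HadoopInputFile.fromPath(
            new Path("/tmp/two_level_list.parquet"), new Configuration()))) {
      // On current Parquet versions this prints the two-level structure
      // shown above.
      System.out.println(reader.getFileMetaData().getSchema());
    }
  }
}
```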
To fix this, I added some tests for this particular data type and loosened some of the restrictions in our Parquet schema parsing code.