Fix errors when reading options in Avro files with non-null defaults #1132
Conversation
I think that fields with non-null defaults should have the default removed. In Iceberg, the default value for an optional column is always null. Non-null defaults are not allowed. That's why null is always the first type in the option union. An Iceberg table schema does not have default values for columns, so it will never actually use Avro to fill in a default value. The read schemas that Iceberg passes to Avro are based on the write schema and the current table schema (for renames). The default value only comes from the write schema, so it will never be applied because the writer knew about the column and wrote values into the file. Avro defaults are only used when projecting a column that isn't in the file. Because the default value here only comes from the write schema and will never actually be used, I think the right thing is to simply remove any default values when building the Avro projection schema.
Force-pushed from c58716c to 14fb955
@rdblue That makes sense! Thanks for pointing that out. I understand the reasoning behind your suggestion and have tried to incorporate it in the latest commit. However, I am not sure if the implementation is the best way to go about it. Let me know if you have a better idea.
```diff
+ // do not copy over non-null default values as the file is expected to have values for fields in the file schema
  Schema.Field copy = new Schema.Field(field.name(),
-     newSchema, field.doc(), field.defaultVal(), field.order());
+     newSchema, field.doc(), hasNonNullDefault(field) ? null : field.defaultVal(), field.order());
```
Why detect non-null here? Couldn't this always pass null because it is either null already or will be replaced with null?
I think this should be similar to field construction in the converter. Instead of using hasNonNullDefault, this should be based on whether the value needs a default. In the converter, we use this: structField.isOptional() ? JsonProperties.NULL_VALUE : null. For Avro, that would be isOptionSchema(newSchema) ? JsonProperties.NULL_VALUE : null.
That way any optional field gets a null default and the default is left unspecified for required fields.
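The selection rule described above can be sketched outside of the real converter code. This is a simplified illustration, not the actual Iceberg implementation: the `NULL_DEFAULT` sentinel stands in for Avro's `JsonProperties.NULL_VALUE`, and `projectedDefault` is a hypothetical helper.

```java
public class ProjectionDefaultSketch {
    // Stand-in for Avro's JsonProperties.NULL_VALUE sentinel, which marks an
    // explicit null default (as opposed to "no default at all").
    static final Object NULL_DEFAULT = new Object();

    // Hypothetical helper mirroring the rule discussed above: the file
    // schema's default is ignored entirely; an option (nullable) field gets
    // an explicit null default, and a required field gets no default.
    static Object projectedDefault(boolean isOptionSchema) {
        return isOptionSchema ? NULL_DEFAULT : null;
    }
}
```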
The issue is that PruneColumns#copyField can also be called at https://github.com/apache/iceberg/blob/14fb95519b3442092eb6aa02a2608e97e2e8dfd8/core/src/main/java/org/apache/iceberg/avro/PruneColumns.java#L88. In that case, it is possible that newSchema is an option schema but with the NULL type as the second option (if that is how it is in the file schema) and hence JsonProperties.NULL_VALUE is not an appropriate default.
If the schema is an option schema, then its default should be null. That means that we would need to reorder the options to allow that default, I think.
👍 So I think this PR now simplifies to "if we have an option schema where the first option is not NULL, we should reorder it to ensure it is NULL while building the projection schema". Updated.
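The reordering agreed on here can be sketched without the Avro dependency by modeling union branches as plain type-name strings; the real code would operate on org.apache.avro.Schema objects, so treat this as an illustration of the idea only.

```java
import java.util.ArrayList;
import java.util.List;

public class NullFirstSketch {
    // Sketch of the projection-schema fix discussed above: if a union
    // contains "null" but not in the first position, move it to the front so
    // that a null default is valid under the Avro spec.
    static List<String> nullFirst(List<String> branches) {
        if (!branches.contains("null") || "null".equals(branches.get(0))) {
            return branches; // already null-first, or not an option schema
        }
        List<String> reordered = new ArrayList<>();
        reordered.add("null");
        for (String branch : branches) {
            if (!"null".equals(branch)) {
                reordered.add(branch);
            }
        }
        return reordered;
    }
}
```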
```java
import static org.apache.avro.Schema.Type.LONG;
import static org.apache.avro.Schema.Type.NULL;

public class TestAvroOptionsWithNonNullDefaults {
```
Tests look good to me.
@shardulm94, this looks really close to me. I think we just need to update the default value logic a little, as I noted above.

Looks great, thanks @shardulm94!
Avro files written by non-Iceberg writers can contain optional schemas where the NULL schema is second in the options list. If there is a default value associated with the field, we need to ensure that our visitors preserve this ordering, or else it can lead to errors like:

org.apache.avro.AvroTypeException: Invalid default for field field: [] not a ["null",{"type":"array","items":"long"}]

This is because the Avro spec requires the type of the default value to match the first option in the union schema. The changes should be limited to the code paths which interact with non-Iceberg Avro files, so I believe the visitors used in ProjectionDatumReader are the only ones affected.

Error stacktraces: