Conversation

@shardulm94 (Contributor)

Avro files written by non-Iceberg writers can contain optional schemas where the NULL schema is second in the options list. If there is a default value associated with the field, we need to ensure that our visitors preserve this ordering; otherwise it can lead to errors like org.apache.avro.AvroTypeException: Invalid default for field field: [] not a ["null",{"type":"array","items":"long"}]. This is because the Avro spec requires the type of the default value to match the first option in the union.

The changes should be limited to the code paths that interact with non-Iceberg Avro files, so I believe the visitors used in ProjectionDatumReader are the only ones affected.
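
For illustration, here is a minimal, hypothetical sketch (not code from this PR) that reproduces the failure with the plain Avro API, assuming an Avro version that validates field defaults at construction time:

import java.util.Collections;
import org.apache.avro.Schema;

public class UnionOrderRepro {
  public static void main(String[] args) {
    Schema arrayOfLong = Schema.createArray(Schema.create(Schema.Type.LONG));
    Schema nullType = Schema.create(Schema.Type.NULL);

    // Non-Iceberg writers may order the union as [array, "null"]; an empty
    // list is then a valid default because it matches the first option.
    new Schema.Field("field", Schema.createUnion(arrayOfLong, nullType),
        null, Collections.emptyList());

    // Rebuilding the field with the union reordered to ["null", array] while
    // keeping the [] default throws:
    //   org.apache.avro.AvroTypeException: Invalid default for field field:
    //   [] not a ["null",{"type":"array","items":"long"}]
    new Schema.Field("field", Schema.createUnion(nullType, arrayOfLong),
        null, Collections.emptyList());
  }
}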

Error stack traces:

org.apache.avro.AvroTypeException: Invalid default for field field: [] not a ["null",{"type":"array","items":"int","element-id":1}]

	at org.apache.avro.Schema.validateDefault(Schema.java:1540)
	at org.apache.avro.Schema.access$500(Schema.java:87)
	at org.apache.avro.Schema$Field.<init>(Schema.java:521)
	at org.apache.avro.Schema$Field.<init>(Schema.java:567)
	at org.apache.iceberg.avro.PruneColumns.copyField(PruneColumns.java:252)
	at org.apache.iceberg.avro.PruneColumns.record(PruneColumns.java:83)
	at org.apache.iceberg.avro.PruneColumns.record(PruneColumns.java:34)
	at org.apache.iceberg.avro.AvroSchemaVisitor.visit(AvroSchemaVisitor.java:50)
	at org.apache.iceberg.avro.PruneColumns.rootSchema(PruneColumns.java:46)
	at org.apache.iceberg.avro.AvroSchemaUtil.pruneColumns(AvroSchemaUtil.java:99)
	at org.apache.iceberg.avro.ProjectionDatumReader.setSchema(ProjectionDatumReader.java:59)


org.apache.avro.AvroTypeException: Invalid default for field field: [] not a ["null",{"type":"array","items":"long"}]

	at org.apache.avro.Schema.validateDefault(Schema.java:1540)
	at org.apache.avro.Schema.access$500(Schema.java:87)
	at org.apache.avro.Schema$Field.<init>(Schema.java:521)
	at org.apache.avro.Schema$Field.<init>(Schema.java:567)
	at org.apache.iceberg.avro.AvroSchemaUtil.copyField(AvroSchemaUtil.java:362)
	at org.apache.iceberg.avro.BuildAvroProjection.field(BuildAvroProjection.java:134)
	at org.apache.iceberg.avro.BuildAvroProjection.field(BuildAvroProjection.java:41)
	at org.apache.iceberg.avro.AvroCustomOrderSchemaVisitor$VisitFieldFuture.get(AvroCustomOrderSchemaVisitor.java:124)
	at org.apache.iceberg.relocated.com.google.common.collect.Iterators$6.transform(Iterators.java:783)
	at org.apache.iceberg.relocated.com.google.common.collect.TransformedIterator.next(TransformedIterator.java:47)
	at org.apache.iceberg.relocated.com.google.common.collect.Iterators.addAll(Iterators.java:356)
	at org.apache.iceberg.relocated.com.google.common.collect.Lists.newArrayList(Lists.java:143)
	at org.apache.iceberg.relocated.com.google.common.collect.Lists.newArrayList(Lists.java:130)
	at org.apache.iceberg.avro.BuildAvroProjection.record(BuildAvroProjection.java:60)
	at org.apache.iceberg.avro.BuildAvroProjection.record(BuildAvroProjection.java:41)
	at org.apache.iceberg.avro.AvroCustomOrderSchemaVisitor.visit(AvroCustomOrderSchemaVisitor.java:51)
	at org.apache.iceberg.avro.AvroSchemaUtil.buildAvroProjection(AvroSchemaUtil.java:104)
	at org.apache.iceberg.avro.ProjectionDatumReader.setSchema(ProjectionDatumReader.java:60)

@rdblue (Contributor) commented Jun 22, 2020

I think that fields with non-null defaults should have the default removed.

In Iceberg, the default value for an optional column is always null. Non-null defaults are not allowed. That's why null is always the first type in the option union.

An Iceberg table schema does not have default values for columns, so it will never actually use Avro to fill in a default value. The read schemas that Iceberg passes to Avro are based on the write schema and the current table schema (for renames). The default value only comes from the write schema, so it will never be applied because the writer knew about the column and wrote values into the file. Avro defaults are only used when projecting a column that isn't in the file.

Because the default value here only comes from the write schema and will never actually be used, I think the right thing is to simply remove any default values when building the Avro projection schema.
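
To make that last point concrete, here is a small sketch (illustrative names, not from the PR) of the only situation in which Avro consults a default during schema resolution: the reader's schema has a field the writer's schema lacks.

import org.apache.avro.Schema;
import org.apache.avro.SchemaBuilder;
import org.apache.avro.generic.GenericDatumReader;
import org.apache.avro.generic.GenericRecord;

public class DefaultResolutionSketch {
  public static void main(String[] args) {
    Schema writeSchema = SchemaBuilder.record("r").fields()
        .requiredLong("id")
        .endRecord();

    // "extra" is absent from the data, so resolution fills it with its null
    // default; any default on "id" is ignored because the writer wrote real
    // values for it.
    Schema readSchema = SchemaBuilder.record("r").fields()
        .requiredLong("id")
        .optionalLong("extra")  // union ["null","long"] with null default
        .endRecord();

    GenericDatumReader<GenericRecord> reader =
        new GenericDatumReader<>(writeSchema, readSchema);
    System.out.println(reader.getExpected());
  }
}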

shardulm94 force-pushed the avro-union-order-fix branch from c58716c to 14fb955 on June 23, 2020 at 06:29
@shardulm94 (Contributor Author)

@rdblue That makes sense! Thanks for pointing that out. I understand the reasoning behind your suggestion and have tried to incorporate it in the latest commit. However, I am not sure if the implementation is the best way to go about it. Let me know if you have a better idea.

+ // do not copy over non-null default values as the file is expected to have values for fields in the file schema
  Schema.Field copy = new Schema.Field(field.name(),
-     newSchema, field.doc(), field.defaultVal(), field.order());
+     newSchema, field.doc(), hasNonNullDefault(field) ? null : field.defaultVal(), field.order());

@rdblue (Contributor)

Why detect non-null here? Couldn't this always pass null because it is either null already or will be replaced with null?

@rdblue (Contributor)

I think this should be similar to field construction in the converter. Instead of using hasNonNullDefault, this should be based on whether the value needs a default. In the converter, we use this: structField.isOptional() ? JsonProperties.NULL_VALUE : null. For Avro, that would be isOptionSchema(newSchema) ? JsonProperties.NULL_VALUE : null.

That way any optional field gets a null default and the default is left unspecified for required fields.
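
Concretely, the copy would then look something like this (a sketch assuming Iceberg's AvroSchemaUtil.isOptionSchema helper is in scope):

// Optional fields get an explicit null default; required fields get none.
Schema.Field copy = new Schema.Field(field.name(),
    newSchema, field.doc(),
    AvroSchemaUtil.isOptionSchema(newSchema) ? JsonProperties.NULL_VALUE : null,
    field.order());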

@shardulm94 (Contributor Author) commented Jun 23, 2020

The issue is that PruneColumns#copyField can also be called at https://github.com/apache/iceberg/blob/14fb95519b3442092eb6aa02a2608e97e2e8dfd8/core/src/main/java/org/apache/iceberg/avro/PruneColumns.java#L88. In that case, it is possible that newSchema is an option schema but with the NULL type as the second option (if that is how it is in the file schema), and hence JsonProperties.NULL_VALUE is not an appropriate default.

@rdblue (Contributor)

If the schema is an option schema, then its default should be null. That means that we would need to reorder the options to allow that default, I think.

@shardulm94 (Contributor Author)

👍 So I think this PR now simplifies to "if we have an option schema where the first option is not NULL, we should reorder it to ensure it is NULL while building the projection schema". Updated.
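
A minimal sketch of that reordering (hypothetical helper, not the exact code merged in this PR):

import java.util.List;
import org.apache.avro.Schema;

class UnionReorder {
  // If a two-branch option union lists NULL second, swap the branches so NULL
  // comes first; JsonProperties.NULL_VALUE is then a valid default for any
  // field using this schema.
  static Schema toNullFirst(Schema union) {
    List<Schema> types = union.getTypes();
    if (types.size() == 2 && types.get(1).getType() == Schema.Type.NULL) {
      return Schema.createUnion(types.get(1), types.get(0));
    }
    return union;  // already NULL-first, or not a simple option schema
  }
}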

import static org.apache.avro.Schema.Type.LONG;
import static org.apache.avro.Schema.Type.NULL;

public class TestAvroOptionsWithNonNullDefaults {

@rdblue (Contributor)

Tests look good to me.

@rdblue (Contributor) commented Jun 23, 2020

@shardulm94, this looks really close to me. I think we just need to update the default value logic a little, as I noted above.

rdblue merged commit d1ba7b6 into apache:master on Jun 23, 2020
@rdblue (Contributor) commented Jun 23, 2020

Looks great, thanks @shardulm94!
