Fix errors when reading options in Avro files with non-null defaults #1132
Conversation
I think that fields with non-null defaults should have the default removed. In Iceberg, the default value for an optional column is always null. Non-null defaults are not allowed. That's why null is always the first type in the option union. An Iceberg table schema does not have default values for columns, so it will never actually use Avro to fill in a default value. The read schemas that Iceberg passes to Avro are based on the write schema and the current table schema (for renames). The default value only comes from the write schema, so it will never be applied because the writer knew about the column and wrote values into the file. Avro defaults are only used when projecting a column that isn't in the file. Because the default value here only comes from the write schema and will never actually be used, I think the right thing is to simply remove any default values when building the Avro projection schema.
Force-pushed from c58716c to 14fb955
@rdblue That makes sense! Thanks for pointing that out. I understand the reasoning behind your suggestion and have tried to incorporate it in the latest commit. However, I am not sure if the implementation is the best way to go about it. Let me know if you have a better idea.
```diff
+ // do not copy over non-null default values as the file is expected to have values for fields in the file schema
  Schema.Field copy = new Schema.Field(field.name(),
-     newSchema, field.doc(), field.defaultVal(), field.order());
+     newSchema, field.doc(), hasNonNullDefault(field) ? null : field.defaultVal(), field.order());
```
Why detect non-null here? Couldn't this always pass null because it is either null already or will be replaced with null?
I think this should be similar to field construction in the converter. Instead of using hasNonNullDefault, this should be based on whether the value needs a default. In the converter, we use this: structField.isOptional() ? JsonProperties.NULL_VALUE : null. For Avro, that would be isOptionSchema(newSchema) ? JsonProperties.NULL_VALUE : null.
That way any optional field gets a null default and the default is left unspecified for required fields.
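The selection rule described above can be sketched outside of the real converter code. This is a simplified illustration, not the actual Iceberg implementation: the `NULL_DEFAULT` sentinel stands in for Avro's `JsonProperties.NULL_VALUE`, and `projectedDefault` is a hypothetical helper.

```java
public class ProjectionDefaultSketch {
    // Stand-in for Avro's JsonProperties.NULL_VALUE sentinel, which marks an
    // explicit null default (as opposed to "no default at all").
    static final Object NULL_DEFAULT = new Object();

    // Hypothetical helper mirroring the rule discussed above: the file
    // schema's default is ignored entirely; an option (nullable) field gets
    // an explicit null default, and a required field gets no default.
    static Object projectedDefault(boolean isOptionSchema) {
        return isOptionSchema ? NULL_DEFAULT : null;
    }
}
```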
The issue is that PruneColumns#copyField can also be called at https://github.com/apache/iceberg/blob/14fb95519b3442092eb6aa02a2608e97e2e8dfd8/core/src/main/java/org/apache/iceberg/avro/PruneColumns.java#L88. In that case, it is possible that newSchema is an option schema but with the NULL type as the second option (if that is how it is in the file schema) and hence JsonProperties.NULL_VALUE is not an appropriate default.
If the schema is an option schema, then its default should be null. That means that we would need to reorder the options to allow that default, I think.
👍 So I think this PR now simplifies to "if we have an option schema where the first option is not NULL, we should reorder it to ensure it is NULL while building the projection schema". Updated.
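The reordering agreed on here can be sketched without the Avro dependency by modeling union branches as plain type-name strings; the real code would operate on org.apache.avro.Schema objects, so treat this as an illustration of the idea only.

```java
import java.util.ArrayList;
import java.util.List;

public class NullFirstSketch {
    // Sketch of the projection-schema fix discussed above: if a union
    // contains "null" but not in the first position, move it to the front so
    // that a null default is valid under the Avro spec.
    static List<String> nullFirst(List<String> branches) {
        if (!branches.contains("null") || "null".equals(branches.get(0))) {
            return branches; // already null-first, or not an option schema
        }
        List<String> reordered = new ArrayList<>();
        reordered.add("null");
        for (String branch : branches) {
            if (!"null".equals(branch)) {
                reordered.add(branch);
            }
        }
        return reordered;
    }
}
```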
```java
import static org.apache.avro.Schema.Type.LONG;
import static org.apache.avro.Schema.Type.NULL;

public class TestAvroOptionsWithNonNullDefaults {
```
Tests look good to me.
@shardulm94, this looks really close to me. I think we just need to update the default value logic a little, as I noted above.

Looks great, thanks @shardulm94!
Avro files written by non-Iceberg writers can contain optional schemas where the NULL schema is second in the options list. If there is a default value associated with the field, we need to ensure that our visitors preserve this ordering, or else it can lead to errors like:

org.apache.avro.AvroTypeException: Invalid default for field field: [] not a ["null",{"type":"array","items":"long"}]

This is because the Avro spec requires the type of the default value to match the first option in the union schema. The changes should be limited to the code paths which interact with non-Iceberg Avro files, so I believe the visitors used in ProjectionDatumReader are the only ones affected.

Error stacktraces: