-
Notifications
You must be signed in to change notification settings - Fork 36
[#2039] Support Default Value Semantis: use default values reading AVRO files #72
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
|
@wmoustafa @funcheetah @HotSushi please take a look |
core/src/main/java/org/apache/iceberg/avro/BuildAvroProjection.java
Outdated
Show resolved
Hide resolved
core/src/main/java/org/apache/iceberg/avro/BuildAvroProjection.java
Outdated
Show resolved
Hide resolved
core/src/main/java/org/apache/iceberg/avro/BuildAvroProjection.java
Outdated
Show resolved
Hide resolved
core/src/main/java/org/apache/iceberg/avro/BuildAvroProjection.java
Outdated
Show resolved
Hide resolved
| Schema.Field newField = new Schema.Field(newFieldName, newFiledSchema, null, defaultValue); | ||
| newField.addProp(AvroSchemaUtil.FIELD_ID_PROP, expectedField.fieldId()); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This branch seems to have common code with a the out if branch. Do you see room to combine and simplify?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
we have 4 cases (for 2 conditions), for three of them we create a new field and add to updatedFileds, but different fields. The only thing that can be done is to create a shallow 3-lines function with 5 argument. I prefer to keep as is
| Schema newSchemaReordered; | ||
| // if the newSchema is an optional schema, make sure the NULL option is always the first | ||
| if (isOptionSchemaWithNonNullFirstOption(newSchema)) { | ||
| // if the newSchema is an optional schema with no, or null, default value, then make sure the |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
How about if the newSchema is an optional schema or has a default value that is null, then make sure the NULL option is the first.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
(optional_no_default || optional_null_default) != (optional || has_default)
| if (isOptionSchemaWithNonNullFirstOption(newSchema)) { | ||
| // if the newSchema is an optional schema with no, or null, default value, then make sure the | ||
| // NULL option is the first | ||
| boolean hasNonNullDefaultValue = field.hasDefaultValue() && AvroSchemaUtil.isNonNullDefault(field.defaultVal()); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Refactoring as recommended above may help here.
| Object defaultValue = field.hasDefaultValue() && !(field.defaultVal() instanceof JsonProperties.Null) ? | ||
| field.defaultVal() : null; | ||
| Object defaultValue = | ||
| field.hasDefaultValue() && AvroSchemaUtil.isNonNullDefault(field.defaultVal()) ? field.defaultVal() : null; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What is the value of calling isNonNullDefault here? It just another way to do the same check as before !(field.defaultVal() instanceof JsonProperties.Null)?
core/src/test/java/org/apache/iceberg/TestSchemaParserForDefaultValues.java
Outdated
Show resolved
Hide resolved
core/src/test/java/org/apache/iceberg/TestSchemaParserForDefaultValues.java
Show resolved
Hide resolved
|
|
||
| private static String toJsonString(Object value) { | ||
| if (isPrimitiveClass(value)) { | ||
| return value.toString(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For FIXED and BINARY, should do something like Base64.getEncoder().encodeToString(bytes)? (also the opposite transformation on the deserialization side).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
good catch! on it
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
wrapped into ByteBuffer
funcheetah
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @shenodaguirguis ! This PR looks good to me in general and leave a few minor comments. One question I have: is default value for complex union supported?
core/src/test/java/org/apache/iceberg/avro/TestAvroOptionsWithNonNullDefaults.java
Outdated
Show resolved
Hide resolved
c5de40a to
1a15a81
Compare
funcheetah
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks a lot @shenodaguirguis for this feature!
Last change of 3 (see description: apache/iceberg#2496 (comment))
The main changes here are in BuildAvroProjection.java and PruneColumns.java which makes sure that default value is copied over and used while reading projected columns with default values.
Other changes are utils, and testing changes.
There will be a followup PR to handle default values in serialization/deserializaiton, but this can go on parallel with ORC and Parquet changes.
[Update: during integration testing, I had to implement the default value serialization/deserialization. It works now on spark shell to select missing columns with default values and return the default value:
Also, during integration testing, it turns out that Avro default values for records are Maps, not Lists, so had to change that in the NestedField class.