@shenodaguirguis shenodaguirguis commented May 15, 2021

Last change of 3 (see description: apache/iceberg#2496 (comment))
The main changes here are in BuildAvroProjection.java and PruneColumns.java, which make sure that the default value is copied over and used when reading projected columns with default values.
The other changes are utilities and tests.

There will be a follow-up PR to handle default values in serialization/deserialization, but this can proceed in parallel with the ORC and Parquet changes.

Update: during integration testing, I had to implement the default value serialization/deserialization after all. Selecting missing columns that have default values now works in the Spark shell and returns the default value:

// required with non-null default
scala> spark.sql(s"select * from u_sguirgui.testi6").show
+---------+----------------+
|firstname|        lastname|
+---------+----------------+
|        f|              l.|
|     Adam|default lastname| <-- omitting the required col altogether
+---------+----------------+

// optional with default value
scala> DaliSpark.createDataFrame("dalids:///u_sguirgui.test5", Map(DaliSpark.PROJECT_COLS -> "firstname, lastname")).show
+---------+----------------+
|firstname|        lastname|
+---------+----------------+
|     Abel|            Adam|
|     Adam|            null| <-- using the null option
|     Adam|default lastname| <-- omitting the col altogether
+---------+----------------+

Also, integration testing showed that Avro default values for records are Maps, not Lists, so I had to change that in the NestedField class.

@shenodaguirguis
Author

@wmoustafa @funcheetah @HotSushi please take a look

Comment on lines +125 to +131
Schema.Field newField = new Schema.Field(newFieldName, newFiledSchema, null, defaultValue);
newField.addProp(AvroSchemaUtil.FIELD_ID_PROP, expectedField.fieldId());

This branch seems to share code with the outer if branch. Do you see room to combine and simplify?

Author

We have 4 cases (for 2 conditions). For three of them we create a new field and add it to updatedFields, but the fields differ. The only thing that could be done is to extract a shallow 3-line function with 5 arguments. I prefer to keep it as is.

Schema newSchemaReordered;
// if the newSchema is an optional schema, make sure the NULL option is always the first
if (isOptionSchemaWithNonNullFirstOption(newSchema)) {
// if the newSchema is an optional schema with no, or null, default value, then make sure the

How about: "if the newSchema is an optional schema or has a default value that is null, then make sure the NULL option is first"?

Author

(optional_no_default || optional_null_default) != (optional || has_default)
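
To make the inequality above concrete, here is a minimal sketch that enumerates the cases. The class, method, and flag names are hypothetical and do not come from the PR; the sketch only models the boolean logic of the two conditions.

```java
// Hypothetical illustration: none of these names come from the PR.
final class NullFirstConditionDemo {

  // Condition from the code comment: optional schema with no default, or with a null default.
  static boolean codeCommentCondition(boolean optional, boolean hasDefault, boolean defaultIsNull) {
    boolean optionalNoDefault = optional && !hasDefault;
    boolean optionalNullDefault = optional && hasDefault && defaultIsNull;
    return optionalNoDefault || optionalNullDefault;
  }

  // Broader condition: optional schema, or any default value at all.
  static boolean broaderCondition(boolean optional, boolean hasDefault) {
    return optional || hasDefault;
  }

  public static void main(String[] args) {
    boolean[] flags = {false, true};
    for (boolean optional : flags) {
      for (boolean hasDefault : flags) {
        for (boolean defaultIsNull : flags) {
          if (!hasDefault && defaultIsNull) {
            continue; // a missing default cannot also be a null default
          }
          System.out.printf(
              "optional=%b hasDefault=%b defaultIsNull=%b -> codeComment=%b broader=%b%n",
              optional, hasDefault, defaultIsNull,
              codeCommentCondition(optional, hasDefault, defaultIsNull),
              broaderCondition(optional, hasDefault));
        }
      }
    }
  }
}
```

The two conditions disagree, for example, on an optional field with a non-null default: the first is false and the second is true.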

if (isOptionSchemaWithNonNullFirstOption(newSchema)) {
// if the newSchema is an optional schema with no, or null, default value, then make sure the
// NULL option is the first
boolean hasNonNullDefaultValue = field.hasDefaultValue() && AvroSchemaUtil.isNonNullDefault(field.defaultVal());

Refactoring as recommended above may help here.
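
The reordering rule stated in the code comment (NULL goes first only when there is no non-null default) can be sketched as follows. This is a simplified model, not the PR's code: union branches are plain type-name strings rather than Avro Schema objects, and `NullFirstReorderDemo` and `reorder` are hypothetical names. Presumably the non-null-default case keeps its order because Avro has traditionally expected a union field's default to match the first branch, but that motivation is an assumption here.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical model: branches are type names, not real Avro schemas.
final class NullFirstReorderDemo {

  // Moves "null" to the front unless the field carries a non-null default value.
  static List<String> reorder(List<String> branches, boolean hasNonNullDefault) {
    boolean nothingToDo =
        hasNonNullDefault || !branches.contains("null") || branches.get(0).equals("null");
    if (nothingToDo) {
      return new ArrayList<>(branches); // keep the original order
    }
    List<String> result = new ArrayList<>();
    result.add("null");
    for (String branch : branches) {
      if (!branch.equals("null")) {
        result.add(branch);
      }
    }
    return result;
  }

  public static void main(String[] args) {
    System.out.println(reorder(List.of("string", "null"), false));
    System.out.println(reorder(List.of("string", "null"), true));
  }
}
```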

Object defaultValue = field.hasDefaultValue() && !(field.defaultVal() instanceof JsonProperties.Null) ?
field.defaultVal() : null;
Object defaultValue =
field.hasDefaultValue() && AvroSchemaUtil.isNonNullDefault(field.defaultVal()) ? field.defaultVal() : null;

What is the value of calling isNonNullDefault here? Isn't it just another way to do the same check as before, !(field.defaultVal() instanceof JsonProperties.Null)?


private static String toJsonString(Object value) {
if (isPrimitiveClass(value)) {
return value.toString();

For FIXED and BINARY, should we do something like Base64.getEncoder().encodeToString(bytes)? (And the opposite transformation on the deserialization side.)

Author

good catch! on it

Author

wrapped into ByteBuffer
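
For readers following this thread, here is a minimal sketch of the round trip being discussed: encoding binary default values as Base64 text and wrapping the decoded bytes in a ByteBuffer. The PR's actual implementation is not shown in this conversation, so the class and method names below (`BinaryDefaultCodecDemo`, `encode`, `decode`) are hypothetical; only `java.util.Base64` and `java.nio.ByteBuffer` are real APIs.

```java
import java.nio.ByteBuffer;
import java.util.Base64;

// Hypothetical helper, not the PR's code.
final class BinaryDefaultCodecDemo {

  // Serializes the remaining bytes of the buffer as a Base64 string.
  static String encode(ByteBuffer buffer) {
    ByteBuffer copy = buffer.duplicate(); // leave the caller's position untouched
    byte[] bytes = new byte[copy.remaining()];
    copy.get(bytes);
    return Base64.getEncoder().encodeToString(bytes);
  }

  // Deserializes a Base64 string back into a ByteBuffer.
  static ByteBuffer decode(String encoded) {
    return ByteBuffer.wrap(Base64.getDecoder().decode(encoded));
  }

  public static void main(String[] args) {
    ByteBuffer original = ByteBuffer.wrap(new byte[] {0x00, 0x7f, (byte) 0xff});
    String text = encode(original);
    ByteBuffer roundTripped = decode(text);
    System.out.println(text + " round-trips: " + original.equals(roundTripped));
  }
}
```

Calling `value.toString()` on a byte array or ByteBuffer would not produce a reversible representation, which is why an explicit encoding such as Base64 is needed on both sides.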

@funcheetah funcheetah left a comment

Thanks @shenodaguirguis! This PR looks good to me in general; I left a few minor comments. One question: are default values for complex unions supported?

@shenodaguirguis shenodaguirguis force-pushed the master branch 2 times, most recently from c5de40a to 1a15a81 Compare June 16, 2021 19:26

@funcheetah funcheetah left a comment

Thanks a lot @shenodaguirguis for this feature!
