Skip to content

Conversation

@nsivabalan
Copy link
Contributor

@nsivabalan nsivabalan commented Apr 4, 2021

What is the purpose of the pull request

For Union schema types, avro expects "null" to be first entry for null to be considered as default value. But since Hudi leverages SchemaConverters.toAvroType(...) from org.apache.spark:spark-avro_* library, structType to avro results in "null" being 2nd entry for UNION type schemas. Also, there is no default value set in this avro schema thus generated. This patch fixes this issue.

This also fixes simple schema evolution w/ MOR tables. adding new columns was failing if not for this patch.

For eg:
if incoming structType is
StructType(StructField(rowId,StringType,true), StructField(partitionId,StringType,true), StructField(preComb,LongType,true), StructField(name,StringType,true))

Generated avro scheme if not for this patch:
{"type":"record","name":"hudi_trips_cow_record","namespace":"hoodie.hudi_trips_cow","fields":[{"name":"rowId","type":["string","null"]},{"name":"partitionId","type":["string","null"]},{"name":"preComb","type":["string","null"]},{"name":"name","type":["string","null"]}]}

Note that "null" is 2nd entry in UNIONs and there is no default set.

Generated avro schema with this patch:
{"type":"record","name":"hudi_trips_cow_record","namespace":"hoodie.hudi_trips_cow","fields":[{"name":"rowId","type":["null","string"],"default":null},{"name":"partitionId","type":["null","string"],"default":null},{"name":"preComb","type":["null","long"],"default":null},{"name":"name","type":["null","string"],"default":null},{"name":"newField","type":["null","string"],"default":null}]}

Note that default value is null and not null in string format. So, this should work for other data types as well(not just strings). Default value of null is allowed only if null is the first entry in UNION for a given schema field.

Brief change log

  • Fixing struct type to avro schema conversion to fix null as first entry in UNION schema types and adds default values for the same.

Verify this pull request

This change added tests and can be verified as follows:

  • Added tests to HoodieSparkSqlWriterSuite // New test added fails if not for the fix in source code for MOR table and succeeds w/ the fix.

Committer checklist

  • Has a corresponding JIRA in PR title & commit

  • Commit message is descriptive of the change

  • CI is green

  • Necessary doc changes done or have another open PR

  • For large changes, please consider breaking it into sub-tasks under an umbrella JIRA.

@nsivabalan nsivabalan requested a review from n3nash April 4, 2021 04:07
@nsivabalan nsivabalan added the priority:critical Production degraded; pipelines stalled label Apr 4, 2021
@nsivabalan
Copy link
Contributor Author

@bvaradar : I see that SchemaConverters.toAvroType(...) is used in 3 places in hudi code base. 2 of those have been fixed as part of this patch. Wondering if we need to fix HoodieSparkBootstrapSchemaProvider as well.

@nsivabalan nsivabalan added the area:schema Schema evolution and data types label Apr 4, 2021
@aditiwari01
Copy link
Contributor

@bvaradar : I see that SchemaConverters.toAvroType(...) is used in 3 places in hudi code base. 2 of those have been fixed as part of this patch. Wondering if we need to fix HoodieSparkBootstrapSchemaProvider as well.

Yes, One of the tests fails because of that. Should be fixed for bootstrap also.

field.schema().getType match {
case Schema.Type.RECORD => {
val newSchema = getAvroSchemaWithDefaults(field.schema())
new Schema.Field(field.name(), newSchema, field.doc(), JsonProperties.NULL_VALUE)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why are we setting null default here?
if a field is nullabe, it would be of type Union. IMO Record field must nnot have default.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

sure. feel free to fix it. the one you had in your draft commit was using deprecated constructor and thought of fixing it. we could probably do field.defaultValue()

@nsivabalan
Copy link
Contributor Author

Closing this in favor of #2765

@nsivabalan nsivabalan closed this Apr 5, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

area:schema Schema evolution and data types priority:critical Production degraded; pipelines stalled

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants