Conversation

@TengHuo
Contributor

@TengHuo TengHuo commented Nov 28, 2022

Troubleshooting details in this issue and the fix: #7284

Change Logs

  • Add a new parameter to the method HoodieBaseRelation#convertToAvroSchema for generating a qualified name for the Avro schema based on the table name
  • Use the new API AvroConversionUtils#convertStructTypeToAvroSchema in HoodieBaseRelation#convertToAvroSchema for converting a struct schema to an Avro schema
  • Add a new test case TestMorTable to verify the issue is fixed
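The qualified-name rule described above can be sketched as follows. This is an illustrative Python sketch, not Hudi's actual Scala code; the helper names are hypothetical, and the `hoodie.<table_name>` namespace pattern is taken from the discussion later in this thread.

```python
import json

# Hypothetical helper: derive an Avro record name and namespace from a
# Hudi table name, so every conversion of the same table's schema yields
# the same qualified name.
def qualified_avro_name(table_name: str) -> tuple:
    """Return (record_name, namespace) derived from the table name."""
    return f"{table_name}_record", f"hoodie.{table_name}"

def avro_record_schema(table_name: str, fields: list) -> dict:
    """Build a minimal Avro record schema dict using the qualified name."""
    name, namespace = qualified_avro_name(table_name)
    return {"type": "record", "name": name, "namespace": namespace, "fields": fields}

schema = avro_record_schema("test_mor_tab", [{"name": "id", "type": "long"}])
print(json.dumps(schema))
```

With this rule, reader and writer schemas produced for the same table carry identical full names, which is the property the fix relies on.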

Impact

No public API changed.

Risk level (write none, low medium or high below)

low

Documentation Update

No doc or configuration changed.

Contributor's checklist

  • Read through contributor's guide
  • Change Logs and Impact were stated clearly
  • Adequate tests were added if applicable
  • CI passed

@hudi-bot
Collaborator

CI report:

Bot commands — @hudi-bot supports the following commands:
  • @hudi-bot run azure: re-run the last Azure build

@codope codope added the area:schema Schema evolution and data types label Nov 28, 2022
@codope codope added priority:critical Production degraded; pipelines stalled priority:high Significant impact; potential bugs status:triaged Issue has been reviewed and categorized and removed priority:critical Production degraded; pipelines stalled labels Nov 28, 2022
@TengHuo
Contributor Author

TengHuo commented Nov 28, 2022

I found another similar issue in HoodieCatalogTable#initHoodieTable:

.setTableCreateSchema(SchemaConverters.toAvroType(finalSchema).toString())

When creating a table from an empty path, it will set the record name in hoodie.table.create.schema to $tableName_record.

However, if the path is an existing Hudi table (hoodieTableExists is true) when creating the table, it will rewrite the record name in hoodie.table.create.schema to topLevelRecord, which is the default value of recordName in SchemaConverters#toAvroType.

May I ask if this is an issue or by-design behaviour?
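A minimal sketch of the inconsistency described above, using the names quoted in the comment ($tableName_record vs. topLevelRecord). The branching function is a simplification for illustration, not Hudi's actual code path.

```python
# Hypothetical reduction of the HoodieCatalogTable#initHoodieTable behaviour:
# the record name written into hoodie.table.create.schema depends on whether
# the path already holds a Hudi table.
def create_schema_record_name(table_name: str, hoodie_table_exists: bool) -> str:
    if hoodie_table_exists:
        # Falls through to SchemaConverters.toAvroType's default recordName.
        return "topLevelRecord"
    # Fresh path: the name is derived from the table name.
    return f"{table_name}_record"

fresh = create_schema_record_name("test_tab", hoodie_table_exists=False)
existing = create_schema_record_name("test_tab", hoodie_table_exists=True)
# The same table can end up with two different record names.
assert fresh != existing
```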

@nsivabalan nsivabalan added priority:blocker Production down; release blocker release-0.12.2 Patches targetted for 0.12.2 and removed priority:high Significant impact; potential bugs labels Dec 5, 2022
@alexeykudinkin alexeykudinkin removed the release-0.12.2 Patches targetted for 0.12.2 label Dec 15, 2022
@TengHuo
Contributor Author

TengHuo commented Dec 22, 2022

Hi, is there anyone who can help review this? Really appreciated.

@alexeykudinkin alexeykudinkin added priority:critical Production degraded; pipelines stalled and removed priority:blocker Production down; release blocker labels Jan 25, 2023
@alexeykudinkin
Contributor

@TengHuo can you please rebase to the latest master and verify whether fix is still relevant?

@TengHuo
Contributor Author

TengHuo commented Jan 25, 2023

@TengHuo can you please rebase to the latest master and verify whether fix is still relevant?

@alexeykudinkin got it, np. Let me rebase it to the latest master branch.

@TengHuo
Contributor Author

TengHuo commented Jan 25, 2023

Hi @alexeykudinkin

Before rebasing, there is one thing I want to check with you.

Last week, there was an issue about a similar exception about Avro schema namespace, #7691.

And @danny0405 mentioned in that ticket that it uses a constant namespace "record" in Flink side, #7691 (comment).

On the Spark side, I found we are using a namespace pattern "namespace": "hoodie.test_mor_tab" (test_mor_tab is the Hudi table name) in the writer schema, and a constant "name": "Record" in the reader schema. #7284 (comment)

May I ask which one we should follow? I think we need to keep it consistent between Spark and Flink.

@danny0405
Contributor

danny0405 commented Jan 25, 2023

Before rebase, I have one thing want to check with you.
Last week, there was an issue about a similar exception about Avro schema namespace, #7691.
And @danny0405 mentioned in that ticket that it uses a constant namespace "record" in Flink side, #7691 (comment).
And in Spark side, I found we are using a namespace pattern "namespace": "hoodie.test_mor_tab" (test_mor_tab is Hudi table name) in writer schema, and a
constant "name": "Record" in reader schema. #7284 (comment)
May I ask which one we should follow? Think we need to keep it consistent between Spark and Flink.

Correct. We need to unify the schema handling across Flink and Spark integrations.

If the namespace check is an Avro behavior and there is no way to work around it, I'm afraid we must unify all the Avro schema namespaces for reader/writer schemas. Does the hoodie.table_name namespace make any sense here? How about we all use the constant record as the namespace name?

@danny0405 the problem is that we'd need to have different names because names are used by Avro to look up the fields within the unions
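The point about unions can be illustrated with a small sketch. Avro disambiguates record branches inside a union by their (full) names, so a reader whose branch carries a different namespace than the writer's can no longer match it. This Python snippet mimics that named-type resolution with plain dicts; it is not the Avro library itself, and the example names are hypothetical.

```python
# Mimic Avro's union-branch resolution: a record branch is matched by its
# full name ("namespace.name"). Different namespaces => no match.
def full_name(schema: dict) -> str:
    ns = schema.get("namespace", "")
    return f"{ns}.{schema['name']}" if ns else schema["name"]

def resolve_union_branch(union: list, written_full_name: str):
    """Return the reader branch whose full name matches the writer's, or None."""
    for branch in union:
        if full_name(branch) == written_full_name:
            return branch
    return None

writer_branch = {"type": "record", "name": "Record", "namespace": "hoodie.tab", "fields": []}
reader_union = [{"type": "record", "name": "Record", "namespace": "record", "fields": []}]

# Same simple name, different namespaces -> different full names -> no branch found.
assert resolve_union_branch(reader_union, full_name(writer_branch)) is None
```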

@TengHuo
Contributor Author

TengHuo commented Feb 2, 2023

Agreed @danny0405, I think it's better we unify Avro schema handling across Spark and Flink in Hudi.

Currently, we have the Avro schema utility class org.apache.hudi.avro.AvroSchemaUtils in the hudi-common module for manipulating Avro schemas. Hudi Spark uses org.apache.spark.sql.avro.SchemaConverters to convert between Spark DataType and Avro schemas, while Hudi Flink uses org.apache.hudi.util.AvroSchemaConverter to convert between Flink DataType and Avro schemas.

I noticed that the behaviour differs when setting the name of a new Avro schema.

On the Spark side, the name and namespace of the Avro schema are exposed as method parameters.

  /**
   * Converts a Spark SQL schema to a corresponding Avro schema.
   *
   * @since 2.4.0
   */
  def toAvroType(catalystType: DataType,
                 nullable: Boolean = false,
                 recordName: String = "topLevelRecord",
                 nameSpace: String = ""): Schema

reference:

On the Flink side, a constant name "record" is used:

  /**
   * Converts Flink SQL {@link LogicalType} (can be nested) into an Avro schema.
   *
   * <p>Use "record" as the type name.
   *
   * @param schema the schema type, usually it should be the top level record type, e.g. not a
   *               nested type
   * @return Avro's {@link Schema} matching this logical type.
   */
  public static Schema convertToSchema(LogicalType schema) {
    return convertToSchema(schema, "record");
  }

reference:

(please correct me if I'm wrong)

May I know if it is possible to unify all non-engine-specific schema handling in one place? e.g. the name conversion rule.
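The "one place" idea above could look roughly like this: a single naming helper that both the Spark and Flink converters delegate to, so record name and namespace are derived identically on every engine. This is a hypothetical sketch; none of these names are existing Hudi APIs.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass(frozen=True)
class AvroRecordNaming:
    """The name/namespace pair every engine-side converter would share."""
    record_name: str
    namespace: str

def naming_for_table(table_name: Optional[str]) -> AvroRecordNaming:
    """Single source of truth for the Avro record naming rule."""
    if table_name:
        return AvroRecordNaming(f"{table_name}_record", f"hoodie.{table_name}")
    # Fallback when no table name is available, mirroring a constant default
    # like Flink's "record".
    return AvroRecordNaming("record", "")

assert naming_for_table("tab") == AvroRecordNaming("tab_record", "hoodie.tab")
```

Keeping the rule in one engine-agnostic module (e.g. somewhere under hudi-common) would let Spark's SchemaConverters wrapper and Flink's AvroSchemaConverter stay consistent by construction rather than by convention.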

@codope
Member

codope commented Feb 2, 2023

@TengHuo Thanks for the detailed comment. I do agree that we should unify the schema. It bodes well for engine interoperability as well. Does this PR already cover the unification?
@danny0405 What do you think?

@TengHuo
Contributor Author

TengHuo commented Feb 2, 2023

@TengHuo Thanks for the detailed comment. I do agree that we should unify the schema. It bodes well for engine interoperability as well. Does this PR already cover the unification? @danny0405 What do you think?

@codope
Nope, this PR doesn't include schema unification. I was planning to fix an Avro schema namespace inconsistency on the Spark side only. Then I noticed there is different Avro schema name & namespace handling code between Flink and Spark.

So I think the fix in this PR can't solve this issue in all situations.

@danny0405
Contributor

I'm +1 too on unifying the Avro record schema namespace; let's make the namespace parametric for the Flink tool AvroSchemaConverter.

@TengHuo
Contributor Author

TengHuo commented Apr 13, 2023

This issue has been fixed in the latest master; the test case passes now. Details in #7284


Labels

area:schema Schema evolution and data types priority:critical Production degraded; pipelines stalled status:triaged Issue has been reviewed and categorized

Projects

Archived in project

Development

Successfully merging this pull request may close these issues.

[SUPPORT] Inconsistent reader and writer schema in HoodieAvroDataBlock cause exception

6 participants