[HUDI-5271] fix issue inconsistent reader and writer schema in HoodieAvroDataBlock #7307
Conversation
…AvroDataBlock Co-authored-by: voonhous <[email protected]>
And I found another similar issue at Line 189 in e088faa.

When creating a table from an empty path, the record name is set one way; however, if the path is an existing Hudi table, it is set differently. May I ask if this is an issue or by-design behaviour?
@TengHuo can you please rebase to the latest master and verify whether the fix is still relevant?
@alexeykudinkin got it, np. Let me rebase it to the latest master branch. |
Before rebasing, I have one thing I want to check with you. Last week, there was an issue about a similar exception with the Avro schema namespace, #7691. @danny0405 mentioned in that ticket that a constant namespace "record" is used on the Flink side, #7691 (comment). On the Spark side, I found we are using a different namespace pattern. May I ask which one we should follow? I think we need to keep it consistent between Spark and Flink.
Correct. We need to unify the schema handling across Flink and Spark integrations.
@danny0405 the problem is that we'd need to have different names, because names are used by Avro to look up the fields within the unions.
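To make the point above concrete, here is a minimal sketch (a simplified model, not the real Avro API) of why the names matter: Avro indexes record branches of a union by their full name, so a writer record whose full name is table-qualified cannot match a reader branch declared with a plain constant name. The table name `t1` and the full names below are hypothetical.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Simplified model (NOT the real Avro API) of how Avro matches a record
// against a union: record branches are indexed by their full name
// (namespace + "." + name), so a writer record named "hoodie.t1.t1_record"
// cannot match a reader union branch declared with the plain name "record".
class UnionLookupSketch {

  // Reader-side union branches, keyed by full name.
  static Map<String, String> readerUnion() {
    Map<String, String> branches = new LinkedHashMap<>();
    branches.put("null", "null branch");
    branches.put("record", "record branch with the constant Flink-style name");
    return branches;
  }

  // Name-based branch lookup, as Avro does for records inside unions.
  static boolean matchesBranch(String writerFullName) {
    return readerUnion().containsKey(writerFullName);
  }

  public static void main(String[] args) {
    // Writer used a table-qualified full name (Spark-style, hypothetical).
    System.out.println(matchesBranch("hoodie.t1.t1_record")); // false
    System.out.println(matchesBranch("record"));              // true
  }
}
```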
Agree @danny0405, I think it's better we unify Avro schema handling across Spark and Flink in Hudi. Currently, we have Avro schema tools classes on each side, and I noticed there is different behaviour when setting the name of a new Avro schema.

On the Spark side, the name and namespace of the Avro schema are exposed as method parameters:

```scala
/**
 * Converts a Spark SQL schema to a corresponding Avro schema.
 *
 * @since 2.4.0
 */
def toAvroType(catalystType: DataType,
               nullable: Boolean = false,
               recordName: String = "topLevelRecord",
               nameSpace: String = ""): Schema
```

reference: Line 154 in 41653fc
On the Flink side, it uses a constant name:

```java
/**
 * Converts Flink SQL {@link LogicalType} (can be nested) into an Avro schema.
 *
 * <p>Use "record" as the type name.
 *
 * @param schema the schema type, usually it should be the top level record type, e.g. not a
 *     nested type
 * @return Avro's {@link Schema} matching this logical type.
 */
public static Schema convertToSchema(LogicalType schema) {
  return convertToSchema(schema, "record");
}
```

reference: hudi/hudi-flink-datasource/hudi-flink/src/main/java/org/apache/hudi/util/AvroSchemaConverter.java Line 202 in 8ffcb2f
(Please correct me if I'm wrong.) May I know if it is possible to unify all non-engine-related schema handling in one place, e.g. the name conversion rule?
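The divergence between the two conventions quoted above can be shown by computing the full name each side would produce for the same table. The `fullName` helper below mirrors Avro's full-name rule (namespace + "." + name, or just the name when the namespace is empty); the concrete Spark-side values are hypothetical, not taken from Hudi's code.

```java
// Hedged sketch: compute the Avro full name that each engine-side convention
// would produce for the same table. fullName mirrors Avro's full-name rule;
// the Spark-side recordName/nameSpace values are hypothetical.
class FullNameSketch {

  static String fullName(String namespace, String name) {
    return (namespace == null || namespace.isEmpty()) ? name : namespace + "." + name;
  }

  public static void main(String[] args) {
    // Spark side: caller-supplied recordName and nameSpace parameters.
    String sparkName = fullName("hoodie.t1", "t1_record");
    // Flink side: constant name "record", no namespace.
    String flinkName = fullName("", "record");

    System.out.println(sparkName);                   // hoodie.t1.t1_record
    System.out.println(flinkName);                   // record
    System.out.println(sparkName.equals(flinkName)); // false -> names won't match
  }
}
```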
@TengHuo Thanks for the detailed comment. I do agree that we should unify the schema. It bodes well for engine interoperability as well. Does this PR already cover the unification? |
@codope So I think my fix in this PR can't solve this issue for all situations.
I'm +1 too on unifying the Avro record schema namespace; let's make the namespace parametric for the Flink tool.
This issue has been fixed in the latest master. This test case can pass now. Details in #7284
Troubleshooting details in this issue and fix: #7284
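The table-qualified record naming that the fix relies on can be sketched as follows. This is an illustrative sketch, not Hudi's actual implementation: the helper name, the `hoodie.<table>.<table>_record` format, and the sanitization rule are assumptions chosen to show the idea of deriving a deterministic full name from the table name so reader and writer schemas agree.

```java
// Hedged sketch: derive a deterministic Avro record name and namespace from
// the table name so reader and writer schemas carry the same full name.
// The exact format is illustrative, not copied from Hudi.
class QualifiedNameSketch {

  static String avroRecordQualifiedName(String tableName) {
    // Avro names may only contain [A-Za-z0-9_]; sanitize anything else.
    String sanitized = tableName.replaceAll("[^A-Za-z0-9_]", "_");
    return "hoodie." + sanitized + "." + sanitized + "_record";
  }

  public static void main(String[] args) {
    System.out.println(avroRecordQualifiedName("t1"));       // hoodie.t1.t1_record
    System.out.println(avroRecordQualifiedName("my-table")); // hoodie.my_table.my_table_record
  }
}
```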
Change Logs
- Fix `HoodieBaseRelation#convertToAvroSchema` to generate a qualified name according to the table name for the Avro schema
- Use `AvroConversionUtils#convertStructTypeToAvroSchema` in `HoodieBaseRelation#convertToAvroSchema` for struct schema to Avro schema conversion
- Add `TestMorTable` for verifying the issue is fixed

Impact
No public API changed.
Risk level (write none, low medium or high below)
low
Documentation Update
No doc or configuration changed.
Contributor's checklist