
Conversation

@xiarixiaoyao (Contributor)

Change Logs

Fix the bug: after performing an update operation, the Hoodie table cannot be read normally by Spark.

Impact

If a user creates a table with a decimal column, the Hoodie table cannot be read normally by Spark after an update operation.

Risk level (write none, low, medium or high below)

high

Documentation Update

Describe any necessary documentation update if there is any new feature, config, or user-facing change

  • The config description must be updated if new configs are added or the default value of a config is changed
  • Any new feature or user-facing change requires updating the Hudi website. Please create a Jira ticket, attach the ticket number here, and follow the instructions to make changes to the website.

Contributor's checklist

  • Read through contributor's guide
  • Change Logs and Impact were stated clearly
  • Adequate tests were added if applicable
  • CI passed

@xiarixiaoyao (Contributor, Author)

@alexeykudinkin @XuQianJin-Stars
could you please help me review this PR? Thanks.

danny0405 added labels: area:schema (Schema evolution and data types), engine:spark (Spark integration), engine:flink (Flink integration) (Feb 23, 2023)
val avroNameAndSpace = AvroConversionUtils.getAvroRecordNameAndNamespace(tableName)
val avroSchema = internalSchemaOpt.map { is =>
-  AvroInternalSchemaConverter.convert(is, "schema")
+  AvroInternalSchemaConverter.convert(is, avroNameAndSpace._2 + "." + avroNameAndSpace._1)
xiarixiaoyao (Contributor, Author):

We should pass avroSchema.getFullName to this convert call. Otherwise, the schema produced by the converter will be incompatible with the real schema of the Hoodie table.
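
For illustration, here is roughly what the qualified name passed to convert evaluates to. The table name h0 and the resulting namespace are taken from the schemas quoted later in this thread, and the (recordName, namespace) tuple layout matches how getAvroRecordNameAndNamespace is destructured elsewhere in this PR:

// Hypothetical walk-through for a table named "h0", matching the
// "hoodie.h0.h0_record" namespace visible in the write schema below.
val avroNameAndSpace = ("h0_record", "hoodie.h0") // assumed (recordName, namespace)
val qualifiedName = avroNameAndSpace._2 + "." + avroNameAndSpace._1
// => "hoodie.h0.h0_record": the record's Avro full name, so nested fixed types
// get namespaces like "hoodie.h0.h0_record.ff" instead of "schema.ff",
// matching what the writer produced.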

alexeykudinkin (Contributor):

@xiarixiaoyao can you please share the stacktrace you've observed? Avro name/namespaces shouldn't matter in that case.

alexeykudinkin (Contributor):

For context: this name/namespace is actually generated from the table name, so the qualified name is no better than the previous one (which used just "schema").

We need to understand the real root cause of the issue.

xiarixiaoyao (Contributor, Author):

@alexeykudinkin thanks for your review.

  1. Schema evolution has nothing to do with this scenario, since schema evolution calls HoodieAvroUtils.rewriteRecordWithNewSchema to unify the namespace. I changed this line just to ensure that the namespaces of the read schema and the write schema are consistent.
  2. The namespace of the schema Hudi uses when writing the log is derived from the table name, but the namespace of the read schema is "schema".
  3. When schema evolution is not enabled, different namespaces produce different full names for decimal types, and Avro is name-sensitive. We should keep the read schema and the write schema in the same namespace, just as previous versions of Hudi did.
     e.g. for a column ff decimal(38, 10), the Hudi log write schema will be:
     {"name":"ff","type":[{"type":"fixed","name":"fixed","namespace":"hoodie.h0.h0_record.ff","size":16,"logicalType":"decimal","precision":38,"scale":10},"null"]}

while the Spark read schema will be:
{"name":"ff","type":[{"type":"fixed","name":"fixed","namespace":"Record.ff","size":16,"logicalType":"decimal","precision":38,"scale":10},"null"]}

The read schema and the write schema are incompatible, so we cannot use the read schema to read the log. Previous versions of Hudi did not have this problem:

Caused by: org.apache.avro.AvroTypeException: Found hoodie.h0.h0_record.ff.fixed, expecting union
at org.apache.avro.io.ResolvingDecoder.doAction(ResolvingDecoder.java:308)
at org.apache.avro.io.parsing.Parser.advance(Parser.java:86)
at org.apache.avro.io.ResolvingDecoder.readIndex(ResolvingDecoder.java:275)
at org.apache.avro.generic.GenericDatumReader.readWithoutConversion(GenericDatumReader.java:188)
at org.apache.avro.generic.GenericDatumReader.read(GenericDatumReader.java:161)
at org.apache.avro.generic.GenericDatumReader.readField(GenericDatumReader.java:260)
at org.apache.avro.generic.GenericDatumReader.readRecord(GenericDatumReader.java:248)
at org.apache.avro.generic.GenericDatumReader.readWithoutConversion(GenericDatumReader.java:180)
at org.apache.avro.generic.GenericDatumReader.read(GenericDatumReader.java:161)
at org.apache.avro.generic.GenericDatumReader.read(GenericDatumReader.java:154)
at org.apache.hudi.common.table.log.block.HoodieAvroDataBlock$RecordIterator.next(HoodieAvroDataBlock.java:201)
at org.apache.hudi.common.table.log.block.HoodieAvroDataBlock$RecordIterator.next(HoodieAvroDataBlock.java:149)
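
A minimal standalone sketch (plain Avro, not Hudi code) of the name sensitivity described above. Avro identifies fixed types by their full name (namespace plus name), so the two decimal schemas quoted earlier cannot resolve against each other:

import org.apache.avro.Schema

// Parse the two fixed types with the namespaces quoted above.
val writeFixed = new Schema.Parser().parse(
  """{"type":"fixed","name":"fixed","namespace":"hoodie.h0.h0_record.ff","size":16,"logicalType":"decimal","precision":38,"scale":10}""")
val readFixed = new Schema.Parser().parse(
  """{"type":"fixed","name":"fixed","namespace":"Record.ff","size":16,"logicalType":"decimal","precision":38,"scale":10}""")

println(writeFixed.getFullName) // hoodie.h0.h0_record.ff.fixed
println(readFixed.getFullName)  // Record.ff.fixed
// Schema resolution matches named types by full name, so a reader expecting
// Record.ff.fixed cannot resolve data written as hoodie.h0.h0_record.ff.fixed,
// which surfaces as the AvroTypeException in the stack trace above.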

val (recordName, namespace) = AvroConversionUtils.getAvroRecordNameAndNamespace(tableName)
val avroSchema = sparkAdapter.getAvroSchemaConverters.toAvroType(structSchema, nullable = false, recordName, namespace)
getAvroSchemaWithDefaults(avroSchema, structSchema)
}
xiarixiaoyao (Contributor, Author), Feb 23, 2023:

@qidian99 we keep the default values in the Avro schema because we need them for schema evolution. This change also fixes bug #7915.
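
To illustrate the idea (a sketch of the concept only, not Hudi's actual getAvroSchemaWithDefaults implementation), a defaults-restoring pass can rebuild each field and attach an explicit null default to every nullable field the converter dropped:

import scala.collection.JavaConverters._
import org.apache.avro.{JsonProperties, Schema}

// Illustrative only: re-create each top-level field, giving nullable
// (union-with-null) fields an explicit null default so that old logs
// written without those fields remain readable. (Note: per the Avro spec,
// the union's first branch should match the type of the default value.)
def withNullDefaults(schema: Schema): Schema = {
  val fields = schema.getFields.asScala.map { f =>
    val isNullable = f.schema().getType == Schema.Type.UNION &&
      f.schema().getTypes.asScala.exists(_.getType == Schema.Type.NULL)
    val default = if (isNullable) JsonProperties.NULL_VALUE else f.defaultVal()
    new Schema.Field(f.name(), f.schema(), f.doc(), default)
  }
  Schema.createRecord(schema.getName, schema.getDoc, schema.getNamespace,
    schema.isError, fields.asJava)
}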

alexeykudinkin (Contributor):

@xiarixiaoyao I still don't understand why we need to set defaults in the schema. Can you please elaborate on that?

xiarixiaoyao (Contributor, Author):

Yes, schemaConverters.toAvroType loses the default values; see #2765.
In the schema-evolution scenario the default values are very important, and the Avro schema depends on them.

e.g. if we add a new column newCol: string to the table, its default value will be null;
after schemaConverters.toAvroType, the default value of newCol is lost.
If we then use this schema to read an old Avro log (one that does not contain the column newCol), Avro will complain that there is no default value and throw an exception.
The root cause of #7915 is that we lost the default values in the conversion process.
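
A minimal sketch (plain Avro, outside Hudi) of why the lost default breaks reading old logs: without a default on newCol, the reader schema cannot resolve records written before the column existed:

import org.apache.avro.{SchemaBuilder, SchemaCompatibility}

// Writer schema: the old log, written before newCol existed.
val writerSchema = SchemaBuilder.record("rec").fields()
  .requiredString("id")
  .endRecord()

// Reader schema as produced when the conversion drops the default of newCol.
val readerNoDefault = SchemaBuilder.record("rec").fields()
  .requiredString("id")
  .name("newCol").`type`().unionOf().nullType().and().stringType().endUnion().noDefault()
  .endRecord()

// Reader schema that keeps the null default.
val readerWithDefault = SchemaBuilder.record("rec").fields()
  .requiredString("id")
  .name("newCol").`type`().unionOf().nullType().and().stringType().endUnion().nullDefault()
  .endRecord()

println(SchemaCompatibility.checkReaderWriterCompatibility(readerNoDefault, writerSchema).getType)
// INCOMPATIBLE: newCol has no default, so old records cannot be resolved.
println(SchemaCompatibility.checkReaderWriterCompatibility(readerWithDefault, writerSchema).getType)
// COMPATIBLE: the missing newCol is filled in with the null default.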

xiarixiaoyao changed the title from "[HUDI-5835] After performing the update operation, the hodie table cannot be read normally by spark" to "[HUDI-5835] After performing the update operation, the hoodie table cannot be read normally by spark" (Feb 23, 2023)
alexeykudinkin (Contributor) left a comment:

@xiarixiaoyao raised a few questions; let's hold off on landing this until we clear them up.

}
}

val avroNameAndSpace = AvroConversionUtils.getAvroRecordNameAndNamespace(tableName)
Contributor:

nit: we can move this inside the map and also make it val (name, namespace) = getAvroRecordNameAndNamespace(...)
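
For clarity, the suggested refactor would look roughly like this (surrounding code elided, a sketch only):

val avroSchema = internalSchemaOpt.map { is =>
  val (name, namespace) = AvroConversionUtils.getAvroRecordNameAndNamespace(tableName)
  AvroInternalSchemaConverter.convert(is, namespace + "." + name)
}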

xiarixiaoyao (Contributor, Author):

Fixed.

alexeykudinkin added labels: priority:blocker (Production down; release blocker), release-0.13.1 (Feb 24, 2023)
xiarixiaoyao force-pushed the logcc branch 3 times, most recently from 0b0ca82 to edc3bd9 (February 25, 2023 02:13)
xiarixiaoyao (Contributor, Author):

@hudi-bot run azure

xiarixiaoyao (Contributor, Author):

@hudi-bot run azure

xiarixiaoyao force-pushed the logcc branch 3 times, most recently from 180059d to 56c75fe (February 27, 2023 10:08)
hudi-bot (Collaborator):

CI report:

Bot commands: @hudi-bot supports the following commands:
  • @hudi-bot run azure: re-run the last Azure build

xiarixiaoyao merged commit 31e94ab into apache:master (Feb 28, 2023)
danny0405 pushed a commit to danny0405/hudi that referenced this pull request Mar 23, 2023
…annot be read normally by spark (apache#8026)

(cherry picked from commit 31e94ab)
danny0405 pushed a commit to danny0405/hudi that referenced this pull request Mar 23, 2023
…annot be read normally by spark (apache#8026)

(cherry picked from commit 31e94ab)
nsivabalan pushed a commit to nsivabalan/hudi that referenced this pull request Mar 23, 2023
…annot be read normally by spark (apache#8026)

(cherry picked from commit 31e94ab)
nsivabalan pushed a commit to nsivabalan/hudi that referenced this pull request Mar 23, 2023
…annot be read normally by spark (apache#8026)

(cherry picked from commit 31e94ab)
fengjian428 pushed a commit to fengjian428/hudi that referenced this pull request Apr 5, 2023
h1ap pushed a commit to h1ap/hudi that referenced this pull request Apr 12, 2023
…annot be read normally by spark (apache#8026)

(cherry picked from commit 31e94ab)

# Conflicts:
#	hudi-spark-datasource/hudi-spark-common/src/main/scala/org/apache/hudi/HoodieBaseRelation.scala
stayrascal pushed a commit to stayrascal/hudi that referenced this pull request Apr 20, 2023
KnightChess pushed a commit to KnightChess/hudi that referenced this pull request Jan 2, 2024
…annot be read normally by spark (apache#8026)

(cherry picked from commit 31e94ab)

Labels

area:schema (Schema evolution and data types), engine:flink (Flink integration), engine:spark (Spark integration), priority:blocker (Production down; release blocker)
