[HUDI-5835] After performing the update operation, the hoodie table cannot be read normally by spark #8026
Conversation
@alexeykudinkin @XuQianJin-Stars
+ val avroNameAndSpace = AvroConversionUtils.getAvroRecordNameAndNamespace(tableName)
  val avroSchema = internalSchemaOpt.map { is =>
-   AvroInternalSchemaConverter.convert(is, "schema")
+   AvroInternalSchemaConverter.convert(is, avroNameAndSpace._2 + "." + avroNameAndSpace._1)
we should pass avroSchema.getFullName to this convert call; otherwise, the schema produced by the converter will be incompatible with the real schema of the Hudi table.
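A minimal sketch of the suggested shape; `tableAvroSchema` here is a hypothetical handle to the table's resolved Avro schema, not an identifier from this diff:

```scala
// Hypothetical: qualify the converted schema with the table schema's own full name,
// so that the read schema and the write schema agree on record name/namespace.
val avroSchema = internalSchemaOpt.map { is =>
  AvroInternalSchemaConverter.convert(is, tableAvroSchema.getFullName)
}
```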
@xiarixiaoyao can you please share the stacktrace you've observed? Avro name/namespaces shouldn't matter in that case.
For context: this name/namespace is actually generated from the table name, so the qualified name is no better than the previous one (using just "schema").
We need to understand the real root cause of the issue.
@alexeykudinkin thanks for your review.
- Schema evolution has nothing to do with this scenario, since schema evolution calls HoodieAvroUtils.rewriteRecordWithNewSchema to unify the namespace. I changed this line only to ensure that the namespaces of the read schema and the write schema are consistent.
- The namespace of the schema Hudi uses when writing the log comes from tableName, but the namespace of the read schema is "schema".
- When schema evolution is not enabled, different namespaces produce different full names for decimal types, and Avro is name-sensitive. We should keep the read schema and the write schema in the same namespace, just as previous versions of Hudi did.
E.g., for a column ff decimal(38, 10), the Hudi log write schema will be:
{"name":"ff","type":[{"type":"fixed","name":"fixed","namespace":"hoodie.h0.h0_record.ff","size":16,"logicalType":"decimal","precision":38,"scale":10},"null"]}
while the Spark read schema for ff will be:
{"name":"ff","type":[{"type":"fixed","name":"fixed","namespace":"Record.ff","size":16,"logicalType":"decimal","precision":38,"scale":10},"null"]}
The read schema and the write schema are incompatible, so we cannot use the read schema to read the log. Previous versions of Hudi did not have this problem:
Caused by: org.apache.avro.AvroTypeException: Found hoodie.h0.h0_record.ff.fixed, expecting union
at org.apache.avro.io.ResolvingDecoder.doAction(ResolvingDecoder.java:308)
at org.apache.avro.io.parsing.Parser.advance(Parser.java:86)
at org.apache.avro.io.ResolvingDecoder.readIndex(ResolvingDecoder.java:275)
at org.apache.avro.generic.GenericDatumReader.readWithoutConversion(GenericDatumReader.java:188)
at org.apache.avro.generic.GenericDatumReader.read(GenericDatumReader.java:161)
at org.apache.avro.generic.GenericDatumReader.readField(GenericDatumReader.java:260)
at org.apache.avro.generic.GenericDatumReader.readRecord(GenericDatumReader.java:248)
at org.apache.avro.generic.GenericDatumReader.readWithoutConversion(GenericDatumReader.java:180)
at org.apache.avro.generic.GenericDatumReader.read(GenericDatumReader.java:161)
at org.apache.avro.generic.GenericDatumReader.read(GenericDatumReader.java:154)
at org.apache.hudi.common.table.log.block.HoodieAvroDataBlock$RecordIterator.next(HoodieAvroDataBlock.java:201)
at org.apache.hudi.common.table.log.block.HoodieAvroDataBlock$RecordIterator.next(HoodieAvroDataBlock.java:149)
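For illustration, the incompatibility can be reproduced with plain Avro, outside of Hudi. A minimal sketch, assuming Avro's generic read/write API (the object name and the 16-byte payload are illustrative):

```scala
import java.io.ByteArrayOutputStream

import org.apache.avro.Schema
import org.apache.avro.generic.{GenericData, GenericDatumReader, GenericDatumWriter}
import org.apache.avro.io.{DecoderFactory, EncoderFactory}

object AvroNameSensitivityDemo {
  def main(args: Array[String]): Unit = {
    // Write schema: fixed decimal under the table-derived namespace (as in the Hudi log above).
    val writerSchema = new Schema.Parser().parse(
      """{"type":"record","name":"h0_record","namespace":"hoodie.h0","fields":[
        |  {"name":"ff","type":[{"type":"fixed","name":"fixed","namespace":"hoodie.h0.h0_record.ff",
        |    "size":16,"logicalType":"decimal","precision":38,"scale":10},"null"]}]}""".stripMargin)
    // Read schema: identical shape, but the fixed type lives under a different namespace.
    val readerSchema = new Schema.Parser().parse(
      """{"type":"record","name":"Record","fields":[
        |  {"name":"ff","type":[{"type":"fixed","name":"fixed","namespace":"Record.ff",
        |    "size":16,"logicalType":"decimal","precision":38,"scale":10},"null"]}]}""".stripMargin)

    // Encode one record with the write schema.
    val fixedSchema = writerSchema.getField("ff").schema().getTypes.get(0)
    val record = new GenericData.Record(writerSchema)
    record.put("ff", new GenericData.Fixed(fixedSchema, new Array[Byte](16)))
    val out = new ByteArrayOutputStream()
    val encoder = EncoderFactory.get().binaryEncoder(out, null)
    new GenericDatumWriter[GenericData.Record](writerSchema).write(record, encoder)
    encoder.flush()

    // Decode with the mismatched read schema. Avro resolves named types (record/enum/fixed)
    // inside unions by full name, so this should fail with the AvroTypeException shown above.
    val decoder = DecoderFactory.get().binaryDecoder(out.toByteArray, null)
    new GenericDatumReader[AnyRef](writerSchema, readerSchema).read(null, decoder)
  }
}
```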
  val (recordName, namespace) = AvroConversionUtils.getAvroRecordNameAndNamespace(tableName)
  val avroSchema = sparkAdapter.getAvroSchemaConverters.toAvroType(structSchema, nullable = false, recordName, namespace)
  getAvroSchemaWithDefaults(avroSchema, structSchema)
}
@xiarixiaoyao I still don't understand why we need to set defaults in the schema. Can you please elaborate on that one?
Yes: schemaConverters.toAvroType loses the default values, see #2765. In the schema evolution scenario the default value is very important, and the Avro schema cares about it.
E.g., if we add a new column newCol: string to the table, the default value of newCol will be null. After schemaConverters.toAvroType, the default value of newCol is lost. If we now use this schema to read an old Avro log (which does not contain the column newCol), Avro will complain that there is no default value and throw an exception.
The root cause of #7915 is that we lost the default value in the conversion process.
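A minimal sketch of that failure mode using plain Avro (the Record/newCol names are illustrative, not from this PR):

```scala
import java.io.ByteArrayOutputStream

import org.apache.avro.Schema
import org.apache.avro.generic.{GenericData, GenericDatumReader, GenericDatumWriter}
import org.apache.avro.io.{DecoderFactory, EncoderFactory}

object AvroDefaultValueDemo {
  // Encode one record without newCol, then decode it with the given read schema.
  private def readOldRecordWith(readerJson: String): AnyRef = {
    val writerSchema = new Schema.Parser().parse(
      """{"type":"record","name":"Record","fields":[{"name":"a","type":"string"}]}""")
    val record = new GenericData.Record(writerSchema)
    record.put("a", "x")
    val out = new ByteArrayOutputStream()
    val encoder = EncoderFactory.get().binaryEncoder(out, null)
    new GenericDatumWriter[GenericData.Record](writerSchema).write(record, encoder)
    encoder.flush()
    val decoder = DecoderFactory.get().binaryDecoder(out.toByteArray, null)
    new GenericDatumReader[AnyRef](writerSchema, new Schema.Parser().parse(readerJson)).read(null, decoder)
  }

  def main(args: Array[String]): Unit = {
    // With "default": null, the old record resolves and newCol is filled in as null.
    println(readOldRecordWith(
      """{"type":"record","name":"Record","fields":[
        |  {"name":"a","type":"string"},
        |  {"name":"newCol","type":["null","string"],"default":null}]}""".stripMargin))

    // Without a default, schema resolution fails, e.g.:
    //   org.apache.avro.AvroTypeException: Found Record, expecting Record, missing required field newCol
    println(readOldRecordWith(
      """{"type":"record","name":"Record","fields":[
        |  {"name":"a","type":"string"},
        |  {"name":"newCol","type":["null","string"]}]}""".stripMargin))
  }
}
```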
alexeykudinkin left a comment
@xiarixiaoyao raised a few questions, let's hold on landing until we clear these up.
    }
  }

  val avroNameAndSpace = AvroConversionUtils.getAvroRecordNameAndNamespace(tableName)
nit: we can move this inside the map and also make it val (name, namespace) = getAvroRecordNameAndNamespace(...)
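A sketch of what the suggested refactor could look like (same calls as in the diffs above):

```scala
val avroSchema = internalSchemaOpt.map { is =>
  val (name, namespace) = AvroConversionUtils.getAvroRecordNameAndNamespace(tableName)
  AvroInternalSchemaConverter.convert(is, namespace + "." + name)
}
```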
fixed
force-pushed from 0b0ca82 to edc3bd9
@hudi-bot run azure

@hudi-bot run azure
force-pushed from 180059d to 56c75fe
[HUDI-5835] After performing the update operation, the hoodie table cannot be read normally by spark (apache#8026) (cherry picked from commit 31e94ab)
Change Logs
Fix the bug: after performing the update operation, the Hudi table cannot be read normally by Spark.
Impact
If a user creates a table with a decimal-type column, then after performing an update operation the Hudi table cannot be read normally by Spark.
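A hypothetical reproduction sketch in Spark SQL (MOR table; the table and column names are illustrative, chosen to match the namespaces in the discussion above):

```scala
// Hypothetical reproduction; h0/ff are illustrative names.
spark.sql("create table h0 (id int, ff decimal(38, 10)) using hudi tblproperties (type = 'mor', primaryKey = 'id')")
spark.sql("insert into h0 values (1, 100.05)")
// The update writes a log file whose Avro namespace is derived from the table name.
spark.sql("update h0 set ff = 200.10 where id = 1")
// Before this fix, reading back fails with:
//   org.apache.avro.AvroTypeException: Found hoodie.h0.h0_record.ff.fixed, expecting union
spark.sql("select * from h0").show()
```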
Risk level (write none, low medium or high below)
high
Documentation Update
Describe any necessary documentation update if there is any new feature, config, or user-facing change. Any new feature or user-facing change requires updating the Hudi website: please create a Jira ticket, attach the ticket number here, and follow the instruction to make changes to the website.
Contributor's checklist