-
Notifications
You must be signed in to change notification settings - Fork 2.5k
[HUDI-9705] Fix bugs in spark and avro reader contexts for type promotion and field renaming #13714
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Changes from all commits
4aba220
35b44f2
145cb8e
8c8af8f
3c1e63d
59234fd
f228278
3355983
4cb2fe5
377ee69
File filter
Filter by extension
Conversations
Jump to
Diff view
Diff view
There are no files selected for viewing
| Original file line number | Diff line number | Diff line change |
|---|---|---|
|
|
@@ -24,16 +24,21 @@ | |
| import org.apache.hudi.common.config.RecordMergeMode; | ||
| import org.apache.hudi.common.config.TypedProperties; | ||
| import org.apache.hudi.common.engine.HoodieReaderContext; | ||
| import org.apache.hudi.common.fs.FSUtils; | ||
| import org.apache.hudi.common.model.HoodieRecord; | ||
| import org.apache.hudi.common.model.HoodieRecordMerger; | ||
| import org.apache.hudi.common.table.HoodieTableConfig; | ||
| import org.apache.hudi.common.table.HoodieTableMetaClient; | ||
| import org.apache.hudi.common.table.read.buffer.PositionBasedFileGroupRecordBuffer; | ||
| import org.apache.hudi.common.util.InternalSchemaCache; | ||
| import org.apache.hudi.common.util.Option; | ||
| import org.apache.hudi.common.util.VisibleForTesting; | ||
| import org.apache.hudi.common.util.collection.Pair; | ||
| import org.apache.hudi.common.util.collection.Triple; | ||
| import org.apache.hudi.internal.schema.InternalSchema; | ||
| import org.apache.hudi.internal.schema.action.InternalSchemaMerger; | ||
| import org.apache.hudi.internal.schema.convert.AvroInternalSchemaConverter; | ||
| import org.apache.hudi.storage.StoragePath; | ||
|
|
||
| import org.apache.avro.Schema; | ||
|
|
||
|
|
@@ -76,22 +81,24 @@ public class FileGroupReaderSchemaHandler<T> { | |
|
|
||
| protected final TypedProperties properties; | ||
| private final DeleteContext deleteContext; | ||
| private final HoodieTableMetaClient metaClient; | ||
|
|
||
| public FileGroupReaderSchemaHandler(HoodieReaderContext<T> readerContext, | ||
| Schema tableSchema, | ||
| Schema requestedSchema, | ||
| Option<InternalSchema> internalSchemaOpt, | ||
| HoodieTableConfig hoodieTableConfig, | ||
| TypedProperties properties) { | ||
| TypedProperties properties, | ||
| HoodieTableMetaClient metaClient) { | ||
danny0405 marked this conversation as resolved.
Show resolved
Hide resolved
|
||
| this.properties = properties; | ||
| this.readerContext = readerContext; | ||
| this.tableSchema = tableSchema; | ||
| this.requestedSchema = AvroSchemaCache.intern(requestedSchema); | ||
| this.hoodieTableConfig = hoodieTableConfig; | ||
| this.hoodieTableConfig = metaClient.getTableConfig(); | ||
| this.deleteContext = new DeleteContext(properties, tableSchema); | ||
| this.requiredSchema = AvroSchemaCache.intern(prepareRequiredSchema(this.deleteContext)); | ||
| this.internalSchema = pruneInternalSchema(requiredSchema, internalSchemaOpt); | ||
| this.internalSchemaOpt = getInternalSchemaOpt(internalSchemaOpt); | ||
| this.metaClient = metaClient; | ||
| } | ||
|
|
||
| public Schema getTableSchema() { | ||
|
|
@@ -125,6 +132,18 @@ public DeleteContext getDeleteContext() { | |
| return deleteContext; | ||
| } | ||
|
|
||
| public Pair<Schema, Map<String, String>> getRequiredSchemaForFileAndRenamedColumns(StoragePath path) { | ||
| if (internalSchema.isEmptySchema()) { | ||
| return Pair.of(requiredSchema, Collections.emptyMap()); | ||
| } | ||
| long commitInstantTime = Long.parseLong(FSUtils.getCommitTime(path.getName())); | ||
| InternalSchema fileSchema = InternalSchemaCache.searchSchemaAndCache(commitInstantTime, metaClient); | ||
|
Contributor
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. seems not right, the search happens in file split level, this would trigger the metaClient metadata file listing for every file slice read. Can we reuse the cache somewhere and shared by all the readers?
Contributor
Author
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Is there any example of how to do this? I noticed that this is how it is currently done in the merge path. This path will at least cache per JVM. There are some other cases where I see calls to
Contributor
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Looks like we already did this in
Contributor
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Yes, the existing logic of schema evolution on read in other places follows the same code logic, so this is OK in the sense that it brings feature parity and does not introduce regression. I think what makes more sense is to have a schema history (schemas for range of completion/instant time, e.g., schema1: ts1-ts100, schema2: ts101-ts1000, etc.) constructed on driver and distribute that to executors. This schema history can be stored under
Contributor
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. yes, we have a plan to re-impl the schema evolution based on new schema abstraction in 1.2 release. |
||
| Pair<InternalSchema, Map<String, String>> mergedInternalSchema = new InternalSchemaMerger(fileSchema, internalSchema, | ||
| true, false, false).mergeSchemaGetRenamed(); | ||
| Schema mergedAvroSchema = AvroSchemaCache.intern(AvroInternalSchemaConverter.convert(mergedInternalSchema.getLeft(), requiredSchema.getFullName())); | ||
| return Pair.of(mergedAvroSchema, mergedInternalSchema.getRight()); | ||
| } | ||
|
|
||
| private InternalSchema pruneInternalSchema(Schema requiredSchema, Option<InternalSchema> internalSchemaOption) { | ||
| if (!internalSchemaOption.isPresent()) { | ||
| return InternalSchema.getEmptyInternalSchema(); | ||
|
|
||
Uh oh!
There was an error while loading. Please reload this page.