Conversation

@the-other-tim-brown (Contributor) commented Aug 12, 2025

Change Logs

This PR fixes a few issues discovered while trying to move the Copy-on-Write path to use the FileGroupReader for reading base files and merging with incoming records. The issues mainly stem from schema evolution cases.

Cases fixed:

  1. The Spark reader was not properly rewriting records in some type-promotion scenarios, so the validations were updated (see the sketch after this list)
  2. The Avro reader was not forcing a rewrite in some cases, requiring the validation to be updated to account for these evolutions
  3. The Avro reader was not passing the renamed columns to the transform
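
As an illustration of case 1, here is a minimal sketch of the kind of type-promotion rewrite check involved; the helper name and the promotion table are assumptions based on Avro's standard type-promotion rules, not the PR's actual code:

import org.apache.avro.Schema;

// Hypothetical sketch: decides whether a value written as writerType must be
// rewritten to match a promoted readerType, per Avro's promotion rules.
public class TypePromotionSketch {
  static boolean requiresRewrite(Schema.Type writerType, Schema.Type readerType) {
    if (writerType == readerType) {
      return false; // same type, nothing to rewrite
    }
    switch (readerType) {
      case LONG:
        return writerType == Schema.Type.INT;
      case FLOAT:
        return writerType == Schema.Type.INT || writerType == Schema.Type.LONG;
      case DOUBLE:
        return writerType == Schema.Type.INT
            || writerType == Schema.Type.LONG
            || writerType == Schema.Type.FLOAT;
      case STRING:
        return writerType == Schema.Type.BYTES;
      default:
        return false;
    }
  }

  public static void main(String[] args) {
    // float -> double is a promotion, so the stored value must be rewritten
    System.out.println(requiresRewrite(Schema.Type.FLOAT, Schema.Type.DOUBLE)); // true
  }
}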

Impact

Unblocks moving the writer path to reuse the same reader paths we use elsewhere in the code

Risk level (write none, low, medium or high below)

Low

Documentation Update

Describe any necessary documentation update if there is any new feature, config, or user-facing change. If not, put "none".

  • The config description must be updated if new configs are added or the default values of the configs are changed
  • Any new feature or user-facing change requires updating the Hudi website. Please create a Jira ticket, attach the ticket number here, and follow the instructions to make changes to the website.

Contributor's checklist

  • Read through contributor's guide
  • Change Logs and Impact were stated clearly
  • Adequate tests were added if applicable
  • CI passed

@github-actions github-actions bot added the size:M PR with lines of changes in (100, 300] label Aug 12, 2025
Comment on lines 147 to 148
fileOutputSchema = dataSchema;
renamedColumns = Collections.emptyMap();
Contributor
Note to myself: the FileGroupRecordBuffer handles the schema-on-read evolution with composeEvolvedSchemaTransformer for log blocks. Only parquet log blocks require calling readerContext.getFileRecordIterator before schema-on-read evolution is applied in FileGroupRecordBuffer, so there is no need to handle schema-on-read in this case.
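
A rough sketch of the flow this note describes, for illustration only; everything except the names quoted above (FileGroupRecordBuffer, composeEvolvedSchemaTransformer, readerContext.getFileRecordIterator) is an assumption:

import java.util.Iterator;
import java.util.function.UnaryOperator;

// Hypothetical sketch: most log blocks have the evolved-schema transform
// applied by the record buffer, while parquet log blocks arrive through the
// file record iterator with evolution already applied.
final class EvolutionWrapSketch {
  static <T> Iterator<T> maybeEvolve(Iterator<T> records,
                                     boolean isParquetLogBlock,
                                     UnaryOperator<T> evolvedSchemaTransform) {
    if (isParquetLogBlock) {
      // the file record iterator already handled schema-on-read evolution
      return records;
    }
    // wrap the iterator with the transform composed by the record buffer
    return new Iterator<T>() {
      @Override public boolean hasNext() { return records.hasNext(); }
      @Override public T next() { return evolvedSchemaTransform.apply(records.next()); }
    };
  }
}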

@yihua (Contributor) left a comment

LGTM overall

case DOUBLE:
    // To maintain precision, you need to convert Float -> String -> Double
-   return writerSchema.getType().equals(Schema.Type.FLOAT);
+   return writerSchema.getType().equals(Schema.Type.FLOAT) && !writerSchema.getType().equals(Schema.Type.STRING);
Contributor

How could a type equal FLOAT and also STRING?

Contributor (Author)

If we can get this PR in, I think areSchemasProjectionEquivalent is going to fit the needs here, and it has better testing. I will wait to see if this can be brought into a mergeable shape.

Contributor

Yeah, that PR has landed.

return Pair.of(requiredSchema, Collections.emptyMap());
}
long commitInstantTime = Long.parseLong(FSUtils.getCommitTime(path.getName()));
InternalSchema fileSchema = InternalSchemaCache.searchSchemaAndCache(commitInstantTime, metaClient);
Contributor

Seems not right: the search happens at the file-split level, so this would trigger the metaClient metadata file listing for every file slice read. Can we reuse the cache somewhere so it is shared by all the readers?

Contributor (Author)

Is there any example of how to do this? I noticed that this is how it is currently done in the merge path. This path will at least cache per JVM. There are some other cases where I see calls to InternalSchemaCache.getInternalSchemaByVersionId, but that skips the cache entirely, so the commit metadata is parsed per file.
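
For illustration, here is a minimal sketch of the per-JVM caching behavior described above; the class, key format, and map-based approach are assumptions, not the actual InternalSchemaCache internals:

import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.LongFunction;

// Hypothetical sketch: a process-wide cache so the schema for a given
// (table, commit time) pair is resolved once per JVM rather than once per
// file slice.
final class JvmSchemaCacheSketch<S> {
  private static final Map<String, Object> CACHE = new ConcurrentHashMap<>();

  @SuppressWarnings("unchecked")
  S getOrLoad(String tableBasePath, long commitTime, LongFunction<S> loader) {
    String key = tableBasePath + "@" + commitTime;
    // computeIfAbsent lets concurrent readers share a single expensive load
    // (e.g. parsing commit metadata) instead of each doing its own
    return (S) CACHE.computeIfAbsent(key, k -> loader.apply(commitTime));
  }
}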

Contributor

Looks like we already did this in FileGroupRecordBuffer.composeEvolvedSchemaTransformer, and we have optimized the logic in #13525 to get rid of the timeline listing, so this should be good now.

Contributor

Yes, the existing logic of schema evolution on read in other places follows the same code logic, so this is OK in the sense that it brings feature parity and does not introduce regression.

I think what makes more sense is to have a schema history (schemas for ranges of completion/instant time, e.g., schema1: ts1-ts100, schema2: ts101-ts1000, etc.) constructed on the driver and distributed to the executors. This schema history can be stored under .hoodie so that one file read gets the whole schema history, and executors do not pay the cost of scanning commit metadata or reading the schema from the file (assuming that the file schema is based on the writer/table schema of the commit). This essentially needs a new schema system/abstraction, which is under the scope of RFC-88 @danny0405
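
Purely as a hypothetical sketch of that lookup (all names are made up and a String stands in for the schema; the real design is in scope for RFC-88):

import java.util.Map;
import java.util.TreeMap;

// Hypothetical sketch: schemas keyed by the first instant time at which they
// took effect; a floor lookup resolves the schema in effect for any instant
// (schema1: ts1-ts100, schema2: ts101-ts1000, ...).
final class SchemaHistorySketch {
  private final TreeMap<Long, String> schemasByStartTs = new TreeMap<>();

  void register(long startTs, String schema) {
    schemasByStartTs.put(startTs, schema);
  }

  String schemaAt(long instantTs) {
    Map.Entry<Long, String> e = schemasByStartTs.floorEntry(instantTs);
    return e == null ? null : e.getValue();
  }
}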

Contributor

Yes, we have a plan to re-implement schema evolution based on the new schema abstraction in the 1.2 release.

@yihua (Contributor) left a comment

LGTM

@yihua (Contributor) commented Aug 13, 2025

@the-other-tim-brown you can decide whether the schema utils newly available on master can be reused before merging this PR.

@the-other-tim-brown the-other-tim-brown force-pushed the HUDI-9705-minor-bug-fixdes branch from 1a11deb to 145cb8e on August 13, 2025 12:13
@github-actions github-actions bot added size:L PR with lines of changes in (300, 1000] and removed size:M PR with lines of changes in (100, 300] labels Aug 13, 2025
@hudi-bot (Collaborator) commented
CI report:

Bot commands supported by @hudi-bot:
  • @hudi-bot run azure: re-run the last Azure build

@yihua (Contributor) left a comment

LGTM

@yihua yihua merged commit ee485c2 into apache:master Aug 13, 2025
61 checks passed
alexr17 pushed a commit to alexr17/hudi that referenced this pull request Aug 25, 2025