Skip to content

[HUDI-9316] Add support for creating iterator of HoodieRecord from FGReader#13314

Merged
nsivabalan merged 7 commits intoapache:masterfrom
the-other-tim-brown:HUDI-9316-fgreader
May 20, 2025
Merged

[HUDI-9316] Add support for creating iterator of HoodieRecord from FGReader#13314
nsivabalan merged 7 commits intoapache:masterfrom
the-other-tim-brown:HUDI-9316-fgreader

Conversation

@the-other-tim-brown
Copy link
Contributor

@the-other-tim-brown the-other-tim-brown commented May 17, 2025

Change Logs

  • Adds support for generating an iterator of HoodieRecord instead of simply the engine record in the FileGroupReader code. This will aide in migrating existing uses of LogScanner and other reader implementations that expect a HoodieRecord as the output.

Impact

  • Aides in the migration to FileGroupReader in other code paths

Risk level (write none, low medium or high below)

Low

Documentation Update

Describe any necessary documentation update if there is any new feature, config, or user-facing change. If not, put "none".

  • The config description must be updated if new configs are added or the default value of the configs are changed
  • Any new feature or user-facing change requires updating the Hudi website. Please create a Jira ticket, attach the
    ticket number here and follow the instruction to make
    changes to the website.

Contributor's checklist

  • Read through contributor's guide
  • Change Logs and Impact were stated clearly
  • Adequate tests were added if applicable
  • CI passed

@github-actions github-actions bot added the size:M PR with lines of changes in (100, 300] label May 17, 2025
HoodieKey hoodieKey = new HoodieKey(bufferedRecord.getRecordKey(), partitionPath);
if (bufferedRecord.isDelete()) {
return new HoodieEmptyRecord<>(
new HoodieKey(bufferedRecord.getRecordKey(), null),
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is this a bug fix?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Previously we did not have a need for the hoodie record outside of the merger fallback code so not a bug but now we have another use case with the new iterator so this partition field should be set.

readerContext.constructHoodieRecord(olderRecord), readerContext.getSchemaFromBufferRecord(olderRecord),
readerContext.constructHoodieRecord(newerRecord), readerContext.getSchemaFromBufferRecord(newerRecord),
readerContext.constructHoodieRecord(olderRecord, partitionPath), readerContext.getSchemaFromBufferRecord(olderRecord),
readerContext.constructHoodieRecord(newerRecord, partitionPath), readerContext.getSchemaFromBufferRecord(newerRecord),
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is it possible we put the partitionPath into the readerContext so there is no need to change these apis.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, I was wondering something similar. If we want to say that the readercontext is built per file group we can do this but if we want to share the reader context between file groups we'll need to keep it in its current form. I don't have a strong opinion either way so open to ideas here about how the class should be used.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actually, looking into the current class I see that there is hasLogFiles which implies the class is currently going to only work per file slice so I will move the partition path to keep the api the same

@the-other-tim-brown
Copy link
Contributor Author

@hudi-bot run azure

assertEquals(expectedRecords.size(), actualRecordList.size());
assertEquals(new HashSet<>(expectedRecords), new HashSet<>(actualRecordList));
// validate records can be read from file group as HoodieRecords
actualRecordList = convertHoodieRecords(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

should we decouple the test then, the HoodieRecord iterator is a legacy API and should be deprecated already.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It is not a legacy API yet unfortunately. Looking through all the code to migrate, it seems like we have a while before we can retire the HoodieRecord

@hudi-bot
Copy link
Collaborator

CI report:

Bot commands @hudi-bot supports the following commands:
  • @hudi-bot run azure re-run the last Azure build

@nsivabalan nsivabalan merged commit 39d6645 into apache:master May 20, 2025
58 checks passed
@the-other-tim-brown the-other-tim-brown deleted the HUDI-9316-fgreader branch May 20, 2025 16:32
alexr17 pushed a commit to alexr17/hudi that referenced this pull request Aug 25, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

size:M PR with lines of changes in (100, 300]

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants