Conversation

@linliu-code (Collaborator) commented Jul 19, 2025

Change Logs

This PR introduces:

  1. An input-record-based record buffer that builds the internal buffer directly from the record iterator.
  2. A new constructor in HoodieFileGroupReader that uses the above record buffer (see the sketch after this list).
  3. Enabling FileGroupReaderBasedMergeHandle to use this FG reader API.
  4. Secondary index (SI) support for FileGroupReaderBasedMergeHandle through callbacks.
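
As a rough illustration of items 1 and 2, here is a minimal sketch of how a COW merge could feed its incoming records straight into the file group reader. The builder calls are taken from this PR's diff and review comments (withRecordIterator in particular is quoted from a review suggestion), so the final API may differ:

// Hypothetical sketch; assumes readerContext, metaClient, incomingRecords,
// and the schema/props variables are in scope at the call site.
HoodieFileGroupReader<T> fileGroupReader = HoodieFileGroupReader.<T>newBuilder()
    .withReaderContext(readerContext)
    .withHoodieTableMetaClient(metaClient)
    .withLatestCommitTime(maxInstantTime)
    .withPartitionPath(partitionPath)
    .withBaseFileOption(Option.ofNullable(baseFileToMerge))
    .withRecordIterator(incomingRecords)  // new: buffer is built from the input iterator instead of log files
    .withDataSchema(writeSchemaWithMetaFields)
    .withRequestedSchema(writeSchemaWithMetaFields)
    .withProps(props)
    .build();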

Impact

Unify COW write path using FG reader.

Risk level (write none, low, medium, or high below)

Medium.

Documentation Update

Describe any necessary documentation update if there is any new feature, config, or user-facing change. If not, put "none".

  • The config description must be updated if new configs are added or the default value of a config is changed.
  • Any new feature or user-facing change requires updating the Hudi website. Please create a Jira ticket, attach the ticket number here, and follow the instructions to make changes to the website.

Contributor's checklist

  • Read through contributor's guide
  • Change Logs and Impact were stated clearly
  • Adequate tests were added if applicable
  • CI passed

@github-actions github-actions bot added the size:M (PR with lines of changes in (100, 300]) label on Jul 19, 2025
@linliu-code linliu-code force-pushed the HUDI-9591-FGReader branch 2 times, most recently from caa9dea to d76ce34 on July 19, 2025 17:00
@nsivabalan (Contributor) left a comment

High level seems to be OK.
We need to plug all stats generation (RLI-related stats and secondary index stats) into the FG reader.
Once you have that wired in, I can review in detail.

return new HoodieMergeHandleWithChangeLog<>(writeConfig, instantTime, table, recordItr, partitionPath, fileId, taskContextSupplier, keyGeneratorOpt);
} else {
if (readerContextOpt.isPresent()) {
return new FileGroupReaderBasedMergeHandle<>(
Contributor

After this patch, is the regular HoodieMergeHandle even used anywhere?
Why can't we remove it?

Collaborator Author

We can remove HoodieMergeHandle from this factory after this patch. I just want to be safe and plug it into Spark for now. I am trying to fix the failures and make it work for Spark first.

@linliu-code linliu-code marked this pull request as ready for review on July 20, 2025 15:19
@github-actions github-actions bot added the size:L (PR with lines of changes in (300, 1000]) label and removed the size:M (PR with lines of changes in (100, 300]) label on Jul 20, 2025
@nsivabalan (Contributor) left a comment

High-level feedback:

  • The FG reader merge handle is not yet wired up for callers on the COW merge code paths.
  • We need to fix BaseMergeHelper for the COW merge code paths.
  • Secondary index stats generation looks OK.

private HoodieReadStats readStats;
private final HoodieRecord.HoodieRecordType recordType;
private final Option<HoodieCDCLogger> cdcLogger;
private final Iterator<HoodieRecord<T>> recordIterator;
Contributor

Can we make this optional instead of dealing with nulls?

public class InputBasedFileGroupRecordBuffer<T> extends KeyBasedFileGroupRecordBuffer<T> {
private final Iterator<T> inputRecordIterator;

public InputBasedFileGroupRecordBuffer(HoodieReaderContext readerContext,
Contributor

Suggest renaming to RecordIteratorBasedFileGroupRecordBuffer.

@nsivabalan (Contributor) left a comment

Feedback is not yet fully addressed :(
I did call out that BaseMergeHelper needs to be fixed so that the COW merge uses the new way of merging, i.e. instead of calling

mergeHandle.write(HoodieRecord<T> oldRecord)

we need to leverage

mergeHandle.write()

so that the FG reader will be used.
How did you even validate that the new FG reader is used for COW merges?
Can you point me to the test case where you validated this?
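
For clarity, the contrast being requested looks roughly like this; a minimal sketch, assuming the no-arg write() shape described in this thread rather than a confirmed final signature:

// Legacy COW merge: BaseMergeHelper iterates the old base file and asks the
// handle to merge one old record at a time.
while (oldRecords.hasNext()) {
  mergeHandle.write(oldRecords.next());
}

// FG-reader-based merge: the handle drives HoodieFileGroupReader internally,
// so the caller triggers the whole merge with a single call.
mergeHandle.write();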

try (HoodieFileGroupReader<T> fileGroupReader = HoodieFileGroupReader.<T>newBuilder().withReaderContext(readerContext).withHoodieTableMetaClient(hoodieTable.getMetaClient())
.withLatestCommitTime(maxInstantTime).withPartitionPath(partitionPath).withBaseFileOption(Option.ofNullable(baseFileToMerge)).withLogFiles(logFiles)
.withDataSchema(writeSchemaWithMetaFields).withRequestedSchema(writeSchemaWithMetaFields).withInternalSchema(internalSchemaOption).withProps(props)
try (HoodieFileGroupReader<T> fileGroupReader = HoodieFileGroupReader.<T>newBuilder()
Contributor

Can we instantiate the builder outside, and only call one of the following:

if (operation.isEmpty()) {
   fileGroupReaderBuilder.setLogFiles(logFiles);
} else {
   fileGroupReaderBuilder.withRecordIterator(engineRecordIterator); 
}

fileGroupReaderBuilder.build(); 

Collaborator Author

I think I did something similar. Anyway, I am grouping some of their attributes separately.

@Override
public void onUpdate(String recordKey, T previousRecord, T mergedRecord) {
HoodieKey hoodieKey = new HoodieKey(recordKey, partitionPath);
BufferedRecord<T> bufferedPrevousRecord = BufferedRecord.forRecordWithContext(
Contributor

Minor typo: it should be bufferedPreviousRecord; the "i" is missing in "previous".

Collaborator Author

Done.

mergedRecord, writeSchemaWithMetaFields, readerContext, Option.empty(), false);
SecondaryIndexStreamingTracker.trackSecondaryIndexStats(
hoodieKey,
Option.of(readerContext.constructHoodieRecord(bufferedMergedRecord)),
Contributor

So, internally (within the FG reader), we translate HoodieRecord to the engine-specific representation and wrap it using BufferedRecord (FG reader processing). And here, we get notifications on the engine-specific record, create a BufferedRecord, and then construct a HoodieRecord from that BufferedRecord.

Why not directly create the HoodieRecord from the engine-specific representation here?

Contributor

Instead of converting to HoodieRecord, can we just directly use the field accessors provided by the ReaderContext?

Collaborator Author

The secondary index key can contain multiple columns. AFAIK, readerContext.getValue is only for a single column. We can revisit this.

Why not directly create the HoodieRecord from the engine-specific representation here?

I think the reason is that HoodieRecord provides the API to fetch column values, while the engine-specific representation does not.
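
A minimal sketch of the multi-column case under discussion, assuming a composite key is assembled by calling readerContext.getValue once per secondary-key column; the delimiter and null handling are illustrative only:

// Illustrative sketch; assumes engineRecord, readerContext, the write schema,
// and secondaryKeyColumns are in scope.
StringBuilder secondaryKey = new StringBuilder();
for (String column : secondaryKeyColumns) {
  Object value = readerContext.getValue(engineRecord, writeSchemaWithMetaFields, column);
  if (secondaryKey.length() > 0) {
    secondaryKey.append('.');  // assumed delimiter between key parts
  }
  secondaryKey.append(value == null ? "__null__" : value.toString());
}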

HoodieReaderContext<T> readerContext = table.getContext().<T>getReaderContextFactory(table.getMetaClient()).getContext();
mergeHandle = HoodieMergeHandleFactory.create(
operationType, config, instantTime, table, recordItr, partitionPath, fileId, taskContextSupplier, keyGeneratorOpt,
readerContext, HoodieRecord.HoodieRecordType.SPARK);
Contributor

We can't blindly choose Spark as the record type here.
HoodieWriteConfig exposes getRecordMerger(), which in turn exposes getRecordType().
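
A one-line sketch of the suggested fix, using the two accessors named above:

// Derive the record type from the configured merger instead of hardcoding SPARK.
HoodieRecord.HoodieRecordType recordType = config.getRecordMerger().getRecordType();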

Collaborator Author

Right.

try {
invoker.invoke(callback);
} catch (Exception e) {
LOG.error(String.format("Callback %s failed: ", callback.getName()), e);
Contributor

Can we use {} here?
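
The suggestion refers to SLF4J parameterized logging, i.e. something like:

// {} placeholders defer formatting until the log level is enabled; passing the
// exception as the last argument preserves the stack trace.
LOG.error("Callback {} failed", callback.getName(), e);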

@the-other-tim-brown (Contributor) left a comment

@linliu-code can you ensure that there are tests that include validation of the commit stats for this path? In the past we have seen a lot of deviations in how the stats are computed with the new FGReader-based handles.

@linliu-code linliu-code changed the title from "[HUDI-9591] Add support of record iterator input for FG reader based merge handle" to "[HUDI-9591] FG reader based merge handle for COW merge" on Jul 23, 2025
@vinothchandar (Member)

Is this ready for the final review?

HoodieMergeHandle mergeHandle = HoodieMergeHandleFactory.create(operationType, config, instantTime, table, recordItr, partitionPath, fileId,
taskContextSupplier, keyGeneratorOpt);
HoodieMergeHandle mergeHandle;
if (config.getMergeHandleClassName().equals(FileGroupReaderBasedMergeHandle.class.getName())) {
Contributor

This does not take effect unless we switch the default value of the HoodieWriteConfig.MERGE_HANDLE_CLASS_NAME config property.
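
As a hedged sketch (assuming MERGE_HANDLE_CLASS_NAME is a standard ConfigProperty), a job could opt in explicitly until the default is switched:

// Illustrative only: opt into the FG-reader-based handle via the write config.
Properties props = new Properties();
props.put(HoodieWriteConfig.MERGE_HANDLE_CLASS_NAME.key(),
    FileGroupReaderBasedMergeHandle.class.getName());
HoodieWriteConfig config = HoodieWriteConfig.newBuilder()
    .withPath(basePath)
    .withProperties(props)
    .build();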

Collaborator Author

Yes, I updated it.

@nsivabalan (Contributor) commented Jul 24, 2025

Feedback is not yet fully addressed :( I did call out that BaseMergeHelper needs to be fixed so that the COW merge uses the new way of merging, i.e. instead of calling

mergeHandle.write(HoodieRecord<T> oldRecord)

we need to leverage

mergeHandle.write()

so that the FG reader will be used. How did you even validate that the new FG reader is used for COW merges? Can you point me to the test case where you validated this?

I tried executing a simple COW merge with this patch, and I do see we are hitting the BaseMergeHelper.consume() method :(
Then I added a breakpoint in the FileGroupReaderBasedMergeHandle constructor and I don't see it being invoked. With this patch, we are still using HoodieWriteMergeHandle. :(

I left a comment above here #13580 wrt the fix we might need to make.

@linliu-code (Collaborator Author)

@linliu-code can you ensure that there are tests that include validation of the commit stats for this path? In the past we have seen a lot of deviations in how the stats are computed with the new FGReader-based handles.

Will add tests properly after the functional tests are runnable.

+ "columns are disabled. Please choose the right key generator if you wish to disable meta fields.", e);
}
}
if (config.getMergeHandleClassName().equals(FileGroupReaderBasedMergeHandle.class.getName())) {
Contributor

We should fix this in HoodieMergeHandleFactory instead of here.

}
return new BufferedRecord<>(recordKey, record.getOrderingValue(schema, props), record.getData(), schemaId, isDelete);
T row = record.getData();
return new BufferedRecord<>(recordKey, readerContext.getOrderingValue(row, schema, orderingFieldName), row, schemaId, isDelete);
@danny0405 (Contributor) commented Jul 25, 2025

We should use readerContext.getOrderingValue() instead of record.getOrderingValue() to unify the ordering value format in BufferedRecord; record.getOrderingValue() is engine-agnostic (it may be used for serialized values such as DeleteRecord).

@hudi-bot (Collaborator)

CI report:

Bot commands supported by @hudi-bot:
  • @hudi-bot run azure: re-run the last Azure build

}
}

private static class CompositeCallback<T> implements BaseFileUpdateCallback<T> {
Contributor

We can implement RLI tracking in a similar way.
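
A hedged sketch of the fan-out idea: CompositeCallback delegates each notification to every registered callback, so an RLI tracker could be registered next to the SI one. Only onUpdate (the method visible in this diff) is shown; any other BaseFileUpdateCallback methods would delegate the same way:

// Sketch based on the CompositeCallback declared in this diff.
private static class CompositeCallback<T> implements BaseFileUpdateCallback<T> {
  private final List<BaseFileUpdateCallback<T>> delegates;

  CompositeCallback(List<BaseFileUpdateCallback<T>> delegates) {
    this.delegates = delegates;
  }

  @Override
  public void onUpdate(String recordKey, T previousRecord, T mergedRecord) {
    // Fan out to SI, RLI, or any other per-record stats tracker.
    for (BaseFileUpdateCallback<T> delegate : delegates) {
      delegate.onUpdate(recordKey, previousRecord, mergedRecord);
    }
  }
}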

@yihua (Contributor) left a comment

The changes are covered by #13699, which is merged.

@yihua yihua closed this Nov 29, 2025