[HUDI-9591] FG reader based merge handle for COW merge #13580
Conversation
Force-pushed caa9dea to d76ce34.
nsivabalan left a comment:
High level seems to be ok.
We need to plug in all stats generation (RLI related stats and secondary index stats) within FG reader.
Once you have that wired in, I can review in detail
```java
  return new HoodieMergeHandleWithChangeLog<>(writeConfig, instantTime, table, recordItr, partitionPath, fileId, taskContextSupplier, keyGeneratorOpt);
} else {
  if (readerContextOpt.isPresent()) {
    return new FileGroupReaderBasedMergeHandle<>(
```
After this patch, is the regular HoodieMergeHandle even used anywhere? Why can't we remove it?
We can remove HoodieMergeHandle from this factory after this patch. I just want to be safe and plug it into Spark for now. I am trying to fix the failures and make it work for Spark first.
Force-pushed d76ce34 to 28a6364.
nsivabalan left a comment:
High-level feedback:
- The FG reader merge handle is not yet fixed for callers on the COW merge code paths.
- We need to fix BaseMergeHelper for the COW merge code paths.
- Secondary index stats generation looks OK.
```java
private HoodieReadStats readStats;
private final HoodieRecord.HoodieRecordType recordType;
private final Option<HoodieCDCLogger> cdcLogger;
private final Iterator<HoodieRecord<T>> recordIterator;
```
Can we make this an Option instead of dealing with nulls?
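For illustration, a minimal sketch of the suggested shape, assuming the iterator is only sometimes supplied (the `Opt` suffix and the usage site are illustrative, not the actual code):

```java
// Wrap the input iterator in Hudi's Option so callers branch on presence
// explicitly instead of null-checking the field.
private final Option<Iterator<HoodieRecord<T>>> recordIteratorOpt;

// At the call site:
if (recordIteratorOpt.isPresent()) {
  Iterator<HoodieRecord<T>> it = recordIteratorOpt.get();
  // drive the merge from the supplied input records
}
```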
```java
public class InputBasedFileGroupRecordBuffer<T> extends KeyBasedFileGroupRecordBuffer<T> {
  private final Iterator<T> inputRecordIterator;

  public InputBasedFileGroupRecordBuffer(HoodieReaderContext readerContext,
```
Naming suggestion: RecordIteratorBasedFileGroupRecordBuffer.
Force-pushed 718832d to 172ef47.
nsivabalan left a comment:
Feedback is not yet fully addressed :(
I did call out that BaseMergeHelper needs to be fixed so that the COW merge uses the new way of merging, i.e. instead of calling mergeHandle.write(HoodieRecord<T> oldRecord) we need to leverage mergeHandle.write() so that the FG reader will be used.
How did you even validate that the new FG reader is used for COW merges? Can you point me to the test case where you validated this?
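To make the suggested control flow concrete, a rough sketch (the instanceof dispatch and the iterator name are simplifying assumptions, not the actual BaseMergeHelper code):

```java
// Legacy flow: the helper iterates the base file and pushes each old record
// into the handle, one record at a time.
//   while (baseFileRecordIterator.hasNext()) {
//     mergeHandle.write(baseFileRecordIterator.next());
//   }

// Suggested flow: let the handle pull merged records through the FG reader itself.
if (mergeHandle instanceof FileGroupReaderBasedMergeHandle) {
  ((FileGroupReaderBasedMergeHandle) mergeHandle).write(); // FG-reader-driven merge
} else {
  while (baseFileRecordIterator.hasNext()) {
    mergeHandle.write(baseFileRecordIterator.next());      // record-by-record merge
  }
}
```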
```java
try (HoodieFileGroupReader<T> fileGroupReader = HoodieFileGroupReader.<T>newBuilder().withReaderContext(readerContext).withHoodieTableMetaClient(hoodieTable.getMetaClient())
    .withLatestCommitTime(maxInstantTime).withPartitionPath(partitionPath).withBaseFileOption(Option.ofNullable(baseFileToMerge)).withLogFiles(logFiles)
    .withDataSchema(writeSchemaWithMetaFields).withRequestedSchema(writeSchemaWithMetaFields).withInternalSchema(internalSchemaOption).withProps(props)
try (HoodieFileGroupReader<T> fileGroupReader = HoodieFileGroupReader.<T>newBuilder()
```
Can we instantiate the builder outside, and then only call one of the two:

```java
if (operation.isEmpty()) {
  fileGroupReaderBuilder.setLogFiles(logFiles);
} else {
  fileGroupReaderBuilder.withRecordIterator(engineRecordIterator);
}
fileGroupReaderBuilder.build();
```
I think I did something similar. Anyway, I am grouping some of the attributes separately.
```java
@Override
public void onUpdate(String recordKey, T previousRecord, T mergedRecord) {
  HoodieKey hoodieKey = new HoodieKey(recordKey, partitionPath);
  BufferedRecord<T> bufferedPrevousRecord = BufferedRecord.forRecordWithContext(
```
Minor typo: it should be bufferedPreviousRecord; the "i" is missing in "previous".
Done.
```java
    mergedRecord, writeSchemaWithMetaFields, readerContext, Option.empty(), false);
SecondaryIndexStreamingTracker.trackSecondaryIndexStats(
    hoodieKey,
    Option.of(readerContext.constructHoodieRecord(bufferedMergedRecord)),
```
So, internally (within the FG reader), we translate the HoodieRecord to the engine-specific representation and wrap it in a BufferedRecord (FG reader processing). And here, we get notified with the engine-specific record, create a BufferedRecord, and then construct a HoodieRecord from that BufferedRecord.
Why not directly create the HoodieRecord from the engine-specific representation here?
Instead of converting to HoodieRecord, can we just directly use the field accessors provided by the ReaderContext?
The secondary index key can contain multiple columns. AFAIK, readerContext.getValue is only for a single column. We can revisit this here.

> why not directly create HoodieRecord from engine specific representation here?

I think the reason is that HoodieRecord provides the API to fetch column values, while the engine-specific representation does not.
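For reference, a rough sketch of what a multi-column variant built directly on the ReaderContext accessor could look like (the helper name and the dot separator are hypothetical; it assumes the single-column getValue(record, schema, fieldName) accessor discussed above):

```java
// Hypothetical helper: compose a secondary-index key from several columns by
// reading each one via the engine-agnostic accessor, instead of first
// materializing a HoodieRecord. Requires java.util.stream.Collectors.
private String buildCompositeKey(T engineRecord, Schema schema, List<String> keyColumns,
                                 HoodieReaderContext<T> readerContext) {
  return keyColumns.stream()
      .map(col -> String.valueOf(readerContext.getValue(engineRecord, schema, col)))
      .collect(Collectors.joining(".")); // separator is illustrative only
}
```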
```java
HoodieReaderContext<T> readerContext = table.getContext().<T>getReaderContextFactory(table.getMetaClient()).getContext();
mergeHandle = HoodieMergeHandleFactory.create(
    operationType, config, instantTime, table, recordItr, partitionPath, fileId, taskContextSupplier, keyGeneratorOpt,
    readerContext, HoodieRecord.HoodieRecordType.SPARK);
```
We can't blindly choose Spark as the record type here. HoodieWriteConfig exposes getRecordMerger(), which in turn exposes getRecordType().
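A minimal sketch of the suggested lookup, using the two accessors named above (the create(...) call mirrors the diff; the exact argument order is assumed):

```java
// Derive the record type from the configured merger instead of hardcoding SPARK.
HoodieRecord.HoodieRecordType recordType = config.getRecordMerger().getRecordType();
mergeHandle = HoodieMergeHandleFactory.create(
    operationType, config, instantTime, table, recordItr, partitionPath, fileId,
    taskContextSupplier, keyGeneratorOpt, readerContext, recordType);
```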
Right.
```java
try {
  invoker.invoke(callback);
} catch (Exception e) {
  LOG.error(String.format("Callback %s failed: ", callback.getName()), e);
```
Can we use {} placeholders instead of String.format here?
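For example, the parameterized SLF4J form (same output, no eager String.format, and a throwable as the last argument still gets its stack trace logged):

```java
LOG.error("Callback {} failed: ", callback.getName(), e);
```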
the-other-tim-brown left a comment:
@linliu-code can you ensure that there are tests that include the validation of the commit stats for this path? In the past we have seen a lot of deviations from how the stats are computed with the new FGReader based handles.
Is this ready for the final review?
```java
HoodieMergeHandle mergeHandle = HoodieMergeHandleFactory.create(operationType, config, instantTime, table, recordItr, partitionPath, fileId,
    taskContextSupplier, keyGeneratorOpt);
HoodieMergeHandle mergeHandle;
if (config.getMergeHandleClassName().equals(FileGroupReaderBasedMergeHandle.class.getName())) {
```
This does not take effect unless we switch the default value of the HoodieWriteConfig.MERGE_HANDLE_CLASS_NAME config property.
Yes, I updated it.
I tried executing a simple COW merge with this patch, and I do see we are hitting the BaseMergeHelper.consume() method :( I left a comment above here #13580 wrt the fix we might need to make.
Force-pushed 544ffb4 to 0c2d7fb.
Force-pushed 0c2d7fb to cd68aee.
Will add tests properly after the functional tests are runnable.
Force-pushed cd68aee to 60ea304.
| + "columns are disabled. Please choose the right key generator if you wish to disable meta fields.", e); | ||
| } | ||
| } | ||
| if (config.getMergeHandleClassName().equals(FileGroupReaderBasedMergeHandle.class.getName())) { |
We should fix this in the HoodieMergeHandleFactory instead of here.
```java
}
return new BufferedRecord<>(recordKey, record.getOrderingValue(schema, props), record.getData(), schemaId, isDelete);
T row = record.getData();
return new BufferedRecord<>(recordKey, readerContext.getOrderingValue(row, schema, orderingFieldName), row, schemaId, isDelete);
```
We should use readerContext.getOrderingValue(...) instead of record.getOrderingValue(...) to unify the ordering value format in BufferedRecord; record.getOrderingValue(...) is engine-agnostic (it may be used for serialized values such as DeleteRecord).
```java
  }
}

private static class CompositeCallback<T> implements BaseFileUpdateCallback<T> {
```
We can implement RLI tracking in a similar way.
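Sketching that idea: a composite callback can fan each event out to several trackers (secondary-index stats today, RLI stats later). Names follow the BaseFileUpdateCallback and onUpdate signature shown in the diff above; any other callback methods would be fanned out the same way:

```java
// Fan out each base-file update event to all registered trackers.
private static class CompositeCallback<T> implements BaseFileUpdateCallback<T> {
  private final List<BaseFileUpdateCallback<T>> callbacks;

  CompositeCallback(List<BaseFileUpdateCallback<T>> callbacks) {
    this.callbacks = callbacks;
  }

  @Override
  public void onUpdate(String recordKey, T previousRecord, T mergedRecord) {
    // e.g. one callback tracks secondary-index stats, another could track RLI stats
    callbacks.forEach(cb -> cb.onUpdate(recordKey, previousRecord, mergedRecord));
  }
}
```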
yihua left a comment:
The changes are covered by #13699, which is merged.
Change Logs
This PR introduces:
Impact
Unifies the COW write path using the FG reader.
Risk level
Medium.
Documentation Update
Describe any necessary documentation update if there is any new feature, config, or user-facing change. If not, put "none".
Contributor's checklist