[HUDI-3318] [RFC-46] Optimize Record Payload handling #4697
Conversation
> While having a single format of the record representation is certainly making implementation of some components simpler,
> it bears unavoidable performance penalty of de-/serialization loop: every record handled by Hudi has to be converted
> from (low-level) engine-specific representation (`Row` for Spark, `DataRow` for Flink, `ArrayWritable` for Hive) into intermediate
> one (Avro), with some operations (like clustering, compaction) potentially incurring this penalty multiple times (on read-
`RowData` for Flink (not `DataRow`)
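To make the cost described in the quoted hunk concrete, here is a minimal sketch of the de-/serialization loop. `EngineRow` and `AvroLikeRecord` are illustrative stand-ins for an engine-native row (Spark `Row`, Flink `RowData`) and an Avro `GenericRecord`; none of the names below are actual Hudi or Avro types.

```java
import java.util.HashMap;
import java.util.Map;

// Stand-in for an engine-native row (Spark Row / Flink RowData / ArrayWritable).
class EngineRow {
    final Map<String, Object> fields;
    EngineRow(Map<String, Object> fields) { this.fields = fields; }
}

// Stand-in for the intermediate Avro representation (GenericRecord).
class AvroLikeRecord {
    final Map<String, Object> fields;
    AvroLikeRecord(Map<String, Object> fields) { this.fields = fields; }
}

public class ConversionLoop {
    // Read path: engine representation -> intermediate Avro form (one full copy).
    static AvroLikeRecord toAvro(EngineRow row) {
        return new AvroLikeRecord(new HashMap<>(row.fields));
    }

    // Write path: intermediate Avro form -> engine representation (a second copy).
    static EngineRow fromAvro(AvroLikeRecord rec) {
        return new EngineRow(new HashMap<>(rec.fields));
    }

    public static void main(String[] args) {
        Map<String, Object> data = new HashMap<>();
        data.put("uuid", "r1");
        data.put("ts", 42L);

        // One handling step already costs two conversions; clustering and
        // compaction repeat the loop, multiplying the penalty per record.
        EngineRow roundTripped = fromAvro(toAvro(new EngineRow(data)));
        System.out.println(roundTripped.fields.get("uuid")); // prints r1
    }
}
```

The point of the RFC is that keeping records in their engine-native form end-to-end eliminates both copies.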
> Stateless component interface providing for API Combining Records will look like following:
>
> ```java
> interface HoodieRecordCombiningEngine {
> ```
Engine sounds like a class; maybe a shorter interface name HoodieSupportsCombine ?
Engine would be a class (where you will need to provide different impls for Query Engines you're planning on supporting)
ok, so here it should be:

```suggestion
class HoodieRecordCombiningEngine {
```
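The discussion above settles on a class rather than an interface because each query engine needs its own implementation. A hedged sketch of that shape, with simplified stand-in types (`String[]` for Spark's `InternalRow`, `Object[]` for Flink's `RowData`) rather than the real Hudi API:

```java
// Base type: one subclass per query engine, each combining two versions of a
// record in that engine's native representation. Names are illustrative.
abstract class RecordCombiningEngine<R> {
    abstract R combine(R older, R newer);
}

// Spark-specific engine; a real one would operate on InternalRow.
class SparkCombiningEngine extends RecordCombiningEngine<String[]> {
    @Override
    String[] combine(String[] older, String[] newer) {
        return newer; // newest-wins, the simplest combining rule
    }
}

// Flink-specific engine; a real one would operate on RowData.
class FlinkCombiningEngine extends RecordCombiningEngine<Object[]> {
    @Override
    Object[] combine(Object[] older, Object[] newer) {
        return newer;
    }
}
```

Each engine combines records without ever leaving its native representation, which is the whole motivation for making the engine pluggable.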
> To warrant backward-compatibility (BWC) on the code-level with already created subclasses of `HoodieRecordPayload` currently
> already used in production by Hudi users, we will provide a BWC-bridge in the form of instance of `HoodieRecordCombiningEngine`, that will
> be using user-defined subclass of `HoodieRecordPayload` to combine the records.
So in #3893 we have `HoodieAvroRecord` mostly for migration-compatibility purposes. Maybe we name it `HoodieLegacyAvroRecord` so the code change can be manageable. A second migration would then be needed from `HoodieLegacyAvroRecord` to `SparkHoodieRecord` / `FlinkHoodieRecord`.
Discussed offline: the plan is to continue on the migration path of #3893 and introduce `HoodieAvroRecord` as the intermediate step before we migrate to engine-specific implementations (`SparkHoodieRecord`, `FlinkHoodieRecord`, etc.)
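The BWC-bridge the quoted passage describes could look roughly like this: an implementation of the new combining API that delegates merging to a user-supplied legacy payload's `preCombine`, so existing `HoodieRecordPayload` subclasses keep working unchanged. The types below are simplified stand-ins for illustration, not the actual Hudi classes.

```java
// Stand-in for the legacy HoodieRecordPayload contract: user code decides
// which of two record versions wins.
interface LegacyPayload<T extends LegacyPayload<T>> {
    T preCombine(T other);
}

// Stand-in for the new combining API (HoodieRecordCombiningEngine in the RFC).
interface CombiningEngine<R> {
    R combine(R older, R newer);
}

// The bridge: new API on the outside, legacy preCombine semantics inside.
class LegacyPayloadBridge<T extends LegacyPayload<T>> implements CombiningEngine<T> {
    @Override
    public T combine(T older, T newer) {
        return newer.preCombine(older); // defer entirely to the user's payload
    }
}

// Example legacy payload: latest timestamp wins (OverwriteWithLatest-style).
class TsPayload implements LegacyPayload<TsPayload> {
    final long ts;
    TsPayload(long ts) { this.ts = ts; }
    @Override
    public TsPayload preCombine(TsPayload other) {
        return this.ts >= other.ts ? this : other;
    }
}
```

Because the bridge only wraps the payload, users upgrading to the new API get the exact merge semantics their existing payload subclass defines.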
rfc/rfc-46/rfc-46.md (Outdated)
> 1. `WriteHandle`s will be
>    1. Accepting `HoodieRecord` instead of raw Avro payload (avoiding Avro conversion)
>    2. Using Combining API engine to merge records (when necessary)
>    3. Passing `HoodieRecord` as is to `FileWriter`
> 2. `FileWriter`s will be
>    1. Accepting `HoodieRecord`
>    2. Engine-specific (so that they're able to handle internal record representation)
> 3. `RecordReader`s
>    1. API will be returning opaque `HoodieRecord` instead of raw Avro payload
was trying to look for the exact classes: are they `HoodieWriteHandle`, `HoodieFileWriter` and `HoodieFileReader`?
Correct
ok, would you fix the names so they reflect the exact class names?
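The write-path flow the quoted list describes (handle accepts records, combines duplicates, hands them to an engine-specific writer as-is) can be sketched minimally. All names below are illustrative stand-ins for `HoodieWriteHandle` / `HoodieFileWriter`, and the newest-timestamp-wins rule is an assumed example combining policy.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Stand-in for an opaque HoodieRecord: the handle never inspects the payload.
class Rec {
    final String key;
    final long ts;
    Rec(String key, long ts) { this.key = key; this.ts = ts; }
}

// Stand-in for an engine-specific HoodieFileWriter: accepts the record
// as-is, with no Avro conversion in between.
interface FileWriter {
    void write(Rec record);
}

// Stand-in for HoodieWriteHandle: combine, then pass through unchanged.
class WriteHandle {
    private final FileWriter writer;
    WriteHandle(FileWriter writer) { this.writer = writer; }

    void writeAll(List<Rec> incoming) {
        // Combine records sharing a key (newest timestamp wins here),
        // preserving first-seen key order.
        Map<String, Rec> merged = new LinkedHashMap<>();
        for (Rec r : incoming) {
            merged.merge(r.key, r,
                (older, newer) -> newer.ts >= older.ts ? newer : older);
        }
        merged.values().forEach(writer::write);
    }
}
```

Note the handle only calls the combining step and the writer; because both operate on the opaque record type, no intermediate representation ever appears in the flow.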
rfc/rfc-46/rfc-46.md (Outdated)
> - If we need special migration tools, describe them here.
>   - No special migration tools will be necessary (other than BWC-bridge to make sure users can use 0.11 out of the box, and there are no breaking changes to the public API)
> - When will we remove the existing behavior
>   - In subsequent releases (either 0.12 or 0.13)
the roadmap is that after 0.12 it'll be 1.0
minor fix ☝️
What is the purpose of the pull request
Drafting RFC-46
Brief change log
See above
Verify this pull request
N/A
Committer checklist
- Has a corresponding JIRA in PR title & commit
- Commit message is descriptive of the change
- CI is green
- Necessary doc changes done or have another open PR
For large changes, please consider breaking it into sub-tasks under an umbrella JIRA.