Skip to content

Conversation

@alexeykudinkin
Copy link
Contributor

Tips

What is the purpose of the pull request

Drafting RFC-46

Brief change log

See above

Verify this pull request

N/A

Committer checklist

  • Has a corresponding JIRA in PR title & commit

  • Commit message is descriptive of the change

  • CI is green

  • Necessary doc changes done or have another open PR

  • For large changes, please consider breaking it into sub-tasks under an umbrella JIRA.

While having a single format of the record representation is certainly making implementation of some components simpler,
it bears unavoidable performance penalty of de-/serialization loop: every record handled by Hudi has to be converted
from (low-level) engine-specific representation (`Row` for Spark, `DataRow` for Flink, `ArrayWritable` for Hive) into intermediate
one (Avro), with some operations (like clustering, compaction) potentially incurring this penalty multiple times (on read-
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

RowData for Flink

Stateless component interface providing for API Combining Records will look like following:

```java
interface HoodieRecordCombiningEngine {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Engine sounds like a class; maybe a shorter interface name HoodieSupportsCombine ?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Engine would be a class (where you will need to provide different impls for Query Engines you're planning on supporting)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ok so here it should be

Suggested change
interface HoodieRecordCombiningEngine {
class HoodieRecordCombiningEngine {

Comment on lines +112 to +114
To warrant backward-compatibility (BWC) on the code-level with already created subclasses of `HoodieRecordPayload` currently
already used in production by Hudi users, we will provide a BWC-bridge in the form of instance of `HoodieRecordCombiningEngine`, that will
be using user-defined subclass of `HoodieRecordPayload` to combine the records.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So in #3893 we have HoodieAvroRecord mostly for compatible migration purpose. Maybe we name it HoodieLegacyAvroRecord so code change can be manageable. A second migration would be needed for HoodieLegacyAvroRecord -> SparkHoodieRecord / FlinkHoodieRecord

Copy link
Contributor Author

@alexeykudinkin alexeykudinkin Jan 28, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Discussed offline: the plan is to continue on the migration path of #3893 and introduce HoodieAvroRecord as the intermediate step before we will migrate to engine-specific implementation (SparkHoodieRecord, FlinkHoodieRecord, etc)

Comment on lines 129 to 137
1. `WriteHandle`s will be
1. Accepting `HoodieRecord` instead of raw Avro payload (avoiding Avro conversion)
2. Using Combining API engine to merge records (when necessary)
3. Passes `HoodieRecord` as is to `FileWriter`
2. `FileWriter`s will be
1. Accepting `HoodieRecord`
2. Will be engine-specific (so that they're able to handle internal record representation)
3. `RecordReader`s
1. API will be returning opaque `HoodieRecord` instead of raw Avro payload
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

was trying to look for the exact classes: are they HoodieWriteHandle, HoodieFileWriter and HoodieFileReader ?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Correct

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ok would you fix the names so it reflects the exact class names?

- If we need special migration tools, describe them here.
- No special migration tools will be necessary (other than BWC-bridge to make sure users can use 0.11 out of the box, and there are no breaking changes to the public API)
- When will we remove the existing behavior
- In subsequent releases (either 0.12 or 0.13)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the roadmap is like after 0.12 it'll be 1.0

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

minor fix ☝️

@alexeykudinkin alexeykudinkin changed the title [HUDI-3318] Drafted RFC-46 [HUDI-3318] RFC-46 Jan 28, 2022
@hudi-bot
Copy link
Collaborator

hudi-bot commented Feb 1, 2022

CI report:

Bot commands @hudi-bot supports the following commands:
  • @hudi-bot run azure re-run the last Azure build

@xushiyan xushiyan changed the title [HUDI-3318] RFC-46 [HUDI-3318] [RFC-46] Optimize Record Payload handling Feb 1, 2022
@xushiyan xushiyan merged commit d3cfe07 into apache:master Feb 1, 2022
liusenhua pushed a commit to liusenhua/hudi that referenced this pull request Mar 1, 2022
vingov pushed a commit to vingov/hudi that referenced this pull request Apr 3, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants