[HUDI-9248] Unify code paths for all write operations about bulk_insert #13360

Merged
danny0405 merged 3 commits into apache:master from TheR1sing3un:feat_unify_bulk_insert on Jun 4, 2025
@TheR1sing3un
Member

refactor: Unify the code paths for all bulk_insert operations to improve code readability and maintainability

Change Logs

  1. Unify all the code paths of bulk insert operations

Impact

Improve code maintainability

Risk level (write none, low, medium, or high below)

low

Documentation Update

none

  • The config description must be updated if new configs are added or the default values of configs are changed
  • Any new feature or user-facing change requires updating the Hudi website. Please create a Jira ticket, attach the
    ticket number here, and follow the instructions to make
    changes to the website.

Contributor's checklist

  • Read through contributor's guide
  • Change Logs and Impact were stated clearly
  • Adequate tests were added if applicable
  • CI passed

1. Unify all the code paths of bulk insert operations

Signed-off-by: TheR1sing3un <chaoyang@apache.org>
@github-actions github-actions bot added the size:M PR with lines of changes in (100, 300] label May 26, 2025
1. fix ut

Signed-off-by: TheR1sing3un <chaoyang@apache.org>
…it metadata

1. pass the extra metadata about spark-streaming checkpoint to commit metadata

Signed-off-by: TheR1sing3un <chaoyang@apache.org>
@TheR1sing3un TheR1sing3un force-pushed the feat_unify_bulk_insert branch from 6f93787 to 791ef28 Compare May 27, 2025 10:53
@TheR1sing3un
Member Author

@hudi-bot run azure

@hudi-bot
Collaborator

CI report:

Bot commands
@hudi-bot supports the following commands:
  • `@hudi-bot run azure`: re-run the last Azure build

@TheR1sing3un
Member Author

All checks passed

@TheR1sing3un
Member Author

@danny0405 @zhangyue19921010 Hi Danny, Yue, I closed the previous PR (#13066) and reopened it as this PR. The changes here are based on the suggestions from your last review. Could you start the review again?


}
}
val (writeSuccessful, compactionInstant, clusteringInstant) = commitAndPerformPostOperations(
Contributor

Is this change because `overwriteOperationType` is never null?

Member Author

In the previous execution logic, the bulk insert write path that was not an overwrite already included its own `commit` logic, so only `meta sync` had to be performed here. Now that all the bulk insert code paths are integrated, all the `commit` logic is carried out here.
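
The shift in where `commit` happens can be sketched as below. This is a minimal, hypothetical model of the control flow described in this comment, not Hudi code: the class `BulkInsertFlowSketch`, the `Op` enum, and the step labels are all invented for illustration, and the assumption that the overwrite variants previously left the commit to the caller is inferred from the comment rather than confirmed by the diff.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical model only; none of these names are Hudi APIs.
final class BulkInsertFlowSketch {

    enum Op { BULK_INSERT, INSERT_OVERWRITE }

    // Before the refactor, as described in the comment: the plain bulk insert
    // path committed inside the writer, so the caller only ran meta sync.
    // Assumption: the overwrite variants left the commit to the caller.
    static List<String> stepsBefore(Op op) {
        List<String> steps = new ArrayList<>();
        steps.add("writer: write");
        if (op == Op.BULK_INSERT) {
            steps.add("writer: commit");
        } else {
            steps.add("caller: commit");
        }
        steps.add("caller: meta sync");
        return steps;
    }

    // After the refactor: every bulk insert variant follows the same path,
    // and the caller is the single place where the commit happens.
    static List<String> stepsAfter(Op op) {
        List<String> steps = new ArrayList<>();
        steps.add("writer: write");
        steps.add("caller: commit");
        steps.add("caller: meta sync");
        return steps;
    }
}
```

Calling `stepsBefore` and `stepsAfter` for each `Op` value shows the two operations diverging before the change and producing the same step list after it, which is the maintainability gain the PR description refers to.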

+ " To use row writer please switch to spark 3");
}

records.write().format(targetFormat)
Contributor

Can you elaborate on why the logic was customized for this executor before?

Member Author

> Can you elaborate on why the logic was customized for this executor before?

I think the timeline went like this.
First, there was only the normal bulk insert logic, which at that time used the data source v2 interface directly to perform writes.
Later, boneanxs proposed using bulk insert to perform other operations such as overwrite, but the code paths were not integrated at that time; the existing logic for this part was retained.
For reference, see #8076.

@TheR1sing3un TheR1sing3un requested a review from danny0405 June 4, 2025 09:31
@danny0405 danny0405 merged commit 1533bac into apache:master Jun 4, 2025
58 checks passed
alexr17 pushed a commit to alexr17/hudi that referenced this pull request Aug 25, 2025
…rt (apache#13360)

* refactor: Unify all the code paths of bulk insert operations

---------

Signed-off-by: TheR1sing3un <chaoyang@apache.org>