-
Notifications
You must be signed in to change notification settings - Fork 2.5k
[HUDI-4814] Schedules new clustering plan based on latest clustering instant #6574
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Merged
Merged
Changes from all commits
Commits
Show all changes
10 commits
Select commit
Hold shift + click to select a range
6fcda13
Keep a clustering running at the same time.
d6d651e
empty commit
bcc7396
empty commit
c8bebb1
<=
6dd530a
format
277061f
remove the first filter
86efca5
remove double check
7ced8cc
add filter pending clustering instant
b158b5a
NONE
7bdc7a9
Simplify filtering logic
File filter
Filter by extension
Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
There are no files selected for viewing
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This and the condition below guarantee that the clustering is only scheduled based on the max_commits config. @eric9204 could you double check the logic?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@yihua yes, this is indeed a redundant inspection, I'm testing whether this condition is needed.
By adding these two conditions, it can really be guaranteed that only one clustering is running at the same time, and if there is no completed clustering, no new clustering plan will be generated.
Configure only these three parameters.
'clustering.schedule.enabled'='true',
'clustering.async.enabled'='false',
'clustering.delta_commits'='6',
Configure only these three parameters.
'clustering.schedule.enabled'='true',
'clustering.async.enabled'='true',
'clustering.delta_commits'='6',
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ok, so the issue we are trying to solve is:
there is a regular writer which just schedules clustering and we have a async clustering job which does the execution of clustering.
if clustering is pending (may be will be executed by an async clustering job), every new successful commit with regular writer will keep adding new replacecommit.requested.
If yes, then the fix makes sense to me.
@yihua @danny0405 : wdyt.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
but one thing which I am finding it hard to comprehend is. wrt clustering, either both planning and execution is inline. or both are async atleast wrt spark datasource writer. So, not sure how the user ended up where clustering was just scheduled w/o getting to completion.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, the fix makes sense to me too.
Flink writer schedules a clustering plan on each successful regular commit and there is a async pipeline that executes the clustering continuously, this patch can solve the problem that the clustering plan schedules too frequently if there is pending clustering.
So, +1 from my side.