-
Notifications
You must be signed in to change notification settings - Fork 2.5k
[HUDI-3430] Fix Deltastreamer to properly shut down the services upon failure #4824
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[HUDI-3430] Fix Deltastreamer to properly shut down the services upon failure #4824
Conversation
zhangyue19921010
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM just a minor question :) Thanks for this contribution.
| } | ||
| } finally { | ||
| shutdownAsyncServices(error); | ||
| executor.shutdownNow(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hi ethan, we could use shutdownAsyncServices(error)to close async compaction and async clustering properly, and quit the loop/finished Thread directly. Is it necessary to still call executor.shutdownNow();here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I tried to remove it, but then the DeltaStreamer Spark app would not exit. This is to shut down the DeltaSync service's executor thread pool.
xushiyan
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
hudi-client/hudi-client-common/src/main/java/org/apache/hudi/async/HoodieAsyncService.java
Outdated
Show resolved
Hide resolved
hudi-utilities/src/main/java/org/apache/hudi/utilities/deltastreamer/HoodieDeltaStreamer.java
Show resolved
Hide resolved
hudi-utilities/src/main/java/org/apache/hudi/utilities/deltastreamer/HoodieDeltaStreamer.java
Outdated
Show resolved
Hide resolved
nsivabalan
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
one nit clarification. really good triaging and fix.
What is the purpose of the pull request
In the continuous mode with async table services turned on such as compaction and clustering, Deltastreamer does not properly shut down the async table services and the delta sync service upon failure from the commit action not in the table service. This causes the Spark application to hang without doing any work, and the Spark app does not exit in this case. The root cause is due to a cyclic shutdown logic causing the delta sync service and the async table service to wait for each other to terminate. This PR fixes the shutdown logic so that after any failure, from the commit action or async table services, the Deltastreamer can shut down all services properly and the Spark app can exit.
Specifically, through Remote JVM Debug and YourKit Java profiling, the sequence of events below during shutdown is followed, causing endless waiting, and the executor of
DeltaSyncServiceis not properly shut down:Brief change log
Verify this pull request
Three failure cases are tested locally and Deltastreamer exits in all of them:
(See how the failures are triggered through code at the end)
Sequence of events in Case 1 after the fix:
Sequence of events in Case 2 after the fix:
Sequence of events in Case 3 after the fix:
Simulate failures:
Clean:
BaseHoodieWriteClient.javaClustering:
SparkExecuteClusteringCommitActionExecutor.javaCommit:
BaseHoodieWriteClient.javaCommitter checklist
Has a corresponding JIRA in PR title & commit
Commit message is descriptive of the change
CI is green
Necessary doc changes done or have another open PR
For large changes, please consider breaking it into sub-tasks under an umbrella JIRA.