[SPARK-46006][YARN] YarnAllocator miss clean targetNumExecutorsPerResourceProfileId after YarnSchedulerBackend call stop #43906
Conversation
This is the root cause of #38622, cc @tgravescs @pan3793. With only PR #38622, YarnAllocator still allocates many failing containers until sc.stop() has finished.
Do you think we can add a test for the reported case?
Is this a regression in Spark 3.4.0?
@AngersZhuuuu thanks for digging into this. I think this is a good fix, and SPARK-39601 may still be a valid supplement for the same kind of issue: executors launching after the driver has shut down and then erroring out.
tgravescs left a comment
The change makes sense based on the previous logic from before the stage-level scheduling work.
"We hit a case where the user calls sc.stop() after all custom code has run, but the stop gets stuck somewhere."
What does this mean? Did the user create a daemon thread pool or something else that blocks sc.stop(), or why was it stuck?
Why it got stuck is still under investigation, since there are no logs or a reproducible scene from the stuck run. I have added some code in our production environment to print the stack trace when this situation happens again. If I find the reason, I'll update here.
This PR aims to stop allocating new executors; https://issues.apache.org/jira/browse/SPARK-39601 can additionally avoid pending allocation requests causing the app to fail.
In our production environment, we saw ContextCleaner get stuck while waiting for a reply. The cleaner stops before the DAGScheduler, but in this PR's case DAGScheduler.stop() had already been called. cc @pan3793 @tgravescs
resource-managers/yarn/src/main/scala/org/apache/spark/deploy/yarn/YarnAllocator.scala
[SPARK-46006][YARN] YarnAllocator miss clean targetNumExecutorsPerResourceProfileId after YarnSchedulerBackend call stop
Closes #43906 from AngersZhuuuu/SPARK-46006. Authored-by: Angerszhuuuu <[email protected]> Signed-off-by: Kent Yao <[email protected]> (cherry picked from commit 06635e2) Signed-off-by: Kent Yao <[email protected]>
Thanks @AngersZhuuuu for the fix. Thanks @dongjoon-hyun @tgravescs and @pan3793 for the review. Merged to master and 3.5.1, 3.4.2, 3.3.4.
} else {
  false
if (resourceProfileToTotalExecs.isEmpty) {
  targetNumExecutorsPerResourceProfileId.clear()
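A minimal, self-contained sketch of the behaviour this hunk addresses; the object and simplified method below are illustrative stand-ins, not the real YarnAllocator code. The point is that when the incoming resource-profile map is empty (the final request sent while the backend is stopping), the per-profile targets must be dropped so no further containers are requested.

```scala
import scala.collection.mutable

object AllocatorTargetSketch {
  // resource-profile id -> desired number of executors
  private val targetNumExecutorsPerResourceProfileId = mutable.HashMap[Int, Int](0 -> 4)

  // Simplified analogue of requestTotalExecutorsWithPreferredLocalities.
  def requestTotal(resourceProfileToTotalExecs: Map[Int, Int]): Unit = {
    if (resourceProfileToTotalExecs.isEmpty) {
      // The fix: an empty request clears the targets so the allocate loop
      // stops asking YARN for more containers.
      targetNumExecutorsPerResourceProfileId.clear()
    } else {
      resourceProfileToTotalExecs.foreach { case (rpId, num) =>
        targetNumExecutorsPerResourceProfileId(rpId) = num
      }
    }
  }

  def main(args: Array[String]): Unit = {
    requestTotal(Map.empty)
    // Without the fix the stale target (0 -> 4) would survive the stop and
    // every allocate round would keep creating container requests.
    assert(targetNumExecutorsPerResourceProfileId.isEmpty)
    println(s"targets after empty request: $targetNumExecutorsPerResourceProfileId")
  }
}
```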
@AngersZhuuuu Would it be better to set the values in targetNumExecutorsPerResourceProfileId to 0 instead of clearing targetNumExecutorsPerResourceProfileId?
ContainerRequests can be cancelled in YarnAllocator#updateResourceRequests if we set the values to 0.
Nice catch, I will make a follow-up PR later.
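A hedged sketch of the reviewer's point, using invented helper types rather than the real AMRMClient API: if a profile's target is set to 0, a sync step that diffs the target against running plus pending executors gets a negative "missing" count and can cancel the surplus pending requests; simply clearing the entry means that profile never shows up in the diff, so its pending requests stay outstanding.

```scala
import scala.collection.mutable

object CancelPendingSketch {
  final case class PendingRequest(rpId: Int)

  val targetNumExecutorsPerResourceProfileId = mutable.HashMap[Int, Int]()
  val pendingRequests = mutable.Buffer[PendingRequest]()
  val runningExecutors = mutable.HashMap[Int, Int]().withDefaultValue(0)

  // Simplified analogue of YarnAllocator#updateResourceRequests.
  def updateResourceRequests(): Unit = {
    targetNumExecutorsPerResourceProfileId.foreach { case (rpId, target) =>
      val pending = pendingRequests.count(_.rpId == rpId)
      val missing = target - runningExecutors(rpId) - pending
      if (missing < 0) {
        // Target dropped below what is outstanding: cancel surplus pending requests.
        val toCancel = pendingRequests.filter(_.rpId == rpId).take(-missing)
        pendingRequests --= toCancel
      }
    }
  }

  def main(args: Array[String]): Unit = {
    pendingRequests ++= Seq(PendingRequest(0), PendingRequest(0))
    // Setting the target to 0 (the follow-up, #44036) lets the sync step cancel them;
    // clear() alone would have left both pending requests behind.
    targetNumExecutorsPerResourceProfileId(0) = 0
    updateResourceRequests()
    assert(pendingRequests.isEmpty)
  }
}
```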
YarnAllocator set target executor number to 0 to cancel pending allocate request when driver stop

What changes were proposed in this pull request?
YarnAllocator sets the target executor number to 0 to cancel pending allocation requests when the driver stops. For this issue we now have:
1. AllocationFailure should not be treated as exitCausedByApp when the driver is shutting down (#38622)
2. Avoid new allocation requests when sc.stop() is stuck (#43906)
3. Cancel pending allocation requests (this PR, #44036)

Why are the changes needed?
Avoid unnecessary allocation requests.

Does this PR introduce any user-facing change?
No

How was this patch tested?
MT

Was this patch authored or co-authored using generative AI tooling?
No

Closes #44036 from AngersZhuuuu/SPARK-46006-FOLLOWUP. Authored-by: Angerszhuuuu <[email protected]> Signed-off-by: Kent Yao <[email protected]> (cherry picked from commit dbc8756) Signed-off-by: Kent Yao <[email protected]>

What changes were proposed in this pull request?
We hit a case where the user calls sc.stop() after all custom code has run, but the stop gets stuck somewhere. This causes the situation below:
1. The user calls sc.stop().
2. sc.stop() gets stuck in some step, but SchedulerBackend.stop has already been called.
3. Since the YARN ApplicationMaster has not finished, it keeps calling YarnAllocator.allocateResources().
4. Since the driver endpoint has stopped, newly allocated executors fail to register.
5. This repeats until the max number of executor failures is reached.

The cause: before calling CoarseGrainedSchedulerBackend.stop(), YarnSchedulerBackend.requestTotalExecutors() is called to clean up the request info.

When YarnAllocator handles this empty resource request, since resourceProfileToTotalExecs is empty, it misses clearing targetNumExecutorsPerResourceProfileId.

Why are the changes needed?
Fixes a bug where YarnAllocator keeps requesting executors after YarnSchedulerBackend has stopped.
Does this PR introduce any user-facing change?
No
How was this patch tested?
No
Was this patch authored or co-authored using generative AI tooling?
No