
Conversation

@AngersZhuuuu
Contributor

@AngersZhuuuu AngersZhuuuu commented Nov 20, 2023

What changes were proposed in this pull request?

We hit a case where a user calls sc.stop() after all of their code has finished, but the stop gets stuck somewhere.

This causes the following situation:

  1. The user calls sc.stop()
  2. sc.stop() gets stuck partway through, but SchedulerBackend.stop has already been called
  3. Since the YARN ApplicationMaster hasn't finished, it keeps calling YarnAllocator.allocateResources()
  4. Since the driver endpoint has stopped, newly allocated executors fail to register
  5. This repeats until "Max number of executor failures" is triggered
  6. Caused by:

Before it calls CoarseGrainedSchedulerBackend.stop(), YarnSchedulerBackend calls requestTotalExecutors() with an empty request to clean up the request info.

When YarnAllocator handles this empty resource request, since resourceTotalExecutorsWithPreferedLocalities is empty, it misses cleaning targetNumExecutorsPerResourceProfileId.
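To make the mechanism concrete without the screenshots, here is a minimal, self-contained sketch of the faulty path and the fix. It is a simplified model rather than the real YarnAllocator API: the object name, method signature, and Int-keyed maps are assumptions made for illustration; only the targetNumExecutorsPerResourceProfileId bookkeeping and the empty-request handling mirror the behavior described above.

```scala
import scala.collection.mutable

// Simplified, illustrative stand-in for YarnAllocator's bookkeeping (not the real API).
object AllocatorTargetSketch {
  // Target executor count per ResourceProfile id, consulted by allocateResources().
  private val targetNumExecutorsPerResourceProfileId = mutable.HashMap[Int, Int](0 -> 8)

  // Called when the scheduler backend forwards a new total-executor request.
  // Before the fix, an empty request map meant the loop below never ran, so the
  // stale targets survived and the AM kept asking YARN for containers that could
  // never register with the stopped driver.
  def requestTotalExecutors(resourceProfileToTotalExecs: Map[Int, Int]): Boolean = {
    if (resourceProfileToTotalExecs.isEmpty) {
      // The fix: an empty request (sent during shutdown) clears all targets,
      // so no new containers are requested afterwards.
      targetNumExecutorsPerResourceProfileId.clear()
      true
    } else {
      resourceProfileToTotalExecs.map { case (rpId, numExecs) =>
        if (targetNumExecutorsPerResourceProfileId.get(rpId).contains(numExecs)) {
          false
        } else {
          targetNumExecutorsPerResourceProfileId(rpId) = numExecs
          true
        }
      }.exists(identity)
    }
  }

  def currentTargets: Map[Int, Int] = targetNumExecutorsPerResourceProfileId.toMap
}
```

In this model, the shutdown path that sends an empty request (as YarnSchedulerBackend does before stopping) leaves no targets behind, so the allocator no longer asks YARN for containers that can never register.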

Why are the changes needed?

Fix bug

Does this PR introduce any user-facing change?

No

How was this patch tested?

No

Was this patch authored or co-authored using generative AI tooling?

No

…ourceProfileId after YarnSchedulerBackend call stop
@github-actions github-actions bot added the YARN label Nov 20, 2023
@AngersZhuuuu
Contributor Author

AngersZhuuuu commented Nov 20, 2023

This is the root cause of #38622. cc @tgravescs @pan3793

With only PR #38622, YarnAllocator still allocates many containers that fail, until sc.stop() finishes.

Member

@dongjoon-hyun dongjoon-hyun left a comment


  • Do you think we can have a test for the reported case?
  • Is this a regression at Spark 3.4.0?

@pan3793
Member

pan3793 commented Nov 20, 2023

@AngersZhuuuu thanks for digging into this. I think this is a good fix, and SPARK-39601 may still be a valid supplement for the same kind of issue: executors launch after the driver is shut down, and then errors occur.

  • SPARK-39601 aims to suppress the executor errors in the YARN allocator after the driver is shut down, which fixes the issue on the consumer side
  • this PR aims to cancel executor allocation after the driver is shut down, which fixes the issue on the producer side and could reduce pressure on the RM. But I'm not sure whether such cancellation can kill all executors without error reports, including the pending executors and the launching executors. If yes, SPARK-39601 seems redundant; if not, SPARK-39601 is valid too.

Contributor

@tgravescs tgravescs left a comment


The change makes sense based on the previous logic, from before the stage-level scheduling changes.

We hit a case where a user calls sc.stop() after all of their code has finished, but the stop gets stuck somewhere.

What does this mean? Did the user create a daemon thread pool or something else that blocks sc.stop(), or why was it stuck?

@AngersZhuuuu
Contributor Author

What does this mean? Did the user create a daemon thread pool or something else that blocks sc.stop(), or why was it stuck?

Why it got stuck is still being investigated, since there is no log and we no longer have the stuck process to inspect. I have added some code in our prod environment to print the stack trace when this situation happens again. If I find the reason, I'll update here.

@AngersZhuuuu
Contributor Author

@AngersZhuuuu thanks for digging into this. I think this is a good fix, and SPARK-39601 may still be a valid supplement for the same kind of issue: executors launch after the driver is shut down, and then errors occur.

  • SPARK-39601 aims to suppress the executor errors in the YARN allocator after the driver is shut down, which fixes the issue on the consumer side
  • this PR aims to cancel executor allocation after the driver is shut down, which fixes the issue on the producer side and could reduce pressure on the RM. But I'm not sure whether such cancellation can kill all executors without error reports, including the pending executors and the launching executors. If yes, SPARK-39601 seems redundant; if not, SPARK-39601 is valid too.

This PR aims to stop allocating new executors; https://issues.apache.org/jira/browse/SPARK-39601 can prevent pending allocation requests from failing the app with "Max number of executor failures" when the minExecutor value is small.

@AngersZhuuuu
Contributor Author

In our prod environment we saw ContextCleaner get stuck while waiting for a reply. In that case the cleaner stops before the DAGScheduler, but in this PR's case the DAGScheduler had already been stopped. cc @pan3793 @tgravescs

[Screenshot 2023-11-21, 11:01:55 AM]

@yaooqinn yaooqinn closed this in 06635e2 Nov 22, 2023
yaooqinn pushed a commit that referenced this pull request Nov 22, 2023
yaooqinn pushed a commit that referenced this pull request Nov 22, 2023
yaooqinn pushed a commit that referenced this pull request Nov 22, 2023
@yaooqinn
Member

yaooqinn commented Nov 22, 2023

Thanks @AngersZhuuuu for the fix. Thanks @dongjoon-hyun @tgravescs and @pan3793 for the review.

Merged to master, 3.5.1, 3.4.2, and 3.3.4

Review comment on these lines of the diff in YarnAllocator:

    } else {
      false
    if (resourceProfileToTotalExecs.isEmpty) {
      targetNumExecutorsPerResourceProfileId.clear()
Contributor


@AngersZhuuuu Would it be better to set the values in targetNumExecutorsPerResourceProfileId to 0 instead of clearing targetNumExecutorsPerResourceProfileId?
ContainerRequests can be cancelled in YarnAllocator#updateResourceRequests if we set the values to 0.

Contributor Author


@AngersZhuuuu Would it be better to set the values in targetNumExecutorsPerResourceProfileId to 0 instead of clearing targetNumExecutorsPerResourceProfileId? ContainerRequests can be cancelled in YarnAllocator#updateResourceRequests if we set the values to 0.

Nice catch, will make a follow-up PR later.
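To illustrate the reviewer's point, here is a hedged sketch; it is simplified and hypothetical, not the real YarnAllocator code, and the updateResourceRequests body below is invented for the example. Clearing the map only stops future requests, while setting each target to 0 keeps the ResourceProfile keys, so a later updateResourceRequests pass can see pending > target and cancel the outstanding ContainerRequests.

```scala
import scala.collection.mutable

object CancelPendingSketch {
  // Hypothetical, simplified state: desired executors and already-pending
  // container requests, both keyed by ResourceProfile id.
  val targetNumExecutorsPerResourceProfileId = mutable.HashMap(0 -> 8, 1 -> 4)
  val pendingContainerRequests = mutable.HashMap(0 -> 8, 1 -> 4)

  // Simplified stand-in for YarnAllocator#updateResourceRequests: compare the
  // target against what is already pending and cancel the surplus.
  def updateResourceRequests(): Unit = {
    targetNumExecutorsPerResourceProfileId.foreach { case (rpId, target) =>
      val pending = pendingContainerRequests.getOrElse(rpId, 0)
      if (pending > target) {
        println(s"cancelling ${pending - target} pending container requests for profile $rpId")
        pendingContainerRequests(rpId) = target
      }
    }
  }

  def main(args: Array[String]): Unit = {
    // Follow-up approach: set every target to 0 but keep the keys, so the
    // next updateResourceRequests pass cancels the pending requests.
    targetNumExecutorsPerResourceProfileId.keys.toList.foreach { rpId =>
      targetNumExecutorsPerResourceProfileId(rpId) = 0
    }
    updateResourceRequests()

    // With clear() instead, the keys would be gone, the loop above would not
    // visit those profiles, and the pending requests would stay queued at the RM.
  }
}
```

This is the approach taken in the follow-up #44036 below.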

yaooqinn pushed a commit that referenced this pull request Nov 28, 2023
…r to 0 to cancel pending allocate request when driver stop

### What changes were proposed in this pull request?
YarnAllocator sets the target executor number to 0 to cancel pending allocation requests when the driver stops.
Now for this issue we do:

1. AllocationFailure should not be treated as exitCausedByApp when the driver is shutting down #38622
2. Avoid new allocation requests when sc.stop() gets stuck #43906
3. Cancel pending allocation requests, this PR #44036

### Why are the changes needed?
Avoid unnecessary allocation requests.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
MT

### Was this patch authored or co-authored using generative AI tooling?
No

Closes #44036 from AngersZhuuuu/SPARK-46006-FOLLOWUP.

Authored-by: Angerszhuuuu <[email protected]>
Signed-off-by: Kent Yao <[email protected]>
yaooqinn pushed a commit that referenced this pull request Nov 28, 2023
yaooqinn pushed a commit that referenced this pull request Nov 28, 2023
yaooqinn pushed a commit that referenced this pull request Nov 28, 2023
szehon-ho pushed a commit to szehon-ho/spark that referenced this pull request Feb 7, 2024
szehon-ho pushed a commit to szehon-ho/spark that referenced this pull request Feb 7, 2024