
Conversation

@AngersZhuuuu
Contributor

@AngersZhuuuu AngersZhuuuu commented Nov 20, 2023

What changes were proposed in this pull request?

We hit a case where a user calls sc.stop() after all of their code has finished, but the stop gets stuck somewhere.

This causes the following situation:

  1. The user calls sc.stop()
  2. sc.stop() gets stuck partway through, but SchedulerBackend.stop has already been called
  3. Since the YARN ApplicationMaster hasn't finished, it keeps calling YarnAllocator.allocateResources()
  4. Since the driver endpoint has stopped, newly allocated executors fail to register
  5. This repeats until "Max number of executor failures" is triggered
  6. Caused by:

Before it calls CoarseGrainedSchedulerBackend.stop(), YarnSchedulerBackend calls requestTotalExecutors() with an empty request to clean up the request info.

When YarnAllocator handles this empty resource request, since resourceTotalExecutorsWithPreferedLocalities is empty, it misses cleaning targetNumExecutorsPerResourceProfileId.
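To make the mechanism concrete without the screenshots, here is a minimal, self-contained sketch of the faulty path and the fix. It is a simplified model rather than the real YarnAllocator API: the object name, method signature, and Int-keyed maps are assumptions made for illustration; only the targetNumExecutorsPerResourceProfileId bookkeeping and the empty-request handling mirror the behavior described above.

```scala
import scala.collection.mutable

// Simplified, illustrative stand-in for YarnAllocator's bookkeeping (not the real API).
object AllocatorTargetSketch {
  // Target executor count per ResourceProfile id, consulted by allocateResources().
  private val targetNumExecutorsPerResourceProfileId = mutable.HashMap[Int, Int](0 -> 8)

  // Called when the scheduler backend forwards a new total-executor request.
  // Before the fix, an empty request map meant the loop below never ran, so the
  // stale targets survived and the AM kept asking YARN for containers that could
  // never register with the stopped driver.
  def requestTotalExecutors(resourceProfileToTotalExecs: Map[Int, Int]): Boolean = {
    if (resourceProfileToTotalExecs.isEmpty) {
      // The fix: an empty request (sent during shutdown) clears all targets,
      // so no new containers are requested afterwards.
      targetNumExecutorsPerResourceProfileId.clear()
      true
    } else {
      resourceProfileToTotalExecs.map { case (rpId, numExecs) =>
        if (targetNumExecutorsPerResourceProfileId.get(rpId).contains(numExecs)) {
          false
        } else {
          targetNumExecutorsPerResourceProfileId(rpId) = numExecs
          true
        }
      }.exists(identity)
    }
  }

  def currentTargets: Map[Int, Int] = targetNumExecutorsPerResourceProfileId.toMap
}
```

In this model, the shutdown path that sends an empty request (as YarnSchedulerBackend does before stopping) leaves no targets behind, so the allocator no longer asks YARN for containers that can never register.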

Why are the changes needed?

Fix bug

Does this PR introduce any user-facing change?

No

How was this patch tested?

No

Was this patch authored or co-authored using generative AI tooling?

No

…ourceProfileId after YarnSchedulerBackend call stop
@github-actions github-actions bot added the YARN label Nov 20, 2023
@AngersZhuuuu
Contributor Author

AngersZhuuuu commented Nov 20, 2023

This is the root cause of #38622. cc @tgravescs @pan3793

With only PR #38622, YarnAllocator still allocates many containers that fail, until sc.stop() finishes.

Member

@dongjoon-hyun dongjoon-hyun left a comment


  • Do you think we can have a test for the reported case?
  • Is this a regression at Spark 3.4.0?

@pan3793
Member

pan3793 commented Nov 20, 2023

@AngersZhuuuu thanks for digging into this. I think this is a good fix, and SPARK-39601 may still be a valid supplement for the same kind of issue: executors launch after the driver is shut down, and then errors occur.

  • SPARK-39601 aims to suppress the executor errors in the YARN allocator after the driver is shut down, which fixes the issue on the consumer side
  • this PR aims to cancel executor allocation after the driver is shut down, which fixes the issue on the producer side and could reduce pressure on the RM. But I'm not sure whether such cancellation can kill all executors without error reports, including the pending executors and the launching executors. If yes, SPARK-39601 seems redundant; if not, SPARK-39601 is valid too.

Contributor

@tgravescs tgravescs left a comment


The change makes sense based on the previous logic, from before the stage-level scheduling changes.

We hit a case where a user calls sc.stop() after all of their code has finished, but the stop gets stuck somewhere.

What does this mean? Did the user create a daemon thread pool or something else that blocks sc.stop(), or why was it stuck?

@AngersZhuuuu
Contributor Author

What does this mean? Did the user create a daemon thread pool or something else that blocks sc.stop(), or why was it stuck?

Why it got stuck is still being investigated, since there is no log and we no longer have the stuck process to inspect. I have added some code in our prod environment to print the stack trace when this situation happens again. If I find the reason, I'll update here.

@AngersZhuuuu
Contributor Author

@AngersZhuuuu thanks for digging into this. I think this is a good fix, and SPARK-39601 may still be a valid supplement for the same kind of issue: executors launch after the driver is shut down, and then errors occur.

  • SPARK-39601 aims to suppress the executor errors in the YARN allocator after the driver is shut down, which fixes the issue on the consumer side
  • this PR aims to cancel executor allocation after the driver is shut down, which fixes the issue on the producer side and could reduce pressure on the RM. But I'm not sure whether such cancellation can kill all executors without error reports, including the pending executors and the launching executors. If yes, SPARK-39601 seems redundant; if not, SPARK-39601 is valid too.

This PR aims to stop allocating new executors; https://issues.apache.org/jira/browse/SPARK-39601 can prevent pending allocation requests from failing the app with "Max number of executor failures" when the minExecutor value is small.

@AngersZhuuuu
Contributor Author

In our prod environment we saw ContextCleaner get stuck while waiting for a reply. In that case the cleaner stops before the DAGScheduler, but in this PR's case the DAGScheduler had already been stopped. cc @pan3793 @tgravescs

[Screenshot 2023-11-21, 11:01:55 AM]

@yaooqinn yaooqinn closed this in 06635e2 Nov 22, 2023
yaooqinn pushed a commit that referenced this pull request Nov 22, 2023
yaooqinn pushed a commit that referenced this pull request Nov 22, 2023
yaooqinn pushed a commit that referenced this pull request Nov 22, 2023
@yaooqinn
Member

yaooqinn commented Nov 22, 2023

Thanks @AngersZhuuuu for the fix. Thanks @dongjoon-hyun @tgravescs and @pan3793 for the review.

Merged to master, 3.5.1, 3.4.2, and 3.3.4

Review comment on these lines of the diff in YarnAllocator:

    } else {
      false
    if (resourceProfileToTotalExecs.isEmpty) {
      targetNumExecutorsPerResourceProfileId.clear()
Contributor


@AngersZhuuuu Would it be better to set the values in targetNumExecutorsPerResourceProfileId to 0 instead of clearing targetNumExecutorsPerResourceProfileId?
ContainerRequests can be cancelled in YarnAllocator#updateResourceRequests if we set the values to 0.

Contributor Author


@AngersZhuuuu Would it be better to set the values in targetNumExecutorsPerResourceProfileId to 0 instead of clearing targetNumExecutorsPerResourceProfileId? ContainerRequests can be cancelled in YarnAllocator#updateResourceRequests if we set the values to 0.

Nice catch, will make a follow-up PR later.
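To illustrate the reviewer's point, here is a hedged sketch; it is simplified and hypothetical, not the real YarnAllocator code, and the updateResourceRequests body below is invented for the example. Clearing the map only stops future requests, while setting each target to 0 keeps the ResourceProfile keys, so a later updateResourceRequests pass can see pending > target and cancel the outstanding ContainerRequests.

```scala
import scala.collection.mutable

object CancelPendingSketch {
  // Hypothetical, simplified state: desired executors and already-pending
  // container requests, both keyed by ResourceProfile id.
  val targetNumExecutorsPerResourceProfileId = mutable.HashMap(0 -> 8, 1 -> 4)
  val pendingContainerRequests = mutable.HashMap(0 -> 8, 1 -> 4)

  // Simplified stand-in for YarnAllocator#updateResourceRequests: compare the
  // target against what is already pending and cancel the surplus.
  def updateResourceRequests(): Unit = {
    targetNumExecutorsPerResourceProfileId.foreach { case (rpId, target) =>
      val pending = pendingContainerRequests.getOrElse(rpId, 0)
      if (pending > target) {
        println(s"cancelling ${pending - target} pending container requests for profile $rpId")
        pendingContainerRequests(rpId) = target
      }
    }
  }

  def main(args: Array[String]): Unit = {
    // Follow-up approach: set every target to 0 but keep the keys, so the
    // next updateResourceRequests pass cancels the pending requests.
    targetNumExecutorsPerResourceProfileId.keys.toList.foreach { rpId =>
      targetNumExecutorsPerResourceProfileId(rpId) = 0
    }
    updateResourceRequests()

    // With clear() instead, the keys would be gone, the loop above would not
    // visit those profiles, and the pending requests would stay queued at the RM.
  }
}
```

This is the approach taken in the follow-up #44036 below.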

yaooqinn pushed a commit that referenced this pull request Nov 28, 2023
…r to 0 to cancel pending allocate request when driver stop

### What changes were proposed in this pull request?
YarnAllocator sets the target executor number to 0 to cancel pending allocation requests when the driver stops.
Now for this issue we do:

1. AllocationFailure should not be treated as exitCausedByApp when the driver is shutting down #38622
2. Avoid new allocation requests when sc.stop() gets stuck #43906
3. Cancel pending allocation requests, this PR #44036

### Why are the changes needed?
Avoid unnecessary allocation requests.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
MT

### Was this patch authored or co-authored using generative AI tooling?
No

Closes #44036 from AngersZhuuuu/SPARK-46006-FOLLOWUP.

Authored-by: Angerszhuuuu <[email protected]>
Signed-off-by: Kent Yao <[email protected]>
yaooqinn pushed a commit that referenced this pull request Nov 28, 2023
yaooqinn pushed a commit that referenced this pull request Nov 28, 2023
yaooqinn pushed a commit that referenced this pull request Nov 28, 2023
szehon-ho pushed a commit to szehon-ho/spark that referenced this pull request Feb 7, 2024
szehon-ho pushed a commit to szehon-ho/spark that referenced this pull request Feb 7, 2024