[SPARK-21503][UI]: Spark UI shows incorrect task status for a killed Executor Process #18707
Conversation
Added the ExecutorLostFailure case, which was previously missing; without it the default case would execute and the task would be marked as completed.
ok to test
    return
  case _: ExceptionFailure =>
    taskSummary.tasksFailed += 1
  case _: ExecutorLostFailure =>
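For orientation, here is a minimal sketch of the shape this handler takes with the new arm (the Resubmitted early return and the default arm are paraphrased from the hunks quoted in this thread; the surrounding listener code is assumed, not copied from the PR):

taskEnd.reason match {
  case Resubmitted =>
    // metrics for the first attempt were already counted; skip this event
    return
  case _: ExceptionFailure =>
    taskSummary.tasksFailed += 1
  case _: ExecutorLostFailure =>
    // newly added arm: executor killed, connection to the driver lost, etc.
    taskSummary.tasksFailed += 1
  case _ =>
    // default arm, which ExecutorLostFailure used to fall into,
    // wrongly counting the task as completed
    taskSummary.tasksComplete += 1
}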
Looks like we can use info.successful?
+1
Hi,
I have replaced most of the task-failure cases with info.successful and tested it as well. I have also written a unit test to simulate my issue. Thank you.
Test build #79850 has finished for PR 18707 at commit
Test build #79937 has finished for PR 18707 at commit
Apache Spark Pull Request - July 26, 2017
…es not create SparkContext Added a flag to check whether the user has initialized a SparkContext. If it is true, we let the ApplicationMaster unregister with the ResourceManager; otherwise we do not.
Test build #79977 has finished for PR 18707 at commit
… that does not create SparkContext" This reverts commit f454c89. "Merged another issue to this one by mistake"
Test build #79976 has finished for PR 18707 at commit
Test build #79978 has finished for PR 18707 at commit
    // first attempt of this task. This may not be 100% accurate because the first attempt
    // could have failed half-way through. The correct fix would be to keep track of the
    // metrics added by each attempt, but this is much more complicated.
    return
nit:
if (taskEnd.reason == Resubmitted) {
  return
}
I have made the required changes and verified them. Thank you.
if (info.successful) {
  taskSummary.tasksComplete += 1
} else {
  taskSummary.tasksFailed += 1
I'm not sure this keeps the previous behavior, cc @zsxwing
I went through the list of possible task failure reasons, and I found info.successful to be updated for all of them except the Resubmitted case. If you think any other reason has been overlooked by me, do let me know.
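Consistent with the hunks quoted later in this thread, the resulting handler is a sketch along these lines (Resubmitted special-cased first, everything else keyed off info.successful):

if (taskEnd.reason == Resubmitted) {
  return // first-attempt metrics were already counted
}
if (info.successful) {
  taskSummary.tasksComplete += 1
} else {
  taskSummary.tasksFailed += 1
}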
Spark - July 28, 2017
[SPARK-21503] Making changes as per comments: removed the match-case statement and replaced it with an if clause.
Test build #80026 has finished for PR 18707 at commit
Test build #80027 has finished for PR 18707 at commit
  jobListener.onTaskEnd(
    SparkListenerTaskEnd(0, 0, "result", Success, taskInfo, taskMetrics))
}
(3 to 4).foreach {
I don't see a test case which uses this method to test that the task is correctly displayed as failed. In that case, is this test suite modification needed?
You are right. I did try to simulate a similar unit test for the Executors tab UI, but it requires creating a SparkUI object, which cannot be accessed from StagePageSuite. Hence, I am reverting my unit test for the time being.
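For the record, a rough sketch of the kind of check the reverted test was after. The listener setup is elided because, as noted above, it needs a SparkUI; the event classes are real Spark types, but the exact constructor arguments and the executorToTaskSummary lookup here are assumptions about the 2.2-era API, not code from the PR:

// Hypothetical test body: feed a task-end event carrying ExecutorLostFailure
// and expect the failed counter, not the completed one, to move.
val taskInfo = new TaskInfo(0L, 0, 0, 0L, "exec-1", "host-1",
  TaskLocality.PROCESS_LOCAL, false)
// (mark the task finished and failed here, as the scheduler would)
listener.onTaskEnd(SparkListenerTaskEnd(0, 0, "result",
  ExecutorLostFailure("exec-1", exitCausedByApp = true, Some("killed")),
  taskInfo, null))
assert(listener.executorToTaskSummary("exec-1").tasksFailed === 1)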
Spark - August 1, 2017
[SPARK-21503]: Reverting Unit Test Code - Not needed.
Test build #80119 has finished for PR 18707 at commit
Test build #80122 has finished for PR 18707 at commit
LGTM, cc @zsxwing @cloud-fan
if (taskEnd.reason == Resubmitted) {
  return
}
if (info.successful) {
what's the relationship between info.successful and taskEnd.reason?
def successful: Boolean = finished && !failed && !killed
So if the task state shows finished and it has neither failed nor been killed, then info.successful is true. There can be multiple reasons for a failed or killed task, and taskEnd.reason lists them; however, if the task state is SUCCESS, so is the reason. Thus info.successful is true in the SUCCESS case and false in all the others.
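Illustratively, that mapping looks like this (a sketch, not code from the PR; Resubmitted is listed first because it is itself a TaskFailedReason):

taskEnd.reason match {
  case Success             => // info.successful is true: finished, neither failed nor killed
  case Resubmitted         => // special-cased above with an early return
  case _: TaskFailedReason => // info.successful is false: failed or killed was set
}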
    // could have failed half-way through. The correct fix would be to keep track of the
    // metrics added by each attempt, but this is much more complicated.
    return
  case _: ExceptionFailure =>
can we fix the problem by changing this line to case _: TaskFailedReason?
So, I added the info.successful check at the suggestion of @zsxwing and @jiangxb1987. I feel this covers all the failed and killed cases that a TaskFailedReason match might have overlooked. However, if you feel otherwise, please let me know.
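For comparison, a sketch of the alternative being floated here, widening the failed arm instead of consulting info.successful (not what was merged):

taskEnd.reason match {
  case Resubmitted =>
    return
  case _: TaskFailedReason =>
    // covers ExceptionFailure, ExecutorLostFailure, TaskKilled, and the rest
    taskSummary.tasksFailed += 1
  case _ =>
    taskSummary.tasksComplete += 1
}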
change LGTM. @cloud-fan I don't see any issues with the current approach; it was suggested earlier, but I think your suggestion would work also. Functionally I'm not seeing any difference. Did I miss something, or are you OK with this way?
…xecutor Process

The executor tab on the Spark UI shows a task as completed when the executor process running that task is killed using the kill command. Added the ExecutorLostFailure case, which was previously missing; without it the default case would execute and the task would be marked as completed. This case covers all situations where the executor's connection to the Spark driver was lost: killing the executor process, a dropped network connection, etc.

## How was this patch tested?

Manually tested the fix by observing the UI change before and after.

Before:
<img width="1398" alt="screen shot-before" src="https://user-images.githubusercontent.com/22228190/28482929-571c9cea-6e30-11e7-93dd-728de5cdea95.png">

After:
<img width="1385" alt="screen shot-after" src="https://user-images.githubusercontent.com/22228190/28482964-8649f5ee-6e30-11e7-91bd-2eb2089c61cc.png">

Please review http://spark.apache.org/contributing.html before opening a pull request.

Author: pgandhi <[email protected]>
Author: pgandhi999 <[email protected]>

Closes #18707 from pgandhi999/master.

(cherry picked from commit f016f5c)
Signed-off-by: Wenchen Fan <[email protected]>
I was just asking about different fixes; I'm OK with the current fix. Thanks, merging to master/2.2!
The executor tab on the Spark UI shows a task as completed when the executor process running that task is killed using the kill command.
Added the ExecutorLostFailure case, which was previously missing; without it the default case would execute and the task would be marked as completed. This case covers all situations where the executor's connection to the Spark driver was lost: killing the executor process, a dropped network connection, etc.
How was this patch tested?
Manually tested the fix by observing the UI change before and after.
Before: 
After: 
Please review http://spark.apache.org/contributing.html before opening a pull request.