[SPARK-10885][Streaming]Display the failed output op in Streaming UI #8950
Conversation
1. Display the failed output op count in the batch list
2. Display the failure reason of output op in the batch detail page
/cc @tdas
Test build #43133 has finished for PR 8950 at commit
If we are showing "Succeeded" for successful output ops, then we should also show "Failed" for failed ones. In fact, there should be "Failed" and "Failed due to Spark job error". In the first case, the output op error will be shown below "Failed".
> In fact, there should be "Failed" and "Failed due to Spark job error". In the first case, the output op error will be shown below "Failed".

I don't think we can currently distinguish these two errors; both are caught in `org.apache.spark.streaming.scheduler.Job.run`.
Can't we distinguish based on whether there is a Spark job error in any of the jobs associated with the output op?
I don't think so. E.g., the user may write the following code:

```scala
stream.foreachRDD { rdd =>
  try {
    rdd.foreach(...)
  } catch {
    case e: Exception => // handle or swallow the error
  }
}
```

In this case, if a Spark job (`rdd.foreach`) fails, we cannot say the output op fails.
Here is a possible logic:
- If there is no failure reason, then "Succeeded".
- Else, if the failure reason contains "SparkException", then "Failed due to Spark job error".
- Else, "Failed".
This should work fine for most cases, where the user is not doing fancy things like catching exceptions themselves and ignoring or rethrowing them. Isn't it?

Consider your example. If the user catches the Spark job exception (most probably a SparkException) and rethrows it, the above logic will identify it as a Spark job error and say "Failed due to Spark job error". On the other hand, if the user catches and ignores the exception, then the failure reason will be empty and the output op will be marked as "Succeeded" (even though the Spark job error will not be empty, which is okay).
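The proposed classification can be sketched as a small helper. This is only an illustration of the rule discussed above, not Spark's actual implementation; the function and parameter names are assumptions.

```scala
// Sketch of the proposed output op status classification (hypothetical names).
// `failureReason` is the optional failure message recorded for the output op.
def outputOpStatus(failureReason: Option[String]): String = failureReason match {
  case None => "Succeeded"
  case Some(reason) if reason.contains("SparkException") =>
    "Failed due to Spark job error"
  case Some(_) => "Failed"
}
```

As discussed, a swallowed exception never produces a failure reason, so such an op would still classify as "Succeeded" under this rule.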
Sounds great. I will update the logic.
Test build #43270 has finished for PR 8950 at commit
This looks cool! Also I like that you used "details" for that.
This seems unnecessary. The job error column has already shown it.
Test build #43278 has finished for PR 8950 at commit
LGTM. Merging this to master and 1.5. Thanks @zsxwing
This PR implements the following features for both `master` and `branch-1.5`:
1. Display the failed output op count in the batch list
2. Display the failure reason of output op in the batch detail page

Screenshots:
<img width="1356" alt="1" src="https://cloud.githubusercontent.com/assets/1000778/10198387/5b2b97ec-67ce-11e5-81c2-f818b9d2f3ad.png">
<img width="1356" alt="2" src="https://cloud.githubusercontent.com/assets/1000778/10198388/5b76ac14-67ce-11e5-8c8b-de2683c5b485.png">

There are still two remaining problems in the UI:
1. If an output operation doesn't run any Spark job, we cannot get its duration, since it is currently computed as the sum of all jobs' durations.
2. If an output operation doesn't run any Spark job, we cannot get its description, since it is taken from the latest job's call site.

We need to add a new `StreamingListenerEvent` about output operations to fix them, so I'd like to fix them only for `master` in another PR.

Author: zsxwing <[email protected]>

Closes #8950 from zsxwing/batch-failure.

(cherry picked from commit ffe6831)
Signed-off-by: Tathagata Das <[email protected]>