Conversation

zsxwing (Member) commented on Sep 30, 2015

This PR implements the following features for both master and branch-1.5.

  1. Display the failed output op count in the batch list
  2. Display the failure reason of output op in the batch detail page

Screenshots: [screenshot 1], [screenshot 2]

There are still two remaining problems in the UI.

  1. If an output operation doesn't run any Spark job, we cannot get its duration, since the duration is currently the sum of all its Spark jobs' durations (see the sketch below).
  2. If an output operation doesn't run any Spark job, we cannot get its description, since the description is currently taken from the latest job's call site.

We need to add a new StreamingListenerEvent about output operations to fix them, so I'd like to fix them only for master in another PR.
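For illustration, a minimal sketch of the duration limitation, assuming the batch page only has the durations of the Spark jobs an output op started (the names here are illustrative, not the actual Spark code):

// Illustrative sketch only: the output op duration is derived by summing
// the durations of the Spark jobs it started, so an output op that starts
// no Spark job has no duration to show.
def outputOpDuration(sparkJobDurationsMs: Seq[Long]): Option[Long] =
  if (sparkJobDurationsMs.isEmpty) None else Some(sparkJobDurationsMs.sum)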

zsxwing (Member Author) commented on Sep 30, 2015

/cc @tdas

SparkQA commented on Sep 30, 2015

Test build #43133 has finished for PR 8950 at commit f940ccf.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

Contributor:

If we are showing "Succeeded" for successful output ops, then we should also show "Failed" for failed ones. In fact, there should be both "Failed" and "Failed due to Spark job error". In the first case, the output op error would be shown below "Failed".

Member Author:

> In fact, there should be both "Failed" and "Failed due to Spark job error". In the first case, the output op error would be shown below "Failed".

I don't think we can currently distinguish these two errors: both are caught in org.apache.spark.streaming.scheduler.Job.run.
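For context, a simplified sketch of why the two cases collapse into one: the output op body runs under a single Try in the job runner, so an exception from a failed Spark job and an exception thrown directly by user code both surface as the same single failure (a rough illustration, not the exact Spark source):

import scala.util.Try

// Simplified sketch: the whole output op body runs under one Try, so a
// SparkException from a failed job and a user-thrown exception both end up
// as the same recorded failure, with no way to tell them apart here.
class OutputOpJob(body: () => Unit) {
  private var result: Try[Unit] = Try(())
  def run(): Unit = { result = Try(body()) }
  def failureReason: Option[String] = result.failed.toOption.map(_.toString)
}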

Contributor:

Can't we distinguish based on whether there is a Spark job error in any of the jobs associated with the output op?

Member Author:

I don't think so. For example, the user may write code like the following:

stream.foreachRDD { rdd =>
  try {
    rdd.foreach(...)
  } catch {
    case e: Exception => // the user handles or swallows the Spark job failure here
  }
}

In this case, if a Spark job (rdd.foreach) fails, we cannot say the output op fails.

Contributor:

Here is a possible logic (see the code sketch below):

If there is no failure reason, then "Succeeded"
Else if the failure reason contains "SparkException", then "Failed due to Spark job error"
Else "Failed"

This should work fine for most cases, where the user is not doing fancy things like catching exceptions themselves and ignoring or rethrowing them, shouldn't it?

Consider your example. If the user catches the Spark job exception (most probably a SparkException) and rethrows it, the above logic should identify it as a Spark job error and say "Failed due to Spark job error". On the other hand, if the user catches and ignores the exception, then the failure reason will be empty and the output op will be marked as "Succeeded" (even though the Spark job error will not be empty, which is okay).
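A minimal sketch of that classification, assuming the output op's failure reason is exposed as an optional string (the function and parameter names are illustrative, not the actual Spark code):

// Sketch of the proposed status logic: no failure reason means "Succeeded",
// a reason mentioning SparkException is treated as a Spark job error,
// and anything else is a plain output op failure.
def outputOpStatus(failureReason: Option[String]): String = failureReason match {
  case None => "Succeeded"
  case Some(reason) if reason.contains("SparkException") => "Failed due to Spark job error"
  case Some(_) => "Failed"
}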

Member Author:

Sounds great. I will update the logic.

zsxwing (Member Author) commented on Oct 6, 2015

New screenshot: [screenshot 3]

SparkQA commented on Oct 6, 2015

Test build #43270 has finished for PR 8950 at commit ca68ac8.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

tdas (Contributor) commented on Oct 6, 2015

This looks cool! Also I like that you used "details" for that.

  1. In the case of "Failed", unless the details are opened, there is no indication of failure. So it might be better to show "Failed due to error: $exceptionMessage", and the full stack trace in the details (see the sketch after this list).
  2. In the case of "Failed due to Spark job error", does it make sense to have the error show up in both places?
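A minimal sketch of the presentation suggested in point 1, splitting a short status label from the full stack trace shown under "details" (the names are illustrative only, not the actual Spark UI code):

import java.io.{PrintWriter, StringWriter}

// Illustrative only: the short label goes in the status column, while the
// full stack trace is what gets rendered inside the collapsible "details".
def failureLabelAndDetails(e: Throwable): (String, String) = {
  val sw = new StringWriter()
  e.printStackTrace(new PrintWriter(sw, true))
  (s"Failed due to error: ${e.getMessage}", sw.toString)
}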

zsxwing (Member Author) commented on Oct 6, 2015

> In the case of "Failed", unless the details are opened, there is no indication of failure. So it might be better to show "Failed due to error: $exceptionMessage", and the full stack trace in the details.

Added it. [screenshot 4]

> In the case of "Failed due to Spark job error", does it make sense to have the error show up in both places?

This seems unnecessary; the job error column already shows it.

SparkQA commented on Oct 6, 2015

Test build #43278 has finished for PR 8950 at commit d7c42c9.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

tdas (Contributor) commented on Oct 6, 2015

LGTM. Merging this to master and 1.5. Thanks @zsxwing

asfgit pushed a commit that referenced this pull request Oct 6, 2015
This PR implements the following features for both `master` and `branch-1.5`.
1. Display the failed output op count in the batch list
2. Display the failure reason of output op in the batch detail page

Screenshots:
<img width="1356" alt="1" src="https://cloud.githubusercontent.com/assets/1000778/10198387/5b2b97ec-67ce-11e5-81c2-f818b9d2f3ad.png">
<img width="1356" alt="2" src="https://cloud.githubusercontent.com/assets/1000778/10198388/5b76ac14-67ce-11e5-8c8b-de2683c5b485.png">

There are still two remaining problems in the UI.
1. If an output operation doesn't run any Spark job, we cannot get its duration since it's currently the sum of all jobs' durations.
2. If an output operation doesn't run any Spark job, we cannot get the description since it's taken from the latest job's call site.

We need to add a new `StreamingListenerEvent` about output operations to fix them, so I'd like to fix them only for `master` in another PR.

Author: zsxwing <[email protected]>

Closes #8950 from zsxwing/batch-failure.

(cherry picked from commit ffe6831)
Signed-off-by: Tathagata Das <[email protected]>
asfgit closed this in ffe6831 on Oct 6, 2015
zsxwing deleted the batch-failure branch on October 6, 2015 at 23:53