[SPARK-20713][Spark Core] Convert CommitDenied to TaskKilled. #18070
lycplus wants to merge 5 commits into apache:master
|
Can one of the admins verify this patch? |
|
cc @ericl |
    res
  } catch {
    case _: CommitDeniedException =>
      throw new TaskKilledException("commit denied")
I'm not sure we want to just convert it here. I was originally thinking we would fix it up on the driver side, because the driver knows that it is explicitly a speculative task and that it killed it. Here we don't know that for sure.
For instance, you might have gotten a commit denied because the stage was aborted due to a fetch failure. That shouldn't show up as killed.
Maybe we should throw a more specific exception for the already-committed case? The executor can know about the already-committed case before it sends the statusUpdate, so we do not need to wait until the driver's statusUpdate handles the CommitDenied case.
As to "because it knows that it explicitly a speculative task and it killed it", this may not be true. Consider the case where the statusUpdate of the commit-denied task arrives earlier than that of the successful task. The driver then knows nothing, and it has to discriminate between the already-committed case and other commit-denied cases from the commit-denied statusUpdate alone. This case is possible when:
- successful task attempt 1 commit
- attempt 2 commit failed
- attempt 2's statusUpdate arrives at driver
- attempt 1's statusUpdate arrives at driver
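The ordering above can be sketched with a small model. This is only an illustration under stated assumptions: `Update`, `Succeeded`, `CommitDenied`, and `classify` are hypothetical names invented here, not Spark classes, and the model ignores everything about the driver except the order in which updates arrive.

```scala
// Hypothetical, simplified model of the driver's view; not Spark's real types.
object StatusUpdateOrdering {
  sealed trait Update { def attempt: Int }
  final case class Succeeded(attempt: Int) extends Update
  final case class CommitDenied(attempt: Int) extends Update

  // For each update, report what the driver can conclude using only
  // the updates that arrived before it.
  def classify(updates: Seq[Update]): Seq[String] = {
    val (_, out) = updates.foldLeft((false, Vector.empty[String])) {
      case ((_, acc), Succeeded(a)) =>
        (true, acc :+ s"attempt $a: success")
      case ((true, acc), CommitDenied(a)) =>
        (true, acc :+ s"attempt $a: denied, another attempt already succeeded")
      case ((false, acc), CommitDenied(a)) =>
        (false, acc :+ s"attempt $a: denied, cause unknown to driver")
    }
    out
  }
}
```

When attempt 2's update is delivered first, the model reports its cause as unknown, which is the situation described in the list.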
Doesn't a stage abort also cause tasks to show up as killed (due to "stage cancelled"?)
It seems to me that CommitDenied always implies the task is killed, in which case it might be fine to convert all CommitDeniedExceptions into TaskKilled.
Btw, there's a catch block below -- case CausedBy(cDE: CommitDeniedException) => which seems like the right place to be doing this handling.
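As a rough illustration of handling the exception in a cause-matching catch block, here is a self-contained sketch. Everything in it is an assumption made for the example: the exception classes are local stand-ins rather than Spark's, and `RootCause` is a hypothetical extractor written here to mimic the idea of matching on a wrapped cause.

```scala
// Hypothetical stand-ins; these are not Spark's classes.
object CommitDeniedHandling {
  final class CommitDeniedException(msg: String) extends Exception(msg)

  // Hypothetical extractor: follows getCause down to the innermost throwable.
  object RootCause {
    @annotation.tailrec
    def unapply(e: Throwable): Option[Throwable] =
      if (e.getCause == null) Some(e) else unapply(e.getCause)
  }

  // Report a (possibly wrapped) commit denial as killed, anything else as failed.
  def describe(t: Throwable): String = t match {
    case RootCause(_: CommitDeniedException) => "killed: commit denied"
    case other => s"failed: ${other.getMessage}"
  }
}
```

Matching on the cause rather than on the outermost exception means a commit denial that was wrapped by another layer is still recognized.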
|
Sorry, the case I was talking about is with a fetch failure. The true stage abort doesn't happen until it retries 4 times. In the meantime you can have tasks from the same stage (different attempts) running at the same time, because we currently don't kill the tasks from the aborted stage. Although, thinking about that more, having them show up as killed doesn't hurt anything; it is just making a bit bigger assumption.
@@ -459,7 +459,7 @@ private[spark] class Executor(
        case CausedBy(cDE: CommitDeniedException) =>
          val reason = cDE.toTaskFailedReason
We should probably change this to be toTaskCommitDeniedReason, since it's not failed anymore.
|
ping @tgravescs |
|
Thanks for the updates. I was testing this out by running a large job with speculative tasks, and I am still seeing the stage summary show failed tasks. It looks like it's due to this code: where it doesn't differentiate the commit denied message, so I think we need to handle it there so the stats show up properly. It would also be good to add a unit test for that, where you can look at the stageData to make sure numFailedTasks is what you expect.
|
How about letting TaskCommitDenied and TaskKilled extend the same trait (for example, TaskKilledReason)? (Ignore the last commit for now; having TaskCommitDenied extend TaskKilled directly seems bad.) This way, when accounting metrics, TaskCommitDenied and TaskKilled both contribute to taskKilled and not to TaskFailed.
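A minimal sketch of that idea follows. It assumes a deliberately simplified hierarchy: the names mirror those in the discussion, but the definitions, the `StageCounts` class, and the `record` function are hypothetical and are not taken from Spark's source.

```scala
// Hypothetical, simplified hierarchy; not Spark's actual definitions.
object TaskEndAccounting {
  sealed trait TaskEndReason
  case object Success extends TaskEndReason
  final case class ExceptionFailure(message: String) extends TaskEndReason

  // Shared marker trait: anything extending it is counted as killed.
  sealed trait TaskKilledReason extends TaskEndReason
  final case class TaskKilled(reason: String) extends TaskKilledReason
  final case class TaskCommitDenied(attempt: Int) extends TaskKilledReason

  final case class StageCounts(succeeded: Int = 0, failed: Int = 0, killed: Int = 0)

  // One match arm covers both killed variants, so neither inflates `failed`.
  def record(counts: StageCounts, reason: TaskEndReason): StageCounts = reason match {
    case Success => counts.copy(succeeded = counts.succeeded + 1)
    case _: TaskKilledReason => counts.copy(killed = counts.killed + 1)
    case _: ExceptionFailure => counts.copy(failed = counts.failed + 1)
  }
}
```

A unit test along the lines suggested earlier could fold a sequence of end reasons through `record` and check that the failed count stays at the expected value.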
|
sorry for my delay on getting back to this. |
|
ping @liyichao Will you address the latest comments from @tgravescs ? |
|
I will update the PR within a day by updating everywhere taskKilledReason is used. Sorry for the delay.
|
there is actually another pull request up that does this same thing: |
|
Oh, I did not notice that. Since @nlyu is following up, I will close this PR now.
## What changes were proposed in this pull request?

In executor, toTaskFailedReason is converted to toTaskCommitDeniedReason to avoid the inconsistency of taskState. In JobProgressListener, a case for TaskCommitDenied is added so that the stage's killed count is incremented rather than its failed count.

This pull request is picked up from apache#18070, using commit ff93ade. The case match for TaskCommitDenied is added so that the correct killed count is incremented after pull/18070.

## How was this patch tested?

Ran a normal speculative job and checked the Stage UI page, which should display no failed tasks.

Author: louis lyu <llyu@c02tk24rg8wl-lm.champ.corp.yahoo.com>

Closes apache#18819 from nlyu/SPARK-20713.
What changes were proposed in this pull request?
In executor, CommitDeniedException is converted to TaskKilledException to avoid the inconsistency of taskState, because there exists a race between when the driver kills and when the executor tries to commit.
How was this patch tested?
No tests because it is straightforward.