
Conversation

@squito
Contributor

@squito squito commented Apr 17, 2018

## What changes were proposed in this pull request?

`SparkContext` submits a map stage to the `DAGScheduler` via `submitMapStage`, but `markMapStageJobAsFinished` is called in only two places (https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/scheduler/DAGScheduler.scala#L933 and https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/scheduler/DAGScheduler.scala#L1314).

But consider the following scenario:
1. stage0 and stage1 are both `ShuffleMapStage`s, and stage1 depends on stage0;
2. we submit stage1 via `submitMapStage`;
3. while stage1 is running, a `FetchFailed` occurs, so stage0 and stage1 are resubmitted as stage0_1 and stage1_1;
4. while stage0_1 is running, speculative tasks from the old stage1 report success, but stage1 is no longer in `runningStages`; so even though every partition of stage1 (including those finished by the speculative tasks) has succeeded, stage1's job listener is never called;
5. stage0_1 finishes and stage1_1 starts; `submitMissingTasks` finds no missing tasks, but the current code never triggers the job listener.

We should call the map-stage job listener in step 5.
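
To make steps 4 and 5 concrete, below is a minimal, self-contained Scala model of the gap. All names in it (`MapStageListenerGap`, `SimpleStage`, `MapStageJob`) are invented for illustration and are not Spark classes; in the real scheduler the corresponding logic lives in `DAGScheduler.handleTaskCompletion` and `DAGScheduler.submitMissingTasks`.

```scala
object MapStageListenerGap {
  // Hypothetical stand-ins for DAGScheduler state; not Spark classes.
  final case class MapStageJob(id: Int, var notified: Boolean = false)

  final class SimpleStage(val id: Int, val numPartitions: Int) {
    val mapStageJobs = scala.collection.mutable.ListBuffer.empty[MapStageJob]
    val availableOutputs = scala.collection.mutable.Set.empty[Int]
    def isAvailable: Boolean = availableOutputs.size == numPartitions
  }

  private val runningStages = scala.collection.mutable.Set.empty[SimpleStage]

  // Pre-fix behaviour: a successful map task only notifies waiting map-stage
  // jobs when its stage is still tracked in runningStages.
  def handleTaskCompletion(stage: SimpleStage, partition: Int): Unit = {
    stage.availableOutputs += partition
    if (runningStages.contains(stage) && stage.isAvailable) {
      stage.mapStageJobs.foreach(_.notified = true)
    }
  }

  // Pre-fix behaviour: when the resubmitted stage has no missing tasks it is
  // skipped outright, and nothing notifies the waiting map-stage jobs.
  def submitMissingTasks(stage: SimpleStage): Unit = {
    val missing = (0 until stage.numPartitions).filterNot(stage.availableOutputs.contains)
    if (missing.isEmpty) {
      runningStages -= stage
      // Proposed fix (step 5): the stage is fully available here, so notify
      // its map-stage jobs, e.g. stage.mapStageJobs.foreach(_.notified = true)
    }
  }

  def main(args: Array[String]): Unit = {
    val stage1 = new SimpleStage(id = 1, numPartitions = 2)
    val job = MapStageJob(id = 0)
    stage1.mapStageJobs += job

    // Steps 3-4: after the FetchFailed, stage1 is no longer in runningStages
    // (modelled here by never adding it); its speculative tasks then report
    // success while stage0_1 is still running.
    handleTaskCompletion(stage1, partition = 0)
    handleTaskCompletion(stage1, partition = 1)

    // Step 5: the resubmitted stage1_1 has no missing tasks.
    submitMissingTasks(stage1)

    println(s"all outputs available = ${stage1.isAvailable}, listener notified = ${job.notified}")
    // Prints: all outputs available = true, listener notified = false
  }
}
```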

## How was this patch tested?

Not added yet.

Author: jinxing <[email protected]>

Closes apache#21019 from jinxing64/SPARK-23948.

(cherry picked from commit 3990daa)
@squito
Contributor Author

squito commented Apr 17, 2018

Clean cherry-pick of #21019; I just forgot to merge it back to 2.3.

@jiangxb1987
Contributor

LGTM!

@SparkQA

SparkQA commented Apr 17, 2018

Test build #89457 has finished for PR 21085 at commit 35e349f.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@squito
Contributor Author

squito commented Apr 17, 2018

Known flaky test: https://issues.apache.org/jira/browse/SPARK-23894

Merging to branch-2.3.

asfgit pushed a commit that referenced this pull request Apr 17, 2018
@jiangxb1987
Contributor

Should we manually close this PR? @squito

@squito squito closed this Apr 19, 2018
@squito
Contributor Author

squito commented Apr 19, 2018

whoops, thanks for the reminder @jiangxb1987

rdblue pushed a commit to rdblue/spark that referenced this pull request Apr 3, 2019
