[SPARK-29551][CORE] Fix a bug about fetch failed when an executor is lost #26206

weixiuli · 2019-10-22T11:16:07Z

What changes were proposed in this pull request?

When an executor lost with some reason and some things (e.g. the external shuffle service or host lost on the executor's host.) happened, and the executor loses time happens to be reduce stage fetch failed from it which is really bad, the previous only call mapOutputTracker.unregisterMapOutput(shuffleId, mapIndex, bmAddress) to mark clear shuffle status for the mapper at this time , but the other mappers shuffle status on the executor are also unnavailable and the DagScheduler does Not know that, so the reduce stages will fetch failed again when fetch them, the unavailable shuffle status can only be resubmitted by a nest retry stage which is the regression.

As we all know that the previous will call mapOutputTracker.removeOutputsOnHost(host) to mark clear shuffle status on the host or mapOutputTracker.removeOutputsOnExecutor(execId) to mark clear shuffle status on the executor when reduce stage fetches failed and the executor is active, while it does NOT nothing when the executor lost happened, which is really bad .

So we should distinguish the failedEpoch of 'executor lost' from the fetchFailedEpoch of 'fetch failed' to solve the above problem.

Why are the changes needed?

The regression has been described above.

Does this PR introduce any user-facing change?

No.

How was this patch tested?

Add unittests.

weixiuli · 2019-10-22T11:19:47Z

@cloud-fan @gatorsmile Kindly review, thanks.

AmplabJenkins · 2019-10-22T11:22:56Z

Can one of the admins verify this patch?

cloud-fan · 2019-10-22T12:07:26Z

Can you explain what is the regression? I have no idea what is the problem you are trying to fix.

weixiuli · 2019-10-23T02:21:21Z

When an executor lost with some reason and some things (eg:. the external shuffle service or host lost on the executor's host.) happened, and the executor loses time happens to be reduce stage fetch failed from it which is really bad, the previous only call mapOutputTracker.unregisterMapOutput(shuffleId, mapIndex, bmAddress) to mark one map as broken in the map stage at this time , but other maps on the executor are also not available which can only be resubmitted by a nest retry stage which is the regression.

As we all know that the previous will call mapOutputTracker.removeOutputsOnHost(host) or
mapOutputTracker.removeOutputsOnExecutor(execId) when reduce stage fetches failed and the executor is active, while it does NOT for the above problems.

So we should distinguish the failedEpoch of 'executor lost' from the fetchFailedEpoch of 'fetch failed' to solve the above problem.

@cloud-fan

…lost - There will be a regression when the executor lost and then causes 'fetch failed'. - Add fetchFailedEpoch to solve the above problem.

weixiuli · 2019-10-23T09:28:19Z

@cloud-fan @gatorsmile @dongjoon-hyun PTAL.

tgravescs · 2019-10-23T13:22:02Z

can you please add the description to the first message. Please change the title to be what is the description of change. Please specify if this is a regression from spark 2.4 or if you are proposing a new feature. You talk about the specific code in mapoutputtracker but can you please say what the user sees - does the job fail, does it hang, etc.,. fetch failed is a large category.

cloud-fan · 2019-10-23T17:02:27Z

Yea, please describe the problem from an end-user's perspective.

gatorsmile · 2019-10-23T17:29:10Z

cc @jiangxb1987 @Ngone51

dongjoon-hyun · 2019-10-23T18:32:57Z

core/src/main/scala/org/apache/spark/scheduler/DAGScheduler.scala

  // TODO: Garbage collect information about failure epochs when we know there are no more
  //       stray messages to detect.
  private val failedEpoch = new HashMap[String, Long]
+  // There will be a regression when an executor lost and then causes 'fetch failed'.


There will be a regression is misleading because this mean this PR causes a regression.

dongjoon-hyun · 2019-10-23T18:34:46Z