
Conversation

@abstractdog abstractdog commented Oct 1, 2021

The fix is to store downstreamBlamingHosts at the vertex level. This map stores, for every upstream host, the set of downstream hosts that reported fetch failures against it. The idea is to detect upstream node failures more reliably: if downstream tasks from at least n distinct nodes (n is configurable) report failures for an upstream host, the upstream tasks are marked OUTPUT_LOST and rerun. Because this is handled at the vertex level, we can fail a mapper task even on the first fetch failure blaming it, if the mapper's host has already been identified as a likely root cause of failures.

Changes in this patch:

  1. propagating the host of the downstream task through a Tez event
  2. handling hosts in the transition; the core of the fix (see also the sketch after this list) is:
      if (currentNumberOfFailingDownstreamHosts > sourceAttempt.getVertex().getVertexConfig()
          .getMaxAllowedDownstreamHostsReportingFetchFailure()) {
        LOG.info("Host will be marked failed: {} because of {} distinct downstream hosts having fetch failures", sHost,
            currentNumberOfFailingDownstreamHosts);
        tooManyDownstreamHostsBlamedTheSameUpstreamHost = true;
      }
  3. testing improvements: configurable fetch failures (on both the ordered and unordered codepaths), e.g.
tez.runtime.shuffle.fetch.testing.errors.enable=true;
tez.runtime.shuffle.fetch.testing.errors.config=samplehost#100#fail_only_first

more examples in javadoc
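
Below is a minimal sketch of how the vertex-level map described above accumulates blame before the threshold check in item 2; the variable names follow the snippets quoted in this conversation, but the exact wiring is illustrative, not the patch verbatim:

    // Record that downstream host dHost blamed upstream host sHost for a
    // fetch failure, then count how many distinct downstream hosts have
    // blamed sHost so far on this vertex.
    Map<String, Set<String>> downstreamBlamingHosts =
        sourceAttempt.getVertex().getDownstreamBlamingHosts();
    downstreamBlamingHosts.computeIfAbsent(sHost, host -> new HashSet<>()).add(dHost);
    int currentNumberOfFailingDownstreamHosts = downstreamBlamingHosts.get(sHost).size();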

*/
@ConfigurationScope(Scope.AM)
@ConfigurationProperty(type="integer")
public static final String TEZ_AM_MAX_ALLOWED_DOWNSTREAM_HOSTS_REPORTING_FETCH_FAILURE =
Contributor
How do we determine this value for a large cluster, or for a dynamically changing cluster?

sHost, failedDestTaId, dHost, failedInputIndexOnDestTa);

boolean tooManyDownstreamHostsBlamedTheSameUpstreamHost = false;
Map<String, Set<String>> downstreamBlamingHosts = sourceAttempt.getVertex().getDownstreamBlamingHosts();
Contributor

Wouldn't the downstreamBlamingHosts structure occupy a lot of memory in the AM on large clusters? E.g., think of a job running at 1000-node scale with 10 vertices. This could easily accumulate 1000 tasks x 1000 nodes worth of entries, which can put memory pressure on the AM; factor in the number of vertices and the pressure grows further.

How about tracking the downstream hostnames in a set and using that as an optional dimension in the computation? This way, even if multiple task attempts were running on the same host, it would be counted as a single failure (note that in the current master branch it is counted multiple times).

To be more specific, is it possible to track downstream hostnames in a set and use that set.size() when computing the fraction that determines whether the source has to be re-executed or not?

Contributor Author
@abstractdog abstractdog commented Oct 20, 2021

@rbalamohan: please note that this map is stored at the vertex level...

sourceAttempt.getVertex().getDownstreamBlamingHosts()

...not at the task attempt level, so the memory occupied is proportional to the number of vertices. Do you think there is still a memory-pressure problem?

Btw, the idea of storing it at the vertex level was that once we have detected a LOST_OUTPUT for source_attempt_0 on source_host_0, we immediately mark source_attempt_x on source_host_0 as OUTPUT_LOST if an input_read_error comes in blaming source_attempt_x. Please let me know if this makes sense.

This way, even if multiple task attempts were running on the same host, it would be counted as a single failure

I guess downstreamBlamingHosts already works this way, by storing:

Map<upstreamHost, Set<downstreamHost>>

so the number of entries is independent of the number of task attempts.
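
For scale, the worst case per vertex is bounded by the number of hosts squared rather than by task attempts: on a 1000-node cluster, at most 1000 keys with up to 1000 downstream hostnames each, and only hosts involved in failures ever create entries.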

To be more specific, is it possible to track downstream hostnames in a set and use that set.size() when computing the fraction that determines whether the source has to be re-executed or not?

I can try; do you have anything particular in mind? Does "track downstream hostnames" mean all downstream hostnames that reported an input read error? I'm also struggling to see how to make hosts count toward a fraction, i.e., what to divide by what... I need to think this over; please let me know if you have an idea.

Contributor

Thought more about this again. Since this map would be populated mainly under error conditions, there would need to be a lot of bad nodes to cause such a case.

Should be OK to have this map. But the other concern is how to choose the right default config: 1 seems very restrictive, as the source will bail out as soon as a second downstream node reports a failure. Set it to a much higher value; it can then be adjusted as needed.

Can you check whether this value can be adjusted at runtime? If so, it may need to be added to the "confKeys" of the respective inputs/outputs; e.g., please refer to UnorderedKVInput.

Contributor Author

Sure, let me think about it.
Honestly, I didn't like this restrictive, absolute config either (even though I tried to rationalize how it could help).

Contributor Author
@abstractdog abstractdog commented Oct 20, 2021

ContainerLauncherContext.getNumNodes seems to be okay, because it's updated at runtime: every time a task is allocated, the node is marked as seen.
Fortunately, it should work for LLAP without a Hive change, because LlapTaskSchedulerService.scheduleTask takes care of updating it via TaskSchedulerManager.taskAllocated (just like DagAwareYarnTaskScheduler and YarnTaskSchedulerService in Tez):

getContext().taskAllocated(taskInfo.task, taskInfo.clientCookie, container);

So it doesn't reflect all the nodes, only the ones being used. I think I can use this as a maximum and compute a hostFailureFraction as reportingDownstreamHosts / numNodes.
Let me know if it doesn't make sense; I'll try to experiment a bit.
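
A minimal sketch of the fraction-based check described above; the identifiers numNodes and maxAllowedDownstreamHostFailuresFraction are illustrative placeholders, not necessarily the patch's exact names:

    // Fraction of known nodes whose downstream tasks have blamed this
    // upstream host; if it exceeds the configured maximum, treat the
    // upstream host as failed so its outputs are regenerated.
    double fraction = (double) currentNumberOfFailingDownstreamHosts / numNodes;
    if (fraction > maxAllowedDownstreamHostFailuresFraction) {
      tooManyDownstreamHostsBlamedTheSameUpstreamHost = true;
    }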

Contributor Author
@abstractdog abstractdog commented Oct 22, 2021

@rbalamohan: uploaded a new patch here c523ebb

I tested this on a cluster and found that, instead of the total number of nodes, we need to consider active hosts only.
It was an interesting situation: the AM was running while I restarted the Hive LLAP application, and since node removal is not implemented in AMNodeTracker, it turned out we need to count only ACTIVE nodes. Here is what I saw when I logged the nodes:

2021-10-21 05:34:53,798 [INFO] [Dispatcher thread {Central}] |impl.TaskAttemptImpl|: nodes: {ccycloud-4.hive-runtime-perf.root.hwx.site:40783:0deaa785-cd96-45ad-ae1c-e5f9e44be73a={AMNodeImpl: nodeId: ccycloud-4.hive-runtime-perf.root.hwx.site:40783:0deaa785-cd96-45ad-ae1c-e5f9e44be73a, state: UNHEALTHY, containers: 0, completed containers: 0, healthy: false, blackListed: false}, ccycloud-8.hive-runtime-perf.root.hwx.site:39959:fce0bc31-a1a4-49ea-b40c-53ea84a783f6={AMNodeImpl: nodeId: ccycloud-8.hive-runtime-perf.root.hwx.site:39959:fce0bc31-a1a4-49ea-b40c-53ea84a783f6, state: ACTIVE, containers: 221, completed containers: 179, healthy: true, blackListed: false}, ccycloud-7.hive-runtime-perf.root.hwx.site:38523:1a2b75f4-1d87-4735-bea4-d96790ee5420={AMNodeImpl: nodeId: ccycloud-7.hive-runtime-perf.root.hwx.site:38523:1a2b75f4-1d87-4735-bea4-d96790ee5420, state: ACTIVE, containers: 223, completed containers: 181, healthy: true, blackListed: false}, ccycloud-5.hive-runtime-perf.root.hwx.site:35671:ea352491-39bb-4d81-b9c5-e519b87696b8={AMNodeImpl: nodeId: ccycloud-5.hive-runtime-perf.root.hwx.site:35671:ea352491-39bb-4d81-b9c5-e519b87696b8, state: UNHEALTHY, containers: 0, completed containers: 0, healthy: false, blackListed: false}, ccycloud-4.hive-runtime-perf.root.hwx.site:46577:d8f0caff-790e-453a-96ac-14964d660f7e={AMNodeImpl: nodeId: ccycloud-4.hive-runtime-perf.root.hwx.site:46577:d8f0caff-790e-453a-96ac-14964d660f7e, state: ACTIVE, containers: 236, completed containers: 194, healthy: true, blackListed: false}, ccycloud-3.hive-runtime-perf.root.hwx.site:34353:69d5de51-df04-481b-9aad-a6b88a2a33c9={AMNodeImpl: nodeId: ccycloud-3.hive-runtime-perf.root.hwx.site:34353:69d5de51-df04-481b-9aad-a6b88a2a33c9, state: UNHEALTHY, containers: 0, completed containers: 0, healthy: false, blackListed: false}, ccycloud-6.hive-runtime-perf.root.hwx.site:44728:187f814d-9760-4965-881f-9dac9f4a2ae5={AMNodeImpl: nodeId: ccycloud-6.hive-runtime-perf.root.hwx.site:44728:187f814d-9760-4965-881f-9dac9f4a2ae5, state: UNHEALTHY, containers: 0, completed containers: 0, healthy: false, blackListed: false}, ccycloud-9.hive-runtime-perf.root.hwx.site:42123:dcb2622b-fed3-4f88-a4e2-70d0ebc57a5e={AMNodeImpl: nodeId: ccycloud-9.hive-runtime-perf.root.hwx.site:42123:dcb2622b-fed3-4f88-a4e2-70d0ebc57a5e, state: ACTIVE, containers: 230, completed containers: 188, healthy: true, blackListed: false}}

2021-10-21 05:34:53,798 [INFO] [Dispatcher thread {Central}] |impl.TaskAttemptImpl|: currentNumberOfFailingDownstreamHosts: 1, numNodes: 4, fraction: 0.25, max allowed: 0.2

The second log message was produced once numNodes counted only state: ACTIVE nodes; in the temporary "nodes:" log message above you can see the 4 running LLAP daemons plus the 4 old ones, which show up as UNHEALTHY instead of ACTIVE.
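
A minimal sketch of counting only ACTIVE nodes for the denominator, assuming a node map shaped like the AMNodeImpl entries in the log above (the accessor names are assumptions, not verified against AMNodeTracker's API):

    // Count only nodes currently in the ACTIVE state; restarted LLAP
    // daemons leave their old entries behind as UNHEALTHY, and those
    // must not inflate the denominator of the failure fraction.
    int numNodes = 0;
    for (AMNode node : nodes.values()) {
      if (node.getState() == AMNodeState.ACTIVE) {
        numNodes++;
      }
    }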

… as early as possible - upstream(mapper) problems

Change-Id: I8907507b0527a65e6492fb7d83ff55b0f17bd7a4
@tez-yetus

🎊 +1 overall

Vote Subsystem Runtime Comment
+0 🆗 reexec 1m 2s Docker mode activated.
_ Prechecks _
+1 💚 dupname 0m 0s No case conflicting files found.
+0 🆗 prototool 0m 0s prototool was not available.
+1 💚 @author 0m 0s The patch does not contain any @author tags.
+1 💚 test4tests 0m 0s The patch appears to include 1 new or modified test files.
_ master Compile Tests _
+0 🆗 mvndep 4m 19s Maven dependency ordering for branch
+1 💚 mvninstall 9m 20s master passed
+1 💚 compile 2m 7s master passed with JDK Ubuntu-11.0.11+9-Ubuntu-0ubuntu2.20.04
+1 💚 compile 1m 53s master passed with JDK Private Build-1.8.0_292-8u292-b10-0ubuntu1~20.04-b10
+1 💚 checkstyle 1m 57s master passed
+1 💚 javadoc 2m 5s master passed with JDK Ubuntu-11.0.11+9-Ubuntu-0ubuntu2.20.04
+1 💚 javadoc 1m 50s master passed with JDK Private Build-1.8.0_292-8u292-b10-0ubuntu1~20.04-b10
+0 🆗 spotbugs 1m 15s Used deprecated FindBugs config; considering switching to SpotBugs.
+1 💚 findbugs 4m 12s master passed
_ Patch Compile Tests _
+0 🆗 mvndep 0m 9s Maven dependency ordering for patch
+1 💚 mvninstall 1m 21s the patch passed
+1 💚 compile 1m 30s the patch passed with JDK Ubuntu-11.0.11+9-Ubuntu-0ubuntu2.20.04
+1 💚 cc 1m 30s the patch passed
+1 💚 javac 1m 30s the patch passed
+1 💚 compile 1m 18s the patch passed with JDK Private Build-1.8.0_292-8u292-b10-0ubuntu1~20.04-b10
+1 💚 cc 1m 18s the patch passed
+1 💚 javac 1m 18s the patch passed
-0 ⚠️ checkstyle 0m 16s tez-runtime-library: The patch generated 6 new + 131 unchanged - 2 fixed = 137 total (was 133)
-0 ⚠️ checkstyle 0m 27s tez-dag: The patch generated 1 new + 543 unchanged - 0 fixed = 544 total (was 543)
+1 💚 whitespace 0m 0s The patch has no whitespace issues.
+1 💚 javadoc 1m 18s the patch passed with JDK Ubuntu-11.0.11+9-Ubuntu-0ubuntu2.20.04
+1 💚 javadoc 1m 12s the patch passed with JDK Private Build-1.8.0_292-8u292-b10-0ubuntu1~20.04-b10
+1 💚 findbugs 3m 47s the patch passed
_ Other Tests _
+1 💚 unit 1m 54s tez-api in the patch passed.
+1 💚 unit 0m 32s tez-runtime-internals in the patch passed.
+1 💚 unit 4m 49s tez-runtime-library in the patch passed.
+1 💚 unit 4m 6s tez-dag in the patch passed.
+1 💚 asflicense 0m 34s The patch does not generate ASF License warnings.
53m 58s
Subsystem Report/Notes
Docker ClientAPI=1.41 ServerAPI=1.41 base: https://ci-hadoop.apache.org/job/tez-multibranch/job/PR-152/6/artifact/out/Dockerfile
GITHUB PR #152
JIRA Issue TEZ-4338
Optional Tests dupname asflicense javac javadoc unit spotbugs findbugs checkstyle compile cc prototool
uname Linux 18851ace3bd4 4.15.0-143-generic #147-Ubuntu SMP Wed Apr 14 16:10:11 UTC 2021 x86_64 x86_64 x86_64 GNU/Linux
Build tool maven
Personality personality/tez.sh
git revision master / 58fca8b
Default Java Private Build-1.8.0_292-8u292-b10-0ubuntu1~20.04-b10
Multi-JDK versions /usr/lib/jvm/java-11-openjdk-amd64:Ubuntu-11.0.11+9-Ubuntu-0ubuntu2.20.04 /usr/lib/jvm/java-8-openjdk-amd64:Private Build-1.8.0_292-8u292-b10-0ubuntu1~20.04-b10
checkstyle https://ci-hadoop.apache.org/job/tez-multibranch/job/PR-152/6/artifact/out/diff-checkstyle-tez-runtime-library.txt
checkstyle https://ci-hadoop.apache.org/job/tez-multibranch/job/PR-152/6/artifact/out/diff-checkstyle-tez-dag.txt
Test Results https://ci-hadoop.apache.org/job/tez-multibranch/job/PR-152/6/testReport/
Max. process+thread count 394 (vs. ulimit of 5500)
modules C: tez-api tez-runtime-internals tez-runtime-library tez-dag U: .
Console output https://ci-hadoop.apache.org/job/tez-multibranch/job/PR-152/6/console
versions git=2.25.1 maven=3.6.3 findbugs=3.0.1
Powered by Apache Yetus 0.12.0 https://yetus.apache.org

This message was automatically generated.

@abstractdog (Contributor Author)

@rbalamohan: is there anything still needed from this PR to get it committed? Thanks in advance.

@ConfigurationProperty(type="integer")
public static final String TEZ_AM_MAX_ALLOWED_DOWNSTREAM_HOST_FAILURES_FRACTION =
TEZ_AM_PREFIX + "max.allowed.downstream.host.failures.fraction";
public static final double TEZ_AM_MAX_ALLOWED_DOWNSTREAM_HOST_FAILURES_FRACTION_DEFAULT = 0.2;
Contributor

Mostly looks good. This may not work well for small clusters (e.g., < 5 nodes), where a single node failure can cause the source to restart its tasks. This can be tweaked later.

Contributor Author

I see. I'm assuming that on such a small cluster a fraction of 0.25 would work properly (so in the case of 4 hosts, 1 failing downstream host won't make the source restart immediately; at least 2 downstream reporting hosts are needed).
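
Worked through with the strict > comparison from the check above: on 4 hosts with the 0.2 default, a single blaming host already gives 1/4 = 0.25 > 0.2 and triggers a rerun; with the fraction raised to 0.25, one host gives 0.25, which is not greater than 0.25, so a second blaming host (2/4 = 0.5) is required.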

@rbalamohan (Contributor)

LGTM. +1.
