TEZ-4338: Tez should consider node information to realize OUTPUT_LOST as early as possible - upstream(mapper) problems #152
Changes from all commits
@@ -194,7 +194,7 @@ public static DataEventDependencyInfo fromProto(DataEventDependencyInfoProto pro
     private Container container;
     private long allocationTime;
     private ContainerId containerId;
-    private NodeId containerNodeId;
+    protected NodeId containerNodeId;
     private String nodeHttpAddress;
     private String nodeRackName;
@@ -1793,85 +1793,130 @@ protected static class OutputReportedFailedTransition implements
     MultipleArcTransition<TaskAttemptImpl, TaskAttemptEvent, TaskAttemptStateInternal> {

     @Override
-    public TaskAttemptStateInternal transition(TaskAttemptImpl attempt,
+    public TaskAttemptStateInternal transition(TaskAttemptImpl sourceAttempt,
         TaskAttemptEvent event) {
       TaskAttemptEventOutputFailed outputFailedEvent =
           (TaskAttemptEventOutputFailed) event;
-      TezEvent tezEvent = outputFailedEvent.getInputFailedEvent();
-      TezTaskAttemptID failedDestTaId = tezEvent.getSourceInfo().getTaskAttemptID();
-      InputReadErrorEvent readErrorEvent = (InputReadErrorEvent)tezEvent.getEvent();
+      TezEvent inputFailedEvent = outputFailedEvent.getInputFailedEvent();
+      TezTaskAttemptID failedDestTaId = inputFailedEvent.getSourceInfo().getTaskAttemptID();
+
+      InputReadErrorEvent readErrorEvent = (InputReadErrorEvent)inputFailedEvent.getEvent();
       int failedInputIndexOnDestTa = readErrorEvent.getIndex();
-      if (readErrorEvent.getVersion() != attempt.getID().getId()) {
-        throw new TezUncheckedException(attempt.getID()
+
+      if (readErrorEvent.getVersion() != sourceAttempt.getID().getId()) {
+        throw new TezUncheckedException(sourceAttempt.getID()
            + " incorrectly blamed for read error from " + failedDestTaId
            + " at inputIndex " + failedInputIndexOnDestTa + " version"
            + readErrorEvent.getVersion());
       }
-      LOG.info(attempt.getID()
-          + " blamed for read error from " + failedDestTaId
-          + " at inputIndex " + failedInputIndexOnDestTa);
-      long time = attempt.clock.getTime();
-      Long firstErrReportTime = attempt.uniquefailedOutputReports.get(failedDestTaId);
+      // source host: where the data input is supposed to come from
+      String sHost = sourceAttempt.getNodeId().getHost();
+      // destination: where the data is tried to be fetched to
+      String dHost = readErrorEvent.getDestinationLocalhostName();
+
+      LOG.info("{} (on {}) blamed for read error from {} (on {}) at inputIndex {}", sourceAttempt.getID(),
+          sHost, failedDestTaId, dHost, failedInputIndexOnDestTa);
+
+      boolean tooManyDownstreamHostsBlamedTheSameUpstreamHost = false;
+      Map<String, Set<String>> downstreamBlamingHosts = sourceAttempt.getVertex().getDownstreamBlamingHosts();
rbalamohan (Contributor): Wouldn't the downstreamBlamingHosts structure occupy a lot of memory in the AM on large clusters? E.g. think of a job running at 1000-node scale with 10 vertices. This can easily reach 1000 tasks x 1000 nodes entries, which can put memory pressure on the AM; add the number of vertices to this and it will be even more pressure. How about tracking the downstream hostnames in a set and using that as an optional dimension in the computation? That way, even if multiple task attempts were running on the same host, it would be accounted as a single failure (note that in the current master branch it is accounted multiple times). To be more specific: is it possible to track downstream hostnames in a set and use that set.size() for computing the fraction that determines whether the source has to be re-executed?

abstractdog (Contributor, Author): @rbalamohan: please note that this map is stored at vertex level, not at task attempt level, so the memory occupied is proportional to the number of vertices; do you think there is still a memory pressure problem? The idea of storing it at vertex level is that once we have detected a LOST_OUTPUT for source_attempt_0 on source_host_0, we can immediately mark source_attempt_x on source_host_0 as OUTPUT_LOST when an input read error comes in blaming source_attempt_x; please let me know if this makes sense. This is how downstreamBlamingHosts works: the number of entries is independent of the number of task attempts. I can try the set-based approach; do you have anything particular in mind? Does "track downstream hostnames" mean all downstream hostnames that reported an input read error? I'm also struggling to see how to turn hosts into a fraction, i.e. what to divide by what; I need to think this over, please let me know if you have an idea.

rbalamohan (Contributor): Thought more about this again. Since this map would be populated mainly during error conditions, it would take a lot of bad nodes to cause such a case, so it should be okay to keep the map. The other concern is how to choose the right default config. 1 seems very restrictive, as the source would bail out immediately when a second downstream node reports a failure; set it to a much higher value, and it can be adjusted as needed. Can you check whether this value can be adjusted at runtime? If so, it may need to be added to the "confKeys" of the respective inputs/outputs; e.g. please refer to UnorderedKVInput.

abstractdog (Contributor, Author): Sure, let me think about it.

abstractdog (Contributor, Author): ContainerLauncherContext.getNumNodes seems to be okay because it is updated at runtime: every time a task is allocated, the node is marked as seen, so it does not reflect all nodes, only the ones actually being used. I think I can use this as a maximum and compute a hostFailureFraction as reportingDownstreamHosts / numNodes.

abstractdog (Contributor, Author): @rbalamohan: uploaded a new patch here c523ebb, tested on a cluster. I found that instead of the total number of nodes we need to consider active hosts only. The attached log output was taken when numNodes: 4 reflected only state:ACTIVE nodes (I used a temporary "nodes:" log message); you can see 4 running LLAP daemons and 4 old ones, which show up as UNHEALTHY instead of ACTIVE.
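For clarity, a minimal standalone sketch of the per-vertex bookkeeping described in this thread. It uses plain Java collections and made-up names (DownstreamBlamingTracker, blameAndCheck) rather than the actual Tez types, and an arbitrary 0.2 limit; in the patch the map lives on the Vertex and the limit and active node count come from the vertex config and the node tracker.

import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Illustrative model only, not the Tez implementation.
public class DownstreamBlamingTracker {
  // upstream (source) host -> distinct downstream hosts that reported a fetch failure from it;
  // one map per vertex, so its size is bounded by the number of hosts, not task attempts
  private final Map<String, Set<String>> downstreamBlamingHosts = new HashMap<>();

  // Records that dHost failed to fetch from sHost and returns true once the fraction of
  // distinct blaming downstream hosts over active nodes exceeds the configured limit.
  public boolean blameAndCheck(String sHost, String dHost, int numActiveNodes, double maxFraction) {
    downstreamBlamingHosts.computeIfAbsent(sHost, k -> new HashSet<>()).add(dHost);
    int blamingHosts = downstreamBlamingHosts.get(sHost).size();
    float hostFailureFraction = numActiveNodes > 0 ? ((float) blamingHosts) / numActiveNodes : 0;
    return hostFailureFraction > maxFraction;
  }

  public static void main(String[] args) {
    DownstreamBlamingTracker tracker = new DownstreamBlamingTracker();
    // 10 active nodes, limit 0.2: the third distinct blaming host tips the source over the limit;
    // a repeated report from the same downstream host would not grow the set.
    System.out.println(tracker.blameAndCheck("src-host-1", "dst-host-1", 10, 0.2)); // false (1/10)
    System.out.println(tracker.blameAndCheck("src-host-1", "dst-host-2", 10, 0.2)); // false (2/10)
    System.out.println(tracker.blameAndCheck("src-host-1", "dst-host-3", 10, 0.2)); // true  (3/10 > 0.2)
  }
}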
+      if (!downstreamBlamingHosts.containsKey(sHost)) {
+        LOG.info("Host {} is blamed for fetch failure from {} for the first time", sHost, dHost);
+        downstreamBlamingHosts.put(sHost, new HashSet<String>());
+      }
+
+      downstreamBlamingHosts.get(sHost).add(dHost);
+      int currentNumberOfFailingDownstreamHosts = downstreamBlamingHosts.get(sHost).size();
+      int numNodes = getNumNodes(sourceAttempt);
+      float hostFailureFraction = numNodes > 0 ? ((float) currentNumberOfFailingDownstreamHosts) / numNodes : 0;
+      double maxAllowedHostFailureFraction = sourceAttempt.getVertex().getVertexConfig()
+          .getMaxAllowedDownstreamHostFailuresFraction();
+
+      if (hostFailureFraction > maxAllowedHostFailureFraction) {
+        LOG.info("Host will be marked fail: {} because of host failure fraction {} is beyond the limit {}", sHost,
+            hostFailureFraction, maxAllowedHostFailureFraction);
+        tooManyDownstreamHostsBlamedTheSameUpstreamHost = true;
+      }
+
+      long time = sourceAttempt.clock.getTime();
+
+      Long firstErrReportTime = sourceAttempt.uniquefailedOutputReports.get(failedDestTaId);
       if (firstErrReportTime == null) {
-        attempt.uniquefailedOutputReports.put(failedDestTaId, time);
+        sourceAttempt.uniquefailedOutputReports.put(failedDestTaId, time);
         firstErrReportTime = time;
       }

-      int maxAllowedOutputFailures = attempt.getVertex().getVertexConfig()
+      int maxAllowedOutputFailures = sourceAttempt.getVertex().getVertexConfig()
           .getMaxAllowedOutputFailures();
-      int maxAllowedTimeForTaskReadErrorSec = attempt.getVertex()
+      int maxAllowedTimeForTaskReadErrorSec = sourceAttempt.getVertex()
           .getVertexConfig().getMaxAllowedTimeForTaskReadErrorSec();
-      double maxAllowedOutputFailuresFraction = attempt.getVertex()
+      double maxAllowedOutputFailuresFraction = sourceAttempt.getVertex()
           .getVertexConfig().getMaxAllowedOutputFailuresFraction();

       int readErrorTimespanSec = (int)((time - firstErrReportTime)/1000);
       boolean crossTimeDeadline = readErrorTimespanSec >= maxAllowedTimeForTaskReadErrorSec;

-      int runningTasks = attempt.appContext.getCurrentDAG().getVertex(
+      int runningTasks = sourceAttempt.appContext.getCurrentDAG().getVertex(
           failedDestTaId.getTaskID().getVertexID()).getRunningTasks();
-      float failureFraction = runningTasks > 0 ? ((float) attempt.uniquefailedOutputReports.size()) / runningTasks : 0;
+      float failureFraction =
+          runningTasks > 0 ? ((float) sourceAttempt.uniquefailedOutputReports.size()) / runningTasks : 0;
       boolean withinFailureFractionLimits =
           (failureFraction <= maxAllowedOutputFailuresFraction);
       boolean withinOutputFailureLimits =
-          (attempt.uniquefailedOutputReports.size() < maxAllowedOutputFailures);
+          (sourceAttempt.uniquefailedOutputReports.size() < maxAllowedOutputFailures);

       // If needed we can launch a background task without failing this task
       // to generate a copy of the output just in case.
       // If needed we can consider only running consumer tasks
       if (!crossTimeDeadline && withinFailureFractionLimits && withinOutputFailureLimits
-          && !(readErrorEvent.isLocalFetch() || readErrorEvent.isDiskErrorAtSource())) {
-        return attempt.getInternalState();
+          && !(readErrorEvent.isLocalFetch() || readErrorEvent.isDiskErrorAtSource())
+          && !tooManyDownstreamHostsBlamedTheSameUpstreamHost) {
+        return sourceAttempt.getInternalState();
       }
-      String message = attempt.getID() + " being failed for too many output errors. "
+      String message = sourceAttempt.getID() + " being failed for too many output errors. "
           + "failureFraction=" + failureFraction
           + ", MAX_ALLOWED_OUTPUT_FAILURES_FRACTION="
           + maxAllowedOutputFailuresFraction
-          + ", uniquefailedOutputReports=" + attempt.uniquefailedOutputReports.size()
+          + ", uniquefailedOutputReports=" + sourceAttempt.uniquefailedOutputReports.size()
           + ", MAX_ALLOWED_OUTPUT_FAILURES=" + maxAllowedOutputFailures
+          + ", hostFailureFraction=" + hostFailureFraction
+          + " (" + currentNumberOfFailingDownstreamHosts + " / " + numNodes + ")"
+          + ", MAX_ALLOWED_DOWNSTREAM_HOST_FAILURES_FRACTION="
+          + maxAllowedHostFailureFraction
           + ", MAX_ALLOWED_TIME_FOR_TASK_READ_ERROR_SEC="
           + maxAllowedTimeForTaskReadErrorSec
           + ", readErrorTimespan=" + readErrorTimespanSec
           + ", isLocalFetch=" + readErrorEvent.isLocalFetch()
           + ", isDiskErrorAtSource=" + readErrorEvent.isDiskErrorAtSource();

       LOG.info(message);
-      attempt.addDiagnosticInfo(message);
+      sourceAttempt.addDiagnosticInfo(message);
       // send input failed event
-      attempt.sendInputFailedToConsumers();
+      sourceAttempt.sendInputFailedToConsumers();
       // Not checking for leafVertex since a READ_ERROR should only be reported for intermediate tasks.
-      if (attempt.getInternalState() == TaskAttemptStateInternal.SUCCEEDED) {
+      if (sourceAttempt.getInternalState() == TaskAttemptStateInternal.SUCCEEDED) {
         (new TerminatedAfterSuccessHelper(FAILED_HELPER)).transition(
-            attempt, event);
+            sourceAttempt, event);
         return TaskAttemptStateInternal.FAILED;
       } else {
         (new TerminatedWhileRunningTransition(FAILED_HELPER)).transition(
-            attempt, event);
+            sourceAttempt, event);
         return TaskAttemptStateInternal.FAIL_IN_PROGRESS;
       }
       // TODO at some point. Nodes may be interested in FetchFailure info.
       // Can be used to blacklist nodes.
     }

+    private int getNumNodes(TaskAttemptImpl sourceAttempt) {
+      Vertex vertex = sourceAttempt.getVertex();
+      String taskSchedulerName = vertex.getServicePluginInfo().getTaskSchedulerName();
+      int sourceIndex = vertex.getAppContext().getTaskScheduerIdentifier(taskSchedulerName);
+      int numActiveNodes = vertex.getAppContext().getNodeTracker().getNumActiveNodes(sourceIndex);
+      if (LOG.isDebugEnabled()) {
+        int numAllNodes = vertex.getAppContext().getNodeTracker().getNumNodes(sourceIndex);
+        LOG.debug("Getting nodes, active/all: {}/{}", numActiveNodes, numAllNodes);
+      }
+      return numActiveNodes;
+    }
   }

   @VisibleForTesting
rbalamohan (Contributor): Mostly looks good. This may not work for small clusters (e.g. < 5 nodes), where a single node failure can cause the source task to be restarted. This can be tweaked later.

abstractdog (Contributor, Author): I see. I'm assuming that on such a small cluster a fraction of 0.25 might work properly: with 4 hosts, 1 failing downstream host won't make the source restart immediately, at least 2 reporting downstream hosts are needed.
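A quick check of that arithmetic as a self-contained Java sketch. The class name SmallClusterFractionCheck is made up for illustration; the 4 nodes, the 0.25 limit, and the strict greater-than comparison follow the comment above and the patched transition.

// Illustrative small-cluster scenario: 4 active nodes, assumed limit 0.25.
public class SmallClusterFractionCheck {
  public static void main(String[] args) {
    int numActiveNodes = 4;
    double maxAllowedFraction = 0.25;
    // 1 blaming downstream host: 1/4 = 0.25, not above the limit -> source attempt survives
    System.out.println((1f / numActiveNodes) > maxAllowedFraction); // false
    // 2 blaming downstream hosts: 2/4 = 0.50, above the limit -> source attempt is failed and re-run
    System.out.println((2f / numActiveNodes) > maxAllowedFraction); // true
  }
}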