@abstractdog
Contributor

@abstractdog abstractdog commented May 18, 2021

Problems fixed:

  1. DagInfo.getEvents() returns an array with only 1 element
  2. All of the events (DagInfo.getEvents(), VertexInfo.getEvents()) have "0" as timestamp
  3. HistoryEventProtoJsonConversion: TASK_FINISHED and VERTEX_FINISHED events don't contain startTime, only timeTaken; since timeTaken is fixed, startTime should be derived from it, not the other way around (this caused nonsense task durations in analyzers while parsing proto history files)
  4. Fix an NPE in TaskAttemptResultStatisticsAnalyzer
  5. Float truncation problem in SkewAnalyzer
  6. Counters format workaround for TEZ-4324

Refactoring:
removed configuration object from analyzers as TezAnalyzerBase is already a Configured class

New analyzers:
InputReadErrorAnalyzer
DagOverviewAnalyzer
TaskHangAnalyzer

Example Excel sheets generated with the analyzers are attached to the Jira.

@abstractdog
Contributor Author

@rbalamohan: could you please take a look? new analyzers + minor bugfixes

@abstractdog abstractdog requested a review from rbalamohan May 18, 2021 10:10
@abstractdog
Contributor Author

@rbalamohan : ping, if you have some cycles to review this fix/improvement :)

Contributor

@rbalamohan rbalamohan left a comment

Thanks @abstractdog for the patch. Added review comments.

}

// attempt_1599682376162_0006_27_00_000086_1
int attemptNumber = Integer.parseInt(attempt.getTaskAttemptId().split("_")[6]);
Contributor

Replace with "TezTaskAttemptID.fromString(attempt.getTaskAttemptId()).getId()"?

Contributor Author

oh, right, what a hack this parseInt was :)
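
For illustration, a dependency-free sketch of why the positional `split("_")[6]` is fragile: it checks the whole ID shape before reading the last field. `AttemptIdSketch` and its regex are hypothetical and only mirror the example ID from the code comment; inside Tez, the suggested `TezTaskAttemptID.fromString(...).getId()` would be used instead.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical helper for illustration only; not part of Tez. */
final class AttemptIdSketch {
  // Shape taken from the example: attempt_1599682376162_0006_27_00_000086_1
  // The last numeric field is assumed to be the attempt number.
  private static final Pattern ATTEMPT_ID =
      Pattern.compile("attempt_(\\d+)_(\\d+)_(\\d+)_(\\d+)_(\\d+)_(\\d+)");

  /** Returns the attempt number, rejecting strings that are not attempt IDs. */
  static int attemptNumber(String attemptId) {
    Matcher m = ATTEMPT_ID.matcher(attemptId);
    if (!m.matches()) {
      throw new IllegalArgumentException("Not a task attempt id: " + attemptId);
    }
    return Integer.parseInt(m.group(6));
  }

  public static void main(String[] args) {
    // prints 1
    System.out.println(attemptNumber("attempt_1599682376162_0006_27_00_000086_1"));
  }
}
```

Unlike the raw split, a malformed ID fails with a descriptive exception rather than an `ArrayIndexOutOfBoundsException` or a silently wrong field.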

// attempt_1599682376162_0006_27_00_000086_1
int attemptNumber = Integer.parseInt(attempt.getTaskAttemptId().split("_")[6]);
if (attemptNumber == numAttemptsForTask - 1) {
thisTaskData.put("last_attempt_id", attempt.getTaskAttemptId());
Contributor

Declare as static final Strings?

Contributor Author

sure


// attempt_1599682376162_0006_27_00_000086_1
int attemptNumber = Integer.parseInt(attempt.getTaskAttemptId().split("_")[6]);
if (attemptNumber == numAttemptsForTask - 1) {
Contributor

Sometimes all the attempts may get scheduled on the same node and fail. It would be good to capture that as well. Would you like to refactor it so that it provides the id/status and the respective node details for all attempts?

In the final output, this can be a concatenated string (to keep it readable and easy to print).

It would be nice to provide this info in this analyzer itself; without it, the user may have to correlate with the results of TaskAssignmentAnalyzer for node analysis.

Contributor Author

okay, adding status in "id/status" format to the "last_attempt_id" column
node info is already there in the last column: "last_attempt_node"
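
A rough sketch of the concatenated form the reviewer describes, assuming each attempt exposes an id, a status and a node. The `Attempt` class, the `id/status@node` layout, and the node names are hypothetical and not taken from the patch.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import java.util.StringJoiner;

/** Hypothetical illustration; the real analyzer works on Tez's own attempt objects. */
final class AttemptSummarySketch {

  /** Minimal stand-in for the per-attempt data the analyzer has available. */
  static final class Attempt {
    final String id;
    final String status;
    final String node;

    Attempt(String id, String status, String node) {
      this.id = id;
      this.status = status;
      this.node = node;
    }
  }

  /** Joins every attempt as "id/status@node" so a single cell shows the full history. */
  static String summarize(List<Attempt> attempts) {
    StringJoiner joined = new StringJoiner(", ");
    for (Attempt a : attempts) {
      joined.add(a.id + "/" + a.status + "@" + a.node);
    }
    return joined.toString();
  }

  /** True when there is at least one attempt and all attempts ran on a single node. */
  static boolean allOnSameNode(List<Attempt> attempts) {
    Set<String> nodes = new HashSet<>();
    for (Attempt a : attempts) {
      nodes.add(a.node);
    }
    return nodes.size() == 1;
  }

  public static void main(String[] args) {
    List<Attempt> attempts = Arrays.asList(
        new Attempt("attempt_1599682376162_0006_27_00_000086_0", "FAILED", "node-1"),
        new Attempt("attempt_1599682376162_0006_27_00_000086_1", "FAILED", "node-1"));
    System.out.println(summarize(attempts));
    System.out.println(allOnSameNode(attempts));
  }
}
```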


@Override
public String getDescription() {
return "TaskHandAnalyzer can give quick insights about hanging tasks/task attempts"
Contributor

Typo?

Contributor Author

fixing to simply "hanging task attempts"

/**
* Get the Task assignments on different nodes of the cluster.
*/
public class TaskHangAnalyzer extends TezAnalyzerBase implements Analyzer {
Contributor

Rename as "HungTaskAnalyzer" ?

Contributor Author

okay, makes sense

import java.util.Map;

/**
* Get the Task assignments on different nodes of the cluster.
Contributor

Fix comments as this analyser is related to hung task analysis

Contributor Author

sure

/**
* This analyzer is support to collect which nodes can be blamed for shuffle read errors.
*/
public class InputReadErrorAnalyzer extends TezAnalyzerBase implements Analyzer {
Contributor

Nice!

Contributor Author

thanks, also I'm fixing "is support to collect", now it doesn't seem to be correct somehow


  JSONObject otherInfo = new JSONObject();
- otherInfo.put(ATSConstants.START_TIME, startTime);
+ otherInfo.put(ATSConstants.START_TIME, event.getEventTime() - timeTaken);
Contributor

IIRC, there were corner cases where events were not properly populated, maybe when vertices were shut down due to errors or similar (need to check).

In such cases, this would have returned a negative value earlier.

The current patch seems to change start_time depending on getEventTime. This could give the impression that the task/vertex existed only for a very short time.

Can you please share more info on the previous error? Were you getting negative values earlier, and is that why this is being modified?

Contributor Author

@abstractdog abstractdog Aug 6, 2021

actually, I saw this for all protobuf history files, that's why I don't suspect a corner case
I found that in case of a TASK_FINISHED event, there is always an event time (which is obviously the end time) and timeTaken in event_data, but there is no startTime there

this string is what the debugger writes for a HistoryLoggerProtos$HistoryEventProto instance while doing this conversion:

event_type: "TASK_FINISHED"
event_time: 1628149977709
app_id: "application_1628051798891_0030"
dag_id: "dag_1628051798891_0030_1"
vertex_id: "vertex_1628051798891_0030_1_00"
task_id: "task_1628051798891_0030_1_00_000001"
event_data {
  key: "timeTaken"
  value: "4193"
}
event_data {
  key: "status"
  value: "SUCCEEDED"
}
event_data {
  key: "numFailedTaskAttempts"
  value: "0"
}
event_data {
  key: "successfulAttemptId"
  value: "attempt_1628051798891_0030_1_00_000001_0"
}
event_data {
  key: "diagnostics"
  value: ""
}
event_data {
  key: "counters"
  value: "..."
}

the root cause of this behavior would be:
https://github.com/apache/tez/blob/master/tez-plugins/tez-protobuf-history-plugin/src/main/java/org/apache/tez/dag/history/logging/proto/HistoryEventProtoConverter.java#L392

  private HistoryEventProto convertTaskFinishedEvent(TaskFinishedEvent event) {
    HistoryEventProto.Builder builder = makeBuilderForEvent(event, event.getFinishTime(),
        null, null, null, null, event.getTaskID(), null, null);

    addEventData(builder, ATSConstants.TIME_TAKEN, (event.getFinishTime() - event.getStartTime()));

here I can see the builder consumes only event.getFinishTime() for the "time" parameter, and startTime is shipped indirectly...according to blame, this code part is unchanged since the introduction of proto history logger (TEZ-3915)
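
As a worked example of the derivation, using the values from the debugger dump above: the event time is the finish time, and `timeTaken` arrives as a string in `event_data`. `StartTimeSketch` and the plain `Map` are hypothetical stand-ins for the real proto-to-JSON conversion code.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical illustration of deriving startTime from a *_FINISHED event. */
final class StartTimeSketch {

  /** startTime = event time (the finish time) minus timeTaken, both in milliseconds. */
  static long deriveStartTime(long eventTimeMillis, Map<String, String> eventData) {
    long timeTakenMillis = Long.parseLong(eventData.get("timeTaken"));
    return eventTimeMillis - timeTakenMillis;
  }

  public static void main(String[] args) {
    Map<String, String> eventData = new HashMap<>();
    eventData.put("timeTaken", "4193"); // value from the TASK_FINISHED dump

    // 1628149977709 - 4193 = 1628149973516
    System.out.println(deriveStartTime(1628149977709L, eventData));
  }
}
```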

Contributor

Thanks for the note. Earlier code didn't populate START_TIME (& had only timeTaken) causing the issue.

// if DAG_PLAN is not filled already, let's try to fetch it from other
dagJson.getJSONObject(ATSConstants.OTHER_INFO).put(ATSConstants.DAG_PLAN, jsonObject
.getJSONObject(ATSConstants.OTHER_INFO).getJSONObject(ATSConstants.DAG_PLAN));
} else{
Contributor

minor: fix indent

Contributor Author

sure


public class DagOverviewAnalyzer extends TezAnalyzerBase implements Analyzer {
private final String[] headers =
{ "name", "id", "event_type", "status", "event_time", "event_time_str", "diagnostics" };
Contributor

Would it be possible to include the number of tasks assigned in the vertex as well? (It can be added in another field called "comments" or "additional info", which can be populated optionally.)
e.g. "numTasks: " + vertex.getNumTasks() + ", failedTasks: " + vertex.getFailedTasks().size()
+ ", completedTasks: " + vertex.getCompletedTasksCount()

Contributor Author

makes sense, I'm adding a "vertex_task_stats" before diagnostics column for better readability
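
A small sketch of how such a "vertex_task_stats" cell could be assembled, following the format in the review comment. The helper takes plain counts, so nothing here depends on the actual `VertexInfo` API; `VertexTaskStatsSketch` is a hypothetical name.

```java
/** Hypothetical illustration of the "vertex_task_stats" column content. */
final class VertexTaskStatsSketch {

  /** Formats the per-vertex task counts into one readable cell value. */
  static String vertexTaskStats(int numTasks, int failedTasks, int completedTasks) {
    return "numTasks: " + numTasks
        + ", failedTasks: " + failedTasks
        + ", completedTasks: " + completedTasks;
  }

  public static void main(String[] args) {
    // prints "numTasks: 10, failedTasks: 1, completedTasks: 9"
    System.out.println(vertexTaskStats(10, 1, 9));
  }
}
```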

@hadoop-yetus

💔 -1 overall

| Vote | Subsystem | Runtime | Comment |
|:----:|----------:|--------:|:--------|
| +0 🆗 | reexec | 0m 33s | Docker mode activated. |
| | | | _ Prechecks _ |
| +1 💚 | dupname | 0m 1s | No case conflicting files found. |
| +1 💚 | @author | 0m 0s | The patch does not contain any @author tags. |
| -1 ❌ | test4tests | 0m 0s | The patch doesn't appear to include any new or modified tests. Please justify why no new tests are needed for this patch. Also please list what manual steps were performed to verify this patch. |
| | | | _ master Compile Tests _ |
| +0 🆗 | mvndep | 4m 21s | Maven dependency ordering for branch |
| +1 💚 | mvninstall | 8m 52s | master passed |
| +1 💚 | compile | 1m 32s | master passed with JDK Ubuntu-11.0.11+9-Ubuntu-0ubuntu2.20.04 |
| +1 💚 | compile | 1m 29s | master passed with JDK Private Build-1.8.0_292-8u292-b10-0ubuntu1~20.04-b10 |
| +1 💚 | checkstyle | 1m 27s | master passed |
| +1 💚 | javadoc | 1m 31s | master passed with JDK Ubuntu-11.0.11+9-Ubuntu-0ubuntu2.20.04 |
| +1 💚 | javadoc | 1m 17s | master passed with JDK Private Build-1.8.0_292-8u292-b10-0ubuntu1~20.04-b10 |
| +0 🆗 | spotbugs | 0m 44s | Used deprecated FindBugs config; considering switching to SpotBugs. |
| +0 🆗 | findbugs | 0m 42s | tez-tools/analyzers/job-analyzer in master has 4 extant findbugs warnings. |
| -0 ⚠️ | patch | 1m 4s | Used diff version of patch file. Binary files and potentially other changes not applied. Please rebase and squash commits if necessary. |
| | | | _ Patch Compile Tests _ |
| +0 🆗 | mvndep | 0m 9s | Maven dependency ordering for patch |
| +1 💚 | mvninstall | 1m 7s | the patch passed |
| +1 💚 | compile | 0m 50s | the patch passed with JDK Ubuntu-11.0.11+9-Ubuntu-0ubuntu2.20.04 |
| +1 💚 | javac | 0m 50s | the patch passed |
| +1 💚 | compile | 0m 46s | the patch passed with JDK Private Build-1.8.0_292-8u292-b10-0ubuntu1~20.04-b10 |
| +1 💚 | javac | 0m 46s | the patch passed |
| -0 ⚠️ | checkstyle | 0m 10s | tez-tools/analyzers/job-analyzer: The patch generated 1 new + 66 unchanged - 2 fixed = 67 total (was 68) |
| +1 💚 | whitespace | 0m 0s | The patch has no whitespace issues. |
| +1 💚 | javadoc | 0m 38s | the patch passed with JDK Ubuntu-11.0.11+9-Ubuntu-0ubuntu2.20.04 |
| +1 💚 | javadoc | 0m 35s | the patch passed with JDK Private Build-1.8.0_292-8u292-b10-0ubuntu1~20.04-b10 |
| +1 💚 | findbugs | 1m 47s | the patch passed |
| | | | _ Other Tests _ |
| +1 💚 | unit | 0m 29s | tez-protobuf-history-plugin in the patch passed. |
| +1 💚 | unit | 2m 15s | tez-history-parser in the patch passed. |
| +1 💚 | unit | 2m 33s | job-analyzer in the patch passed. |
| +1 💚 | asflicense | 0m 29s | The patch does not generate ASF License warnings. |
| | | 37m 11s | |

| Subsystem | Report/Notes |
|----------:|:-------------|
| Docker | ClientAPI=1.41 ServerAPI=1.41 base: https://ci-hadoop.apache.org/job/tez-multibranch/job/PR-123/5/artifact/out/Dockerfile |
| GITHUB PR | #123 |
| JIRA Issue | TEZ-4231 |
| Optional Tests | dupname asflicense javac javadoc unit spotbugs findbugs checkstyle compile |
| uname | Linux ac8c6fb7724f 4.15.0-136-generic #140-Ubuntu SMP Thu Jan 28 05:20:47 UTC 2021 x86_64 x86_64 x86_64 GNU/Linux |
| Build tool | maven |
| Personality | personality/tez.sh |
| git revision | master / 464d86d |
| Default Java | Private Build-1.8.0_292-8u292-b10-0ubuntu1~20.04-b10 |
| Multi-JDK versions | /usr/lib/jvm/java-11-openjdk-amd64:Ubuntu-11.0.11+9-Ubuntu-0ubuntu2.20.04 /usr/lib/jvm/java-8-openjdk-amd64:Private Build-1.8.0_292-8u292-b10-0ubuntu1~20.04-b10 |
| checkstyle | https://ci-hadoop.apache.org/job/tez-multibranch/job/PR-123/5/artifact/out/diff-checkstyle-tez-tools_analyzers_job-analyzer.txt |
| Test Results | https://ci-hadoop.apache.org/job/tez-multibranch/job/PR-123/5/testReport/ |
| Max. process+thread count | 877 (vs. ulimit of 5500) |
| modules | C: tez-plugins/tez-protobuf-history-plugin tez-plugins/tez-history-parser tez-tools/analyzers/job-analyzer U: . |
| Console output | https://ci-hadoop.apache.org/job/tez-multibranch/job/PR-123/5/console |
| versions | git=2.25.1 maven=3.6.3 findbugs=3.0.1 |
| Powered by | Apache Yetus 0.12.0 https://yetus.apache.org |

This message was automatically generated.

Contributor

@rbalamohan rbalamohan left a comment

Latest patch LGTM. +1



@abstractdog abstractdog merged commit 3f541d0 into apache:master Aug 9, 2021