-
Notifications
You must be signed in to change notification settings - Fork 440
TEZ-4231: SimpleHistoryParser doesn't merge events correctly #123
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
|
@rbalamohan: could you please take a look? new analyzers + minor bugfixes |
|
@rbalamohan : ping, if you have some cycles to review this fix/improvement :) |
This comment has been minimized.
This comment has been minimized.
This comment has been minimized.
This comment has been minimized.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @abstractdog for the patch. Added review comments.
| } | ||
|
|
||
| // attempt_1599682376162_0006_27_00_000086_1 | ||
| int attemptNumber = Integer.parseInt(attempt.getTaskAttemptId().split("_")[6]); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Replace with "TezTaskAttemptID.fromString(attempt.getTaskAttemptId()).getId()}" ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
oh, right, what a hack this parseInt was :)
| // attempt_1599682376162_0006_27_00_000086_1 | ||
| int attemptNumber = Integer.parseInt(attempt.getTaskAttemptId().split("_")[6]); | ||
| if (attemptNumber == numAttemptsForTask - 1) { | ||
| thisTaskData.put("last_attempt_id", attempt.getTaskAttemptId()); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Declare as static final Strings?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
sure
|
|
||
| // attempt_1599682376162_0006_27_00_000086_1 | ||
| int attemptNumber = Integer.parseInt(attempt.getTaskAttemptId().split("_")[6]); | ||
| if (attemptNumber == numAttemptsForTask - 1) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sometimes all the attempts may get scheduled on the same node and fail. It will be good to understand that as well. Would you like to refactor it such that, it can provide all id/status and respective node details?.
While providing final detail, it can be a concatenated string as well (to make it readable and printable easily)
It will be nice to provide the info in this analyzer itself; (Without this info, user may have to co-relate with results of TaskAssignmentAnalyzer for node analysis).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
okay, adding status in "id/status" format to the "last_attempt_id" column
node info is already there in the last column: "last_attempt_node"
|
|
||
| @Override | ||
| public String getDescription() { | ||
| return "TaskHandAnalyzer can give quick insights about hanging tasks/task attempts" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Typo?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
fixing to simply "hanging task attempts"
| /** | ||
| * Get the Task assignments on different nodes of the cluster. | ||
| */ | ||
| public class TaskHangAnalyzer extends TezAnalyzerBase implements Analyzer { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Rename as "HungTaskAnalyzer" ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
okay, makes sense
| import java.util.Map; | ||
|
|
||
| /** | ||
| * Get the Task assignments on different nodes of the cluster. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fix comments as this analyser is related to hung task analysis
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
sure
| /** | ||
| * This analyzer is support to collect which nodes can be blamed for shuffle read errors. | ||
| */ | ||
| public class InputReadErrorAnalyzer extends TezAnalyzerBase implements Analyzer { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nice!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
thanks, also I'm fixing "is support to collect", now it doesn't seem to be correct somehow
|
|
||
| JSONObject otherInfo = new JSONObject(); | ||
| otherInfo.put(ATSConstants.START_TIME, startTime); | ||
| otherInfo.put(ATSConstants.START_TIME, event.getEventTime() - timeTaken); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
IIRC, there were corner cases where events will not be properly populated. May be in cases, where vertices were shutdown due to errors or so (need to check).
In such cases, this would have returned "-ve" value earlier.
Current patch seem to change the start_time, depending on getEventTime. This could give a perspective that the task/vertex was there for very short time.
Can you plz share more info on prev error? Were you getting -ve values earlier for which this is being modified?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
actually, I saw this for all protobuf history files, that's why I'm not suspecting corner case
I found that in case of a TASK_FINISHED finished event, there is always an event time (which is the end time obviously) and timeTaken in event_data, but there is no startTime there
this string is what the debugger writes for a HistoryLoggerProtos$HistoryEventProto instance while doing this conversion:
event_type: "TASK_FINISHED"
event_time: 1628149977709
app_id: "application_1628051798891_0030"
dag_id: "dag_1628051798891_0030_1"
vertex_id: "vertex_1628051798891_0030_1_00"
task_id: "task_1628051798891_0030_1_00_000001"
event_data {
key: "timeTaken"
value: "4193"
}
event_data {
key: "status"
value: "SUCCEEDED"
}
event_data {
key: "numFailedTaskAttempts"
value: "0"
}
event_data {
key: "successfulAttemptId"
value: "attempt_1628051798891_0030_1_00_000001_0"
}
event_data {
key: "diagnostics"
value: ""
}
event_data {
key: "counters"
value: "..."
}
the root cause of this behavior would be:
https://github.com/apache/tez/blob/master/tez-plugins/tez-protobuf-history-plugin/src/main/java/org/apache/tez/dag/history/logging/proto/HistoryEventProtoConverter.java#L392
private HistoryEventProto convertTaskFinishedEvent(TaskFinishedEvent event) {
HistoryEventProto.Builder builder = makeBuilderForEvent(event, event.getFinishTime(),
null, null, null, null, event.getTaskID(), null, null);
addEventData(builder, ATSConstants.TIME_TAKEN, (event.getFinishTime() - event.getStartTime()));
here I can see the builder consumes only event.getFinishTime() for the "time" parameter, and startTime is shipped indirectly...according to blame, this code part is unchanged since the introduction of proto history logger (TEZ-3915)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the note. Earlier code didn't populate START_TIME (& had only timeTaken) causing the issue.
| // if DAG_PLAN is not filled already, let's try to fetch it from other | ||
| dagJson.getJSONObject(ATSConstants.OTHER_INFO).put(ATSConstants.DAG_PLAN, jsonObject | ||
| .getJSONObject(ATSConstants.OTHER_INFO).getJSONObject(ATSConstants.DAG_PLAN)); | ||
| } else{ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
minor: fix indent
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
sure
|
|
||
| public class DagOverviewAnalyzer extends TezAnalyzerBase implements Analyzer { | ||
| private final String[] headers = | ||
| { "name", "id", "event_type", "status", "event_time", "event_time_str", "diagnostics" }; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Would it be possible to include the number of tasks assigned in the vertex as well (can be added in another field called "comments" or "additional info" which can be populated optionally).
e.g "numTasks: " vertex.getNumTasks() + ", failedTasks: " + vertex.getFailedTasks().size()
+ ", completedTasks: " + vertex.getCompletedTasksCount()
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
makes sense, I'm adding a "vertex_task_stats" before diagnostics column for better readability
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This comment has been minimized.
This comment has been minimized.
|
💔 -1 overall
This message was automatically generated. |
rbalamohan
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Latest patch LGTM. +1
|
|
||
| JSONObject otherInfo = new JSONObject(); | ||
| otherInfo.put(ATSConstants.START_TIME, startTime); | ||
| otherInfo.put(ATSConstants.START_TIME, event.getEventTime() - timeTaken); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the note. Earlier code didn't populate START_TIME (& had only timeTaken) causing the issue.
Problems fixed:
3. HistoryEventProtoJsonConversion: TASK_FINISHED event and VERTEX_FINISHED events don't contain starTime, only timeTaken, so as timeTaken is fix, startTime should be derivated from that, not the opposite way (this caused non-sense task durations in analyzers while parsing proto history files)
Refactoring:
removed configuration object from analyzers as TezAnalyzerBase is already a Configured class
New analyzers:
InputReadErrorAnalyzer
DagOverviewAnalyzer
TaskHangAnalyzer
attached example excel sheets generated with the analyzers to jira