HDFS-16595. Slow peer metrics - add median, median absolute deviation and upper latency limits#4357
Conversation
|
💔 -1 overall
This message was automatically generated. |
|
@tomscut @jojochuang @saintstack Could you please review this PR? The QA failures with |
|
💔 -1 overall
This message was automatically generated. |
|
💔 -1 overall
This message was automatically generated. |
tomscut
left a comment
There was a problem hiding this comment.
The whole thing looks good to me. I have two quick questions:
- Is there any further work after adding these metrics? Where are these metrics going to be used?
- We changed the JSON structure of
slowpeer, which may be an incompatible change. We need add some releaseNote.
There was a problem hiding this comment.
Can we do not define a new variable outlierMetrics1 here, just to make it a little bit cleaner.
There was a problem hiding this comment.
Sorry I didn't get it. It's already outlierMetrics1 :)
There was a problem hiding this comment.
I mean just write it like this:
tracker.addReport("node2", "node1", new OutlierMetrics(0.0, 0.0, 0.0, 1.2));
There was a problem hiding this comment.
Got it, addressed this change in the latest commit, thanks @tomscut
virajjasani
left a comment
There was a problem hiding this comment.
Is there any further work after adding these metrics? Where are these metrics going to be used?
I believe this is it, we don't seem to have any further addition to the metrics for slow latency measurements. These metrics are for operators to get better insights, e.g. if upper latency limit changes over time, some datanodes might be reported slow but they might actually not be slow. Whereas if upper bound doesn't change much and there is significant difference b/ reported latency of a datanode vs upper limit, then operator should immediately take actions.
We changed the JSON structure of slowpeer, which may be an incompatible change. We need add some releaseNote.
Sure let me add release node once the PR is merged, does it sound good?
Thanks @tomscut for the reviews!
There was a problem hiding this comment.
Sorry I didn't get it. It's already outlierMetrics1 :)
|
💔 -1 overall
This message was automatically generated. |
|
💔 -1 overall
This message was automatically generated. |
tomscut
left a comment
There was a problem hiding this comment.
Looks good from my side.
|
Hi @jojochuang , could you please also take a look. Thanks. |
There was a problem hiding this comment.
move .setMad to its own line?
| .setMedian(outlierMetrics.getMedian()).setMad(outlierMetrics.getMad()) | |
| .setMedian(outlierMetrics.getMedian()) | |
| .setMad(outlierMetrics.getMad()) |
There was a problem hiding this comment.
Done, thanks for the suggestion.
64a1805 to
e94f367
Compare
|
🎊 +1 overall
This message was automatically generated. |
|
💔 -1 overall
This message was automatically generated. |
|
Test failures in |
…its (apache#4357) Reviewed-by: Tao Li <tomscut@apache.org> Signed-off-by: Wei-Chiu Chuang <weichiu@apache.org>
…its (apache#4357) Reviewed-by: Tao Li <tomscut@apache.org> Signed-off-by: Wei-Chiu Chuang <weichiu@apache.org>
Description of PR
Slow datanode metrics include slow node and it's reporting node details. With HDFS-16582, we added the aggregate latency that is perceived by the reporting nodes.
In order to get more insights into how the outlier slownode's latencies differ from the rest of the nodes, we should also expose median, median absolute deviation and the calculated upper latency limit details.
How was this patch tested?
UTs and Dev cluster testing:
For code changes: