Conversation
Adds wrapper to topo conn that emits stats in all operations. Signed-off-by: Rafael Chacon <rafael@slack-corp.com>
Signed-off-by: Rafael Chacon <rafael@slack-corp.com>
Signed-off-by: Rafael Chacon <rafael@slack-corp.com>
Signed-off-by: Rafael Chacon <rafael@slack-corp.com>
go/vt/topo/stats_conn.go
Outdated
|
|
||
| var ( | ||
| topoStatsConnTimings = stats.NewMultiTimings( | ||
| "TopologyConn", |
There was a problem hiding this comment.
I don't see any reason why we need topoStatsConnCounts as well as topoStatsConnTimings since the histograms already include the count (along with the timing measures).
Furthermore I think we should name these: TopologyOperations (the timings) and TopologyErrors (counter)
There was a problem hiding this comment.
Oh I did not know this. Cool.
go/vt/topo/server.go
Outdated
| topoGlobalRoot = flag.String("topo_global_root", "", "the path of the global topology data in the global topology server") | ||
|
|
||
| // emitTopoStats turns on StatsConn and stats will be emitted when making calls to underlying lock server | ||
| emitTopoStats = flag.Bool("emit_topo_stats", false, "when true, stats will be emitted when calling lock server") |
There was a problem hiding this comment.
IMO we should do this decoration all the time and not require a runtime flag.
There's no real downside to having the metrics tracked.
brirams
left a comment
There was a problem hiding this comment.
seems reasonable to me. just nits that you may take or leave as you see fit
falun
left a comment
There was a problem hiding this comment.
Change seems pretty chill; one question about stat granularity
| startTime := time.Now() | ||
| statsKey := []string{"ListDir", st.cell} | ||
| defer topoStatsConnTimings.Record(statsKey, startTime) | ||
| res, err := st.conn.ListDir(ctx, dirPath, full) |
There was a problem hiding this comment.
I'm not sure how much visibility you want into the class of failures but it potentially useful to tag based on the error code. So something like adding:
// vt/topo/errors.go
// caveat: ideally we'd convert this back to *string for display or have a parallel ErrorCodeName
func ErrorCode(e error) *ErrorCode {
if e == nil { return nil }
switch err := e.(type) {
case *Error:
return err.code
case Error:
return err.code
default:
return nil
}
}and then in your error handling (I'm taking some liberty with go for brevity ❤️ ... and also eliding some of the necessary error handling / details)
if err != nil {
topoStatsConnErrors.Add(statsKey, int64(1))
topoStatsConnScopedErrors.Add([...statsKey, topo.ErrorCode(err)], int64(1))
}There was a problem hiding this comment.
I think for now I would like to keep the generic error count. I'm mostly interested in tracking general errors as we make the changes from skipping consul agent and talking to consul via the HA proxy vs normal operations.
Should we iterate in the future if we see need for more granularity ?
There was a problem hiding this comment.
Minor, not strong, objections; presumably something downstream of these calls is logging error codes so we can check aggregation for details if necessary?
My thought here is that if our error counts are non-zero (and we don't know because we don't have telemetry right now, correct?) then moving from non-zero to non-zero as we shift between consul and consul-via-proxy doesn't actually tell us if we've functionally changed the behavior since the type of errors are also important.
Two caveats
- If we know/suspect there are no errors in steady-state then this is less important
- If my read of
stats.CountersWithMultiLabelsis incorrect and it's more effort than just adding a single scoped counter then the effort:(payoff | code noise) ratio changes and it's less compelling for me
That said, if you want, I'm pretty okay pushing this into the the future should we need it.
There was a problem hiding this comment.
I agree with your reasoning here.
My current understanding is number 1. I'm expecting to have zero errors during normal operations. So thinking it should be ok.
4c2bec1 to
f00810f
Compare
Signed-off-by: Rafael Chacon <rafael@slack-corp.com>
f00810f to
87f35ad
Compare
|
Thanks for the feedback! I think all comments have been addressed. |
Signed-off-by: Rafael Chacon <rafael@slack-corp.com>
Description
We are in the process of doing a topology migration. While we are doing this, it will be helpful to have more telemetry into the operations that Vitess is doing to the underlying lock service.
The following PR adds
stats_conn. This is a decorator for the topology connection that emits stats when any operation is performed. It maintains stats per operation and cell.Tests
I added relevant unit tests and tested in our dev environment.