Skip to content

Adds stats to topo#4404

Merged
demmer merged 6 commits intovitessio:masterfrom
tinyspeck:adds-stats-to-topo
Dec 3, 2018
Merged

Adds stats to topo#4404
demmer merged 6 commits intovitessio:masterfrom
tinyspeck:adds-stats-to-topo

Conversation

@rafael
Copy link
Copy Markdown
Member

@rafael rafael commented Nov 29, 2018

Description

We are in the process of doing a topology migration. While we are doing this, it will be helpful to have more telemetry into the operations that Vitess is doing to the underlying lock service.

The following PR adds stats_conn. This is a decorator for the topology connection that emits stats when any operation is performed. It maintains stats per operation and cell.

Tests

I added relevant unit tests and tested in our dev environment.

Rafael Chacon added 4 commits November 28, 2018 15:37
Adds wrapper to topo conn that emits stats in all operations.

Signed-off-by: Rafael Chacon <rafael@slack-corp.com>
Signed-off-by: Rafael Chacon <rafael@slack-corp.com>
Signed-off-by: Rafael Chacon <rafael@slack-corp.com>
Signed-off-by: Rafael Chacon <rafael@slack-corp.com>
@rafael rafael requested a review from sougou as a code owner November 29, 2018 19:50
Copy link
Copy Markdown
Member

@demmer demmer left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Overall this looks good


var (
topoStatsConnTimings = stats.NewMultiTimings(
"TopologyConn",
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't see any reason why we need topoStatsConnCounts as well as topoStatsConnTimings since the histograms already include the count (along with the timing measures).

Furthermore I think we should name these: TopologyOperations (the timings) and TopologyErrors (counter)

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh I did not know this. Cool.

topoGlobalRoot = flag.String("topo_global_root", "", "the path of the global topology data in the global topology server")

// emitTopoStats turns on StatsConn and stats will be emitted when making calls to underlying lock server
emitTopoStats = flag.Bool("emit_topo_stats", false, "when true, stats will be emitted when calling lock server")
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

IMO we should do this decoration all the time and not require a runtime flag.

There's no real downside to having the metrics tracked.

Copy link
Copy Markdown
Collaborator

@brirams brirams left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

seems reasonable to me. just nits that you may take or leave as you see fit

Copy link
Copy Markdown
Contributor

@falun falun left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Change seems pretty chill; one question about stat granularity

startTime := time.Now()
statsKey := []string{"ListDir", st.cell}
defer topoStatsConnTimings.Record(statsKey, startTime)
res, err := st.conn.ListDir(ctx, dirPath, full)
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure how much visibility you want into the class of failures but it potentially useful to tag based on the error code. So something like adding:

// vt/topo/errors.go
// caveat: ideally we'd convert this back to *string for display or have a parallel ErrorCodeName
func ErrorCode(e error) *ErrorCode {
	if e == nil { return nil }
	switch err := e.(type) {
	case *Error:
		return err.code
	case Error:
		return err.code
	default:
		return nil
	}
}

and then in your error handling (I'm taking some liberty with go for brevity ❤️ ... and also eliding some of the necessary error handling / details)

if err != nil {
  topoStatsConnErrors.Add(statsKey, int64(1))
  topoStatsConnScopedErrors.Add([...statsKey, topo.ErrorCode(err)], int64(1))
}

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think for now I would like to keep the generic error count. I'm mostly interested in tracking general errors as we make the changes from skipping consul agent and talking to consul via the HA proxy vs normal operations.

Should we iterate in the future if we see need for more granularity ?

Copy link
Copy Markdown
Contributor

@falun falun Nov 29, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Minor, not strong, objections; presumably something downstream of these calls is logging error codes so we can check aggregation for details if necessary?

My thought here is that if our error counts are non-zero (and we don't know because we don't have telemetry right now, correct?) then moving from non-zero to non-zero as we shift between consul and consul-via-proxy doesn't actually tell us if we've functionally changed the behavior since the type of errors are also important.

Two caveats

  1. If we know/suspect there are no errors in steady-state then this is less important
  2. If my read of stats.CountersWithMultiLabels is incorrect and it's more effort than just adding a single scoped counter then the effort:(payoff | code noise) ratio changes and it's less compelling for me

That said, if you want, I'm pretty okay pushing this into the the future should we need it.

Copy link
Copy Markdown
Member Author

@rafael rafael Nov 29, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree with your reasoning here.

My current understanding is number 1. I'm expecting to have zero errors during normal operations. So thinking it should be ok.

@rafael rafael force-pushed the adds-stats-to-topo branch 2 times, most recently from 4c2bec1 to f00810f Compare November 29, 2018 21:36
Signed-off-by: Rafael Chacon <rafael@slack-corp.com>
@rafael rafael force-pushed the adds-stats-to-topo branch from f00810f to 87f35ad Compare November 29, 2018 21:38
@rafael
Copy link
Copy Markdown
Member Author

rafael commented Nov 29, 2018

Thanks for the feedback! I think all comments have been addressed.

Signed-off-by: Rafael Chacon <rafael@slack-corp.com>
@demmer demmer merged commit df486b6 into vitessio:master Dec 3, 2018
@rafael rafael deleted the adds-stats-to-topo branch December 3, 2018 16:21
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants