Add support for exporting metrics to Prometheus. #3784
michael-berlin merged 14 commits into vitessio:master
go/stats/counters.go
Outdated
| publish(name, t) | ||
| } | ||
|
|
||
| publishPullMultiGauges(t, name) |
There was a problem hiding this comment.
I think you should be able to do this by registering a new var hook using stats.Register, and have that function make these calls. This way, there will be no need to spray these additional publish calls through all the types.
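A minimal sketch of the hook idea, assuming a simplified stand-in for the stats package: `registry`, `Var`, `counterVar`, `gaugeVar` and `exportHook` are hypothetical names for illustration, not the actual Vitess types. The point is that a single hook, registered once, sees every published variable and can dispatch on its type, so the metric types themselves carry no backend-specific calls.

```go
package main

import (
	"fmt"
	"sync"
)

// Var is a stand-in for expvar.Var: anything that can render itself as a string.
type Var interface {
	String() string
}

// NewVarHook is called once for every published variable.
type NewVarHook func(name string, v Var)

// registry is a hypothetical, simplified publish point with one optional hook.
type registry struct {
	mu    sync.Mutex
	vars  map[string]Var
	order []string
	hook  NewVarHook
}

func newRegistry() *registry { return &registry{vars: map[string]Var{}} }

// Register installs the hook and replays variables published before it.
func (r *registry) Register(h NewVarHook) {
	r.mu.Lock()
	defer r.mu.Unlock()
	r.hook = h
	for _, name := range r.order {
		h(name, r.vars[name])
	}
}

// Publish stores the variable and notifies the hook, if one is installed.
func (r *registry) Publish(name string, v Var) {
	r.mu.Lock()
	defer r.mu.Unlock()
	r.vars[name] = v
	r.order = append(r.order, name)
	if r.hook != nil {
		r.hook(name, v)
	}
}

// counterVar and gaugeVar are toy metric types.
type counterVar int64

func (c counterVar) String() string { return fmt.Sprint(int64(c)) }

type gaugeVar int64

func (g gaugeVar) String() string { return fmt.Sprint(int64(g)) }

// exportHook dispatches on the concrete type and records what it would export.
func exportHook(seen *[]string) NewVarHook {
	return func(name string, v Var) {
		switch v.(type) {
		case counterVar:
			*seen = append(*seen, "counter:"+name)
		case gaugeVar:
			*seen = append(*seen, "gauge:"+name)
		default:
			// Unsupported types are skipped.
		}
	}
}

func main() {
	r := newRegistry()
	r.Publish("Early", counterVar(1))
	var seen []string
	r.Register(exportHook(&seen))
	r.Publish("Late", gaugeVar(2))
	fmt.Println(seen)
}
```

Replaying already-published variables inside Register means the order of backend registration and metric creation does not matter in this sketch.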
go/stats/export.go
Outdated
| } | ||
|
|
||
| // Help returns the help string | ||
| func (v *Int) Help() string { |
There was a problem hiding this comment.
For all these simple types Int is same as IntGauge. We might as well have only one of them. Maybe we can just rename Int to IntGauge.
go/stats/export.go
Outdated
| // String is the implementation of expvar.var | ||
| func (f IntFunc) String() string { | ||
| return strconv.FormatInt(f(), 10) | ||
| func (intFunc IntFunc) String() string { |
There was a problem hiding this comment.
Go style recommends using single letter names for simple types like this. I don't necessarily agree with it, but it's better to follow the guidelines :).
go/stats/prombackend/prombackend.go
Outdated
| func Init(namespace string) { | ||
| http.Handle("/metrics", promhttp.Handler()) | ||
| promBackend := &PromBackend{namespace: namespace} | ||
| stats.RegisterPullBackendImpl("prom", promBackend) |
There was a problem hiding this comment.
This should just call stats.Register with that new function that can call the various pull functions.
go/stats/prombackend/prombackend.go
Outdated
| return output | ||
| } | ||
|
|
||
| func toSnake(name string) string { |
There was a problem hiding this comment.
@michael-berlin is there a way you can export youtube's function that converts to snake-case? I'm assuming that it shouldn't have anything proprietary.
There was a problem hiding this comment.
Moving the snake case / kebab-case conversation here:
(From @michael-berlin in Reviewable:) Turns out snake_case uses underscores but internally we're doing kebab-case (with hyphens).
I've exported the code: #3816
But I'm not sure if it's worth it merging it.
Also note that we do not have special handling for certain words e.g. the resulting internal variable name is vtgate-v-schema-counts :)
Ooh. Thanks for exporting!
@sougou Thoughts here? I can also just change my toSnakeCase to follow what toKebabCase is doing with the regexp package (which looks significantly cleaner than my hack with runes), and add a test.
I would also prefer to do some special casing (For things like VSchema, VtGate) as part of the Prometheus export.
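A rough sketch of what a regexp-based snake_case conversion with special casing could look like. The regular expressions and the `specialCases` list are assumptions made for illustration; they are not the code from #3816 or from this PR.

```go
package main

import (
	"fmt"
	"regexp"
	"strings"
)

var (
	// Hypothetical special cases: collapse known compound words into one token
	// so that "VSchema" does not become "v_schema".
	specialCases = strings.NewReplacer("VSchema", "Vschema", "VtGate", "Vtgate")
	// Boundary between an acronym and a following word: "HTTPConn" -> "HTTP_Conn".
	acronymWord = regexp.MustCompile(`([A-Z]+)([A-Z][a-z])`)
	// Boundary between a lower-case letter or digit and an upper-case letter.
	lowerUpper = regexp.MustCompile(`([a-z0-9])([A-Z])`)
)

// toSnakeCase converts a CamelCase metric name to snake_case.
func toSnakeCase(name string) string {
	s := specialCases.Replace(name)
	s = acronymWord.ReplaceAllString(s, "${1}_${2}")
	s = lowerUpper.ReplaceAllString(s, "${1}_${2}")
	s = strings.ReplaceAll(s, "-", "_") // kebab-case input is accepted too
	return strings.ToLower(s)
}

func main() {
	for _, name := range []string{"VtgateVSchemaCounts", "HTTPConnCount", "MysqlServerConnSlow"} {
		fmt.Println(name, "=>", toSnakeCase(name))
	}
}
```

Without the special-case step, the same expressions would split "VSchema" into two words, which is the snake_case analogue of the vtgate-v-schema-counts name mentioned above.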
go/stats/prombackend/collectors.go
Outdated
| for i := range cutoffs { | ||
| key := float64(cutoffs[i]) / 1000000000 | ||
| //TODO(zmagg): int64 => uint64 conversion. error if it overflows? | ||
| output[key] = uint64(buckets[i]) + last |
There was a problem hiding this comment.
It seems to me the buckets should just be uint64 from the start.
There was a problem hiding this comment.
Hmm. They're sync2.AtomicInt64s in histogram.go.
There was a problem hiding this comment.
I think that's wrong too :)
Of course the "right" thing seems to me that we add a sync2.AtomicUint64 and change all the histogram metrics to use that so we don't have to do this dance here.
@sougou any opinion here?
There was a problem hiding this comment.
Agree that float64 should not be used as key because it's not an exact type. However, I recommend using sync2.AtomicInt64.
Basically, unsigned ints should only be used to represent non-numbers, like bitmaps. This is actually a coding standard within google, and I agree with it based on my own experience.
Using unsigned has the same problems of a const char *. It eventually poisons the entire code base.
Having said that, we have a few embarrassing places where uint64 is used as a number. It's mostly contained for now. Someday, we'll fix it :).
There was a problem hiding this comment.
Interesting. Well ok then, I guess we stick with signed ints and this casting since the prom API wants unsigned ints for bucket levels (which does still seem more logical to me).
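To make the casting discussed here concrete, a small sketch under stated assumptions: per-bucket counts are stored as signed int64 values with nanosecond cutoffs, and the exporter wants cumulative counts keyed by the upper bound in seconds (the Prometheus Go client takes a map[float64]uint64 for constant histograms). `cumulativeBuckets` is a hypothetical helper, and rejecting negative counts is just one possible answer to the overflow TODO in the diff.

```go
package main

import "fmt"

// cumulativeBuckets turns per-bucket counts (int64, nanosecond cutoffs) into
// cumulative counts keyed by the bucket's upper bound in seconds.
func cumulativeBuckets(cutoffsNs []int64, counts []int64) (map[float64]uint64, error) {
	if len(counts) < len(cutoffsNs) {
		return nil, fmt.Errorf("got %d counts for %d cutoffs", len(counts), len(cutoffsNs))
	}
	out := make(map[float64]uint64, len(cutoffsNs))
	var running uint64
	for i, cutoff := range cutoffsNs {
		if counts[i] < 0 {
			// A negative count would wrap around when cast to uint64.
			return nil, fmt.Errorf("bucket %d has negative count %d", i, counts[i])
		}
		running += uint64(counts[i])
		out[float64(cutoff)/1e9] = running
	}
	return out, nil
}

func main() {
	cutoffs := []int64{500000000, 1000000000, 5000000000}
	counts := []int64{1, 2, 3}
	buckets, err := cumulativeBuckets(cutoffs, counts)
	fmt.Println(buckets, err)
}
```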
2648606 to 53195d7
|
Updated description to account for new implementation changes and new TODOs. : ) |
go/stats/pull_backend.go
Outdated
| @@ -0,0 +1,27 @@ | |||
| package stats | |||
|
|
|||
| // PullBackend should be implemented to export pull metrics from Vitess. | |||
There was a problem hiding this comment.
With the new change to just use the stats.Publish hook I actually don't think we need this interface any more since it looks like we never use it anywhere.
That also would let all the prom registration functions be private to the package instead of public.
go/stats/prombackend/collectors.go
Outdated
|
|
||
| // metricsCollector collects both stats.Counters and stats.Gauges | ||
| type metricsCollector struct { | ||
| counters map[*stats.Counters]*prom.Desc |
There was a problem hiding this comment.
From what I can tell reading below, it looks like we never actually put more than one stats.Counters into any given metricsCollector, so can this be simplified to just have a pointer to the counter and a pointer to the Desc?
go/mysql/server.go
Outdated
| connAccept = stats.NewInt("MysqlServerConnAccepted") | ||
| connSlow = stats.NewInt("MysqlServerConnSlow") | ||
| timings = stats.NewTimings("MysqlServerTimings", "MySQL server timings") | ||
| connCount = stats.NewCounter("MysqlServerConnCount", "Connection count for MySQL servers") |
There was a problem hiding this comment.
connCount is a gauge since it is the "Active mysql server connections", while connAccept is the "Count of accepted mysql server connections".
IMO it would be clearer if we renamed the vars themselves to connsActive and connsAccepted.
There was a problem hiding this comment.
(Changed the help string but I didn't rename the vars.)
go/mysql/server.go
Outdated
| timings = stats.NewTimings("MysqlServerTimings", "MySQL server timings") | ||
| connCount = stats.NewCounter("MysqlServerConnCount", "Connection count for MySQL servers") | ||
| connAccept = stats.NewCounter("MysqlServerConnAccepted", "Connections accepted by MySQL server") | ||
| connSlow = stats.NewCounter("MysqlServerConnSlow", "Slow MySQL server connections") |
There was a problem hiding this comment.
A better description would be "Count of connections that took more than the configured mysql_slow_connect_warn_threshold to establish".
go/proc/counting_listener.go
Outdated
| ConnCount: stats.NewInt(countTag), | ||
| ConnAccept: stats.NewInt(acceptTag), | ||
| ConnCount: stats.NewCounter(countTag, "Connection count inside net.Listener"), | ||
| ConnAccept: stats.NewCounter(acceptTag, "Connections accepted inside net.Listener"), |
There was a problem hiding this comment.
ConnCount is also a Gauge. I'd recommend the same help string as with the mysql protocol case.
go/stats/counters.go
Outdated
| return v | ||
| } | ||
|
|
||
| // Add adds the provided value to the Int |
go/stats/counters.go
Outdated
| } | ||
|
|
||
| // ResetCounter resets a specific counter value to 0 | ||
| func (c *Counters) ResetCounter(name string) { |
There was a problem hiding this comment.
For consistency, I would change this function to be Reset(name string) and the above one ResetAll().
| UserTableQueryCount = stats.NewMultiCounters("UserTableQueryCount", []string{"TableName", "CallerID", "Type"}) | ||
| UserTableQueryCount = stats.NewCountersWithMultiLabels( | ||
| "UserTableQueryCount", | ||
| "Number of queries received for each CallerID/table comb", |
| UserTableQueryTimesNs = stats.NewMultiCounters("UserTableQueryTimesNs", []string{"TableName", "CallerID", "Type"}) | ||
| UserTableQueryTimesNs = stats.NewCountersWithMultiLabels( | ||
| "UserTableQueryTimesNs", | ||
| "Shows total latency for each CallerID/table combo", |
There was a problem hiding this comment.
Again I think drop "Shows" and change "combo" to "combination"
| // And we can remove the tsOnce variable. | ||
| tsOnce.Do(func() { | ||
| stats.Publish("TabletState", stats.IntFunc(func() int64 { | ||
| stats.NewGaugeFunc("TabletState", "Tablet server's state", (func() int64 { |
| queueExceeded = stats.NewCounters("TxSerializerQueueExceeded") | ||
| queueExceeded = stats.NewCountersWithLabels( | ||
| "TxSerializerQueueExceeded", | ||
| "Number of transactions that were rejcted because the max queue size per row range was exceeded", |
go/vt/worker/worker.go
Outdated
| []string{"Keyspace", "ShardName", "ThreadId"}) | ||
| // statsStateDurations tracks for each state how much time was spent in it. Mainly used for testing. | ||
| statsStateDurationsNs = stats.NewCounters("WorkerStateDurations") | ||
| statsStateDurationsNs = stats.NewGaugesWithLabels("WorkerStateDurations", "How much time was spent in each state", "state") |
There was a problem hiding this comment.
Hmm. It's a good idea. Followup PR? I'd like to avoid breaking more in this one. : )
|
Reviewed 5 of 43 files at r2, 3 of 19 files at r3, 3 of 12 files at r4.
go/cmd/vtgate/plugin_prombackend.go, line 27 at r4 (raw file):
Is it necessary to have a different namespace for each binary? If so, what about other binaries which we have e.g. vtctld and vtworker? Does Prometheus require that? Its internal Google counterpart doesn't care and instead can distinguish binaries by their production job name ;)
go/stats/export.go, line 266 at r4 (raw file):
As @demmer pointed out before, you'll probably need a CounterFunc which will be the renamed version of this.
Note that the removal/rename of this will break the Google internal plugins. Given that, please give me or @alainjobart a head's up before merging this. This way, we can change all internal callers first and verify that they won't break due to this change.
go/stats/prombackend/prombackend.go, line 144 at r1 (raw file):
Previously, sougou (Sugu Sougoumarane) wrote…
Turns out snake_case uses underscores but internally we're doing kebab-case (with hyphens). I've exported the code: #3816 But I'm not sure if it's worth it merging it. Also note that we do not have special handling for certain words e.g. the resulting internal variable name is vtgate-v-schema-counts :)
go/stats/prombackend/prombackend.go, line 11 at r4 (raw file):
Given this style guide recommendation, you shouldn't rename the import to "prom". (In general, I personally don't like abbreviations (unless it's a local variable or a Go receiver name). E.g. "prombackend" is less descriptive than "prometheus_backend". Similar comment for the other names e.g. "PublishPromMetric".)
go/vt/vtgate/masterbuffer/masterbuffer.go, line 50 at r4 (raw file):
Head's up: This is dead code which I'm going to delete here: #3814
go/vt/worker/worker.go, line 63 at r4 (raw file):
Can you please delete the comments since you copied them into the description now? Please make sure that no information gets lost e.g. the description for this stats var misses the examples
Comments from Reviewable
|
Replying here in a few comments, as I can't respond in Reviewable for now. (I need to check with some internal people about the permissions that responding in Reviewable requires (it asked for write perms to all my orgs...?)) Small stuff first:
@demmer and I talked about this a while back, but I forget why we landed on this. Prometheus does let you specify a namespace per binary in the static scrape config, which we could use instead of putting the namespace here. @demmer, any new thoughts?
That sounds good actually, I'll do a rename pass.
Thanks for the head's up : )
Will do! |
| return strconv.FormatInt(int64(v.i.Get()), 10) | ||
| } | ||
|
|
||
| // IntFunc converts a function that returns |
There was a problem hiding this comment.
(From @michael-berlin in Reviewable): As @demmer pointed out before, you'll probably need a CounterFunc which will be the renamed version of this.
Note that the removal/rename of this will break the Google internal plugins. Given that, please give me or @alainjobart a head's up before merging this. This way, we can change all internal callers first and verify that they won't break due to this change.
Yes! It's changed to a GaugeFunc in a9d7fef
I think we'll also need a CounterFunc, for the ones that are Counters and not Gauges.
re: breaking the Google internal plugins
I'd love to understand how the Google internal plugins use this so that I can avoid breaking it. Are you using the expvar hook or some other way of exporting and converting the expvars to the internal system?
Is it the removal here that is likely to break (here, we went from IntFunc as a parameter to stats.Publish to explicitly creating a new GaugeFunc), or is it any/all of the metric type renames? If it's the latter, head's up that I actually renamed almost all of the metric names. Some of these were to more clearly indicate the Gauge/Counter difference. (Full map in PR description).
Will that require a lot of work on y'all's part to update the plugins internally?
There was a problem hiding this comment.
Internal code basically just adds additional stats variables.
Examples:
stats.Publish("XXX", stats.IntFunc(<some method>))
stats.Publish("XXX", stats.FloatFunc(func() float64 { return float64(<some method>) }))
timings: stats.NewMultiTimings("VtgateXxxApi", []string{"Operation", "Keyspace", "DbType"}),
errorCounters = stats.NewCounters("XXXErrors", "A", "B")
requests = stats.NewMultiCounters("XXX", []string{"A", "B", "C"})
As long as it's easy to rewrite this code, it won't be a lot of work and fine.
go/cmd/vtgate/plugin_prombackend.go
Outdated
| ) | ||
|
|
||
| func init() { | ||
| prombackend.Init("vtgate") |
There was a problem hiding this comment.
Starting a thread here instead of using Reviewable:
I wrote:
Is it necessary to have a different namespace for each binary?
If so, what about other binaries which we have e.g. vtctld and vtworker?
Does Prometheus require that? Its internal Google counterpart doesn't care and instead can distinguish binaries by their production job name ;)
zmagg@ responded:
@demmer and I talked about this a while back, but I forget why we landed on this. Prometheus does let you specify a namespace per binary in the static scrape config, which we could use instead of putting the namespace here. @demmer, any new thoughts?
The plugin here actually does two things:
- Install the "/metrics" handler.
- Set the namespace.
Would it make sense to move 1. to our servenv instead and install the handler by default? What are the downsides of having the handler in there by default?
re: 2. If setting the namespace is a crucial aspect, I would prefer to do it in code. This way it's more reusable. Otherwise, other people will ask you for this static map config ;) If you set the namespace in code, my comment is that you should set the namespace for all binaries, not just vtgate and vttablet.
There was a problem hiding this comment.
Oh interesting.
- I think moving it into servenv makes sense, location-wise. The thought here was that we wanted to be able to not have any Prometheus-stuff running if a build didn't want it, but I think we can handle that inside servenv with a flag.
- Yeah, setting the namespace is crucial. I've added one for the vtworker and vtctl binaries now; are there any I missed that also have metrics? Those are all the binaries that we use at Slack.
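Since the namespace is set in code, the exporter has to combine it with metric names that often already start with the component name. A small sketch of one way to avoid the doubled prefix (vtgate_vtgate_...) described in the PR description; `metricName` is a hypothetical helper and assumes the name has already been converted to snake_case.

```go
package main

import (
	"fmt"
	"strings"
)

// metricName joins a namespace and a snake_case metric name, skipping the
// namespace when the name already carries it as a prefix.
func metricName(namespace, snakeName string) string {
	if namespace == "" {
		return snakeName
	}
	prefix := namespace + "_"
	if strings.HasPrefix(snakeName, prefix) {
		return snakeName
	}
	return prefix + snakeName
}

func main() {
	fmt.Println(metricName("vtgate", "vtgate_vschema_counts"))
	fmt.Println(metricName("vtgate", "queries_processed"))
}
```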
|
Updated the PR description for what's left. Commit I just pushed included responses to all the review comments except:
Also, I still need to add unit tests, and we need to resolve what to do about this toSnakeCase thing. |
|
@demmer Can you take another look? The PR description is up to date with my latest changes. : ) One thing I'm considering adding is preventing Counter-type metrics from calling |
demmer
left a comment
There was a problem hiding this comment.
This is looking really really close!
Most of these comments relate to description nit picks.
The only substantive issue relates to the need for Counter/Gauge variants of DurationFunc.
Also, as I'm sure you noticed, there are merge conflicts and test / CodeClimate cleanup to do.
There was a problem hiding this comment.
Rather than make each of the plugins pull in the promhttp details and register the handler, why not move this inside prometheusbackend.Init?
go/mysql/server.go
Outdated
There was a problem hiding this comment.
Admittedly this is taste, but I find the use of "Count of" to be unnecessary in this and other descriptions. Ideally we should be consistent one way or the other and either always use it or never use it.
Personally I find "Accepted MySQL server connections" to be sufficiently descriptive and more consistent with the gauge case (where we don't say "Gauge of active MySQL server connections").
go/proc/counting_listener.go
Outdated
There was a problem hiding this comment.
It has always been confusing to me that ConnCount and ConnAccepted don't have more descriptive names, and these generic descriptions don't help either :)
While I'm happy to defer this to a subsequent improvement, IMO we should change the proc.Listen API to include a description string and pass that here as well along with the countTag.
Then these would be:
ConnCount: stats.NewGauge(countTag, fmt.Sprintf("Active %s connections", description)),
ConnAccept: stats.NewCounter(acceptTag, fmt.Sprintf("Accepted %s connections", description)),
Since these are only used for HTTP, we would then change servenv.Run to include "HTTP" as the description string.
I also think we would be well served to rename the generic metric names as well to be HTTPConnCount and HTTPConnAccepted but that would break backwards compatibility. @sougou / @michael-berlin thoughts on this?
There was a problem hiding this comment.
Agreed, but let's do it in a follow-on?
There was a problem hiding this comment.
Yup. Would still like to hear from @sougou and @michael-berlin about the backwards-incompatibility of this proposed change.
go/stats/counters.go
Outdated
There was a problem hiding this comment.
FWIW the prom API includes an Inc() method which seems nicer than Add(1), especially if we follow through with the idea to prevent Add of a negative number.
There was a problem hiding this comment.
Added logging for now!
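A sketch of the behaviour discussed in this thread, assuming a simplified counter type rather than the real stats.Counter: Inc() as a convenience for Add(1), and a warning (but no rejection) when Add is called with a negative delta. The `warn` field is a hypothetical injection point so the logging can be swapped out.

```go
package main

import (
	"log"
	"sync/atomic"
)

// Counter is a monotonically increasing value; decrements are only logged here.
type Counter struct {
	value int64 // accessed atomically; kept first for 64-bit alignment
	help  string
	warn  func(format string, args ...interface{})
}

// NewCounter returns a counter that warns through the standard logger.
func NewCounter(help string) *Counter {
	return &Counter{help: help, warn: log.Printf}
}

// Add adds delta to the counter and warns if delta is negative.
func (c *Counter) Add(delta int64) {
	if delta < 0 {
		c.warn("adding a negative value to a counter: %d", delta)
	}
	atomic.AddInt64(&c.value, delta)
}

// Inc increments the counter by one.
func (c *Counter) Inc() { c.Add(1) }

// Get returns the current value.
func (c *Counter) Get() int64 { return atomic.LoadInt64(&c.value) }

// Help returns the description string.
func (c *Counter) Help() string { return c.help }

func main() {
	c := NewCounter("Accepted MySQL server connections")
	c.Inc()
	c.Add(-1) // logs a warning but still applies the delta
	log.Printf("%s = %d", c.Help(), c.Get())
}
```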
There was a problem hiding this comment.
Why do we need this conversion and the ValueType enum at all? It seems like holdover from the more generic pullbackend design, and with this current approach we can avoid the conversion by just passing the appropriate prometheus metric type along when we instantiate the collector.
| stats.NewGaugeFunc(name+"Active", "Tablet server conn pool active", cp.Active) | ||
| stats.NewGaugeFunc(name+"InUse", "Tablet server conn pool in use", cp.InUse) | ||
| stats.NewGaugeFunc(name+"MaxCap", "Tablet server conn pool max cap", cp.MaxCap) | ||
| stats.NewGaugeFunc(name+"WaitCount", "Tablet server conn pool wait count", cp.WaitCount) |
There was a problem hiding this comment.
TableACLExemptCount is a Counter.
There was a problem hiding this comment.
QueryCacheEvictions is a counter
There was a problem hiding this comment.
More description nits: "Shows the..." should be just "Wait operations" I think
There was a problem hiding this comment.
I think:
"Critical errors"
"Internal component errors"
...
|
Ok! Responded to your comments. Major changes:
I did a best effort pass through of which of the DurationFuncs were gauges/counters, but would love an extra eye on that @demmer. Also rebased, made all the other changes I 👍'd to your comments on. What's left:
|
3ca8b93 to e298278
go/stats/kebab_case_converter.go
Outdated
There was a problem hiding this comment.
I added the toSnakeCase modifications based on the toKebabCase stuff. I'm not totally convinced it structurally makes sense to put it here (at minimum, I should rename the file if it otherwise makes sense to you).
Also, it shares a memoizer with the toKebabCase, which should be fine (one build should only use one or the other), but perhaps they should be separate in the code too?
@demmer, thoughts?
There was a problem hiding this comment.
For safety and cleanliness I actually don't think they should share a memoizer.
Given that... I don't think there's actually any substantive shared code any more, so rather than misleadingly sharing the file name, I'd say you should just create a snake_case_converter.go with the new code and a separate memoizer.
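To illustrate why separate memoizers are the safer choice: a cache keyed only by the input name would hand a kebab-case result to a snake_case caller if both converters shared it. The sketch below is an assumption-laden toy (`memoizer`, `newMemoizer` and `withSeparator` are hypothetical, and the conversion is deliberately naive), with each converter owning its own cache.

```go
package main

import (
	"fmt"
	"strings"
	"sync"
	"unicode"
)

// memoizer caches the results of a single conversion function.
type memoizer struct {
	mu      sync.Mutex
	memo    map[string]string
	convert func(string) string
}

func newMemoizer(convert func(string) string) *memoizer {
	return &memoizer{memo: map[string]string{}, convert: convert}
}

// get returns the cached conversion, computing it on first use.
func (m *memoizer) get(name string) string {
	m.mu.Lock()
	defer m.mu.Unlock()
	if v, ok := m.memo[name]; ok {
		return v
	}
	v := m.convert(name)
	m.memo[name] = v
	return v
}

// withSeparator is a naive converter: lower-case everything and put sep
// before each upper-case letter except the first.
func withSeparator(sep string) func(string) string {
	return func(name string) string {
		var b strings.Builder
		for i, r := range name {
			if i > 0 && unicode.IsUpper(r) {
				b.WriteString(sep)
			}
			b.WriteRune(unicode.ToLower(r))
		}
		return b.String()
	}
}

// Each converter gets its own cache, so results can never be mixed up.
var (
	snakeMemoizer = newMemoizer(withSeparator("_"))
	kebabMemoizer = newMemoizer(withSeparator("-"))
)

func main() {
	fmt.Println(snakeMemoizer.get("ConnCount"), kebabMemoizer.get("ConnCount"))
}
```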
|
@demmer Thoughts now? |
|
From my standpoint the only remaining item before I think this is good to merge is the minor reorganization of where the snake conversion goes. @michael-berlin / @alainjobart since this will entail some breaking changes with internal Google importing I would like a 👍 from one of you that this is ok to merge. @sougou told me in DM he doesn't care, so he doesn't get a vote on this one. |
|
I do care, but I know it's in good hands with @michael-berlin and @alainjobart :) |
go/stats/counters.go
Outdated
There was a problem hiding this comment.
How likely is it that this will get called?
If it's not likely at all because you already checked all callers in open-source, then I'm okay with it as is.
Otherwise, it may be a good idea to use a throttled logger here instead. This way we would prevent bad code from spamming the log, filling the disk and potentially slowing down the process.
There was a problem hiding this comment.
Oh, that's a good idea. I've checked all the easy sites (Add(-1)) but there are a lot more that wind their way through the code that I don't have confidence in, hence logging for now instead of banning decrementing altogether. I switched to using a throttled logger here. I also switched to using a throttled logger for the unsupported export type logger in prometheusbackend.go, thanks for the suggestion!
|
It looks like this is good to merge? If so, let's do as follows:
Let me try to get to this today or tomorrow. |
|
@michael-berlin That sounds awesome! Let me figure out what's going on with these failing TravisCI tests after my last commit. The tests still succeed for me locally, so not sure. Debugging now. |
go/stats/export_test.go
Outdated
There was a problem hiding this comment.
These two variables are unused.
Same for TestDurationFunc below.
Also: The Register() here seems to be unnecessary as well?
Can you please double-check this file and report back? Thanks!
There was a problem hiding this comment.
Oh thanks for the catch. I added tests against gotname and gotv for GaugeFunc and DurationFunc.
|
I have an internal import open with all the necessary changes to our internal code. Tests are passing now. Can you please look at my last comments and address them in additional CLs? Once that's done, I'll patch them into my pending import as well and then we can wrap things up. |
|
@michael-berlin Yay, awesome! I responded to your comments and rebased to resolve the merge conflict. |
|
I ran into the merge conflict during the import as well. Therefore, I just reverted that PR. Given that, can you please rebase again against the latest master? |
Signed-off-by: Maggie Zhou <mzhou@slack-corp.com>
* Fix the nanoseconds => ints for histogram buckets.
* Gigantic rename of metrics:
  - Int => Counter
  - IntGauge => Gauge
  - IntFunc => GaugeFunc
  - Counters => CountersWithLabels
  - MultiCounters => CountersWithMultiLabels
  - MultiCounterFunc => CountersFuncWithMultiLabels
  - Gauges => GaugesWithLabels
  - MultiGauges => GaugesWithMultiLabels
* Add an explicit labelName for CountersWithLabels & GaugesWithLabels
Signed-off-by: Maggie Zhou <mzhou@slack-corp.com>
pull_backend.go stuff into prombackend/prombackend.go's Register call and make the pull_backend.go an interface only. Also, fix some more types from the gigantic refactor, and run GoLint/GoVet Signed-off-by: Maggie Zhou <mzhou@slack-corp.com>
Fix tests to use the new metric names.
Fix tests that expected gauges and not counters. (more to come on this thing, but these were the tests that indicated gauges early)
Fix the `Set()` implementation for GaugesWithMultiLabels.
Signed-off-by: Maggie Zhou <mzhou@slack-corp.com>
Add plugins for each of the components.
Rename prombackend => prometheusbackend
Rename publishPromMetric to publishPrometheusMetric
Rename Reset() => ResetAll()
ResetCounter => Reset()
Rename various Counters => Gauges per Mike's helpful pointers.
Rename a few help strings.
Remove the comments in go/vt/worker/worker.go metrics and move them 100% into help functions.
Add a GaugeFuncWithMultiLabels
Add a CounterFunc
Signed-off-by: Maggie Zhou <mzhou@slack-corp.com>
Convert Nanoseconds to seconds
Call Init (handler) inside servenv.OnRun()
Add unit tests.
For counter, it doesn't make sense to have a function called ResetAll and not one called Reset so re-rename that.
Signed-off-by: Maggie Zhou <mzhou@slack-corp.com>
…icFunc` that allows us to both export ints (like it did before) and now also durations. - Moved Prom specific stuff in the plugins to the prometheus backend itself. - A ton of stats help string fix ups. - some counters are gauges, and some gauges are counters! Fix those. Signed-off-by: Maggie Zhou <mzhou@slack-corp.com>
case. Log a warning when we call Add() on a counter with a negative number. Fix the double underscore in the prometheus exported metrics when we dedupe the namespace. Signed-off-by: Maggie Zhou <mzhou@slack-corp.com>
Add a conversion function from - to _ and add unit test. Signed-off-by: Maggie Zhou <mzhou@slack-corp.com>
counters. Remove some unnecessary commented-out code. Signed-off-by: Maggie Zhou <mzhou@slack-corp.com>
Signed-off-by: Maggie Zhou <mzhou@slack-corp.com>
Signed-off-by: Maggie Zhou <mzhou@slack-corp.com>
Don't export counters. Refactor prometheusbackend.go code to collect CountersWithLabels and GaugesWithLabels explicitly instead of relying on underlying unexported-now Counters. Signed-off-by: Maggie Zhou <mzhou@slack-corp.com>
6de12e7 to 8b2491a
|
I just did it myself and pushed it to your branch. |
An existing test used this functionality. When we stopped exporting the "counters" type, the test lost that functionality. Instead of relying on an implementation detail, there is a proper public method now. Signed-off-by: Michael Berlin <mberlin@google.com>
|
Internal import is done. I'll merge this as soon as Travis has passed. |
|
The race unit test did flake and therefore test shard 0 failed. I'm ignoring that and will merge this now. |
vitessio#3784 introduced two layers of code for "stats.Counter" (and the new type "stats.Gauge"). As a consequence, it was for example possible to create a "stats.Gauge" for a time.Duration value. This approach did not work well with our internal usage of the stats package. Therefore, I reversed this and simplified the code:
- MetricFunc() interface was removed. Instead, a CounterFunc or a GaugeFunc requires a simple func() int64 as input.
- IntFunc() removed. Before vitessio#3784, it was used to implement the expvar.Var interface. But now, this is taken care of by the types themselves, e.g. "stats.CounterFunc". Therefore, we do not need it anymore and users of the "stats" package can just pass a plain func() int64.
- Added types "Duration" and "DurationFunc" back.
- Added "Variable" interface. This made it possible to simplify the Prometheus code, which depends on the Help() method for each stats variable.
- Prometheus: Conversion to float64 values is now done in prometheusbackend.go and removed from the stats package.
BUG=78571948
Signed-off-by: Michael Berlin <mberlin@google.com>
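A rough sketch of the shape this follow-up commit describes, with assumed signatures: a Variable interface exposing Help(), a CounterFunc built from a plain func() int64, and the float64 conversion happening on the exporter side. The names mirror the commit message, but the definitions here are illustrative guesses, not the actual stats package code; `sampleValue` is a hypothetical helper.

```go
package main

import (
	"fmt"
	"strconv"
)

// Variable is what an exporter needs from every stats variable (assumed shape).
type Variable interface {
	Help() string   // description string used by the exporter
	String() string // expvar-style rendering
}

// CounterFunc wraps a plain func() int64; no IntFunc adapter is needed.
type CounterFunc struct {
	f    func() int64
	help string
}

// NewCounterFunc only builds the value; publishing is left out of this sketch.
func NewCounterFunc(name, help string, f func() int64) *CounterFunc {
	_ = name // a real implementation would publish under this name
	return &CounterFunc{f: f, help: help}
}

func (cf *CounterFunc) Help() string   { return cf.help }
func (cf *CounterFunc) String() string { return strconv.FormatInt(cf.f(), 10) }
func (cf *CounterFunc) Get() int64     { return cf.f() }

// sampleValue does the float64 conversion at the exporter boundary,
// keeping the stats types themselves integer-only.
func sampleValue(get func() int64) float64 { return float64(get()) }

func main() {
	var v Variable = NewCounterFunc("Requests", "Requests served", func() int64 { return 42 })
	fmt.Println(v.Help(), v.String())
}
```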
Description
WIP PR to export metrics to Prometheus, as discussed in #3644
I've gotten feedback from @demmer on the approach in this PR, but I'd like to get some feedback from others as well, especially around metric names and the API exposed. Despite the long list of changes below, there are no breaking changes in this PR for anybody that relies on the expvar stats.
Metrics backend changes / naming stuff
- pull_backend.go which supports registering generic pull backends which can scrape for metrics.
- Right now, it also adds the name of the component as a namespace to the metric, but that produces some metric names that have duplicate prefixes, as we already prefix a lot of the Vitess metric names with the component type (e.g. VtgateVschemaCounts turns into vtgate_vtgate_vschema_count)
Instrumentation changes
- Adds Gauge-style types (Gauge, GaugeWithLabels, etc.). Prometheus expects Gauges and Counters to be separate types and we need this information in order to specify what kind of metric we're exporting.
- Renamed all metric types. Mapping of rename:
  - Int => Counter / Gauge (depending)
  - IntFunc => GaugeFunc
  - Counters => CountersWithLabels / GaugesWithLabels (depending)
  - MultiCounters => CountersWithMultiLabels / GaugesWithMultiLabels (depending)
  - MultiCounterFunc => CountersFuncWithMultiLabels
- Adds a description string per metric.
- Adds an explicit labelName for the tags in MultiCounters. Prometheus's labels are all key/value pairs.
- Changes calls to stats.Publish(<IntFunc>/<DurationFunc>) to call stats.NewIntFunc / stats.NewDurationFunc explicitly instead.
TODO still
- prombackend plugin for each component with metrics to be exported
Metric types not exported
NewString / NewStringMap / PublishJSONFunc
Prom. doesn't handle string values. Some of these could be converted into different types, if we'd like to export the metrics anyway.
NewFloat
This isn't called anywhere today. We could maybe export them anyway in case there are any callers in the future.
direct calls to stats.Publish(...CounterFunc())
I think this should get ported in a followup PR, possibly refactoring it to use MultiCounterFunc.
Additional things to do in follow-on PRs