Skip to content

go/{stats,vt}: publish VReplicationStreamState to prometheus backend#12772

Merged
mattlord merged 4 commits intovitessio:mainfrom
planetscale:maxeng-vrep-prom-metrics-workflow-status
Mar 31, 2023
Merged

go/{stats,vt}: publish VReplicationStreamState to prometheus backend#12772
mattlord merged 4 commits intovitessio:mainfrom
planetscale:maxeng-vrep-prom-metrics-workflow-status

Conversation

@maxenglander
Copy link
Copy Markdown
Collaborator

@maxenglander maxenglander commented Mar 30, 2023

Description

In order to make it easier to monitor VReplication workflow status, expose the existing VReplicationStreamState as a 1-valued Prometheus gauge. S/o @mcrauwel for the idea.

Related Issue(s)

Checklist

  • "Backport to:" labels have been added if this change should be back-ported
  • Tests were added or are not required
  • Did the new or modified tests pass consistently locally and on the CI
  • Documentation was added or is not required

Signed-off-by: Max Englander <max@planetscale.com>
Signed-off-by: Max Englander <max@planetscale.com>
@vitess-bot vitess-bot bot added NeedsDescriptionUpdate The description is not clear or comprehensive enough, and needs work NeedsWebsiteDocsUpdate What it says labels Mar 30, 2023
@vitess-bot
Copy link
Copy Markdown
Contributor

vitess-bot bot commented Mar 30, 2023

Review Checklist

Hello reviewers! 👋 Please follow this checklist when reviewing this Pull Request.

General

  • Ensure that the Pull Request has a descriptive title.
  • If this is a change that users need to know about, please apply the release notes (needs details) label so that merging is blocked unless the summary release notes document is included.
  • If a test is added or modified, there should be a documentation on top of the test to explain what the expected behavior is what the test does.

If a new flag is being introduced:

  • Is it really necessary to add this flag?
  • Flag names should be clear and intuitive (as far as possible)
  • Help text should be descriptive.
  • Flag names should use dashes (-) as word separators rather than underscores (_).

If a workflow is added or modified:

  • Each item in Jobs should be named in order to mark it as required.
  • If the workflow should be required, the maintainer team should be notified.

Bug fixes

  • There should be at least one unit or end-to-end test.
  • The Pull Request description should include a link to an issue that describes the bug.

Non-trivial changes

  • There should be some code comments as to why things are implemented the way they are.

New/Existing features

  • Should be documented, either by modifying the existing documentation or creating new documentation.
  • New features should have a link to a feature request issue or an RFC that documents the use cases, corner cases and test cases.

Backward compatibility

  • Protobuf changes should be wire-compatible.
  • Changes to _vt tables and RPCs need to be backward compatible.
  • vtctl command output order should be stable and awk-able.
  • RPC changes should be compatible with vitess-operator
  • If a flag is removed, then it should also be removed from VTop, if used there.

@maxenglander maxenglander changed the title Maxeng vrep prom metrics workflow status go/{stats,vt}: publish VReplicationStreamState to prometheus backend Mar 30, 2023
Signed-off-by: Max Englander <max@planetscale.com>
@maxenglander maxenglander force-pushed the maxeng-vrep-prom-metrics-workflow-status branch from 5427609 to e5b245b Compare March 30, 2023 06:12
@maxenglander maxenglander added Component: VReplication Component: Observability Pull requests that touch tracing/metrics/monitoring Type: Enhancement Logical improvement (somewhere between a bug and feature) labels Mar 30, 2023
@maxenglander maxenglander marked this pull request as ready for review March 30, 2023 07:18
Copy link
Copy Markdown
Member

@mattlord mattlord left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I like the new metrics! I don't think we need the new stats code though. Let me know if I'm missing something.

Comment on lines +124 to +137
// StringMapFuncWithMultiLabels is a multidimensional string map publisher.
//
// Map keys are compound names made with joining multiple strings with '.',
// and are named by corresponding key labels.
//
// Map values are any string, and are named by the value label.
//
// Since the map is returned by the function, we assume it's in the right
// format (meaning each key is of the form 'aaa.bbb.ccc' with as many elements
// as there are in Labels).
//
// Backends which need to provide a numeric value can set a constant value of 1
// (or whatever is appropriate for the backend) for each key-value pair present
// in the map.
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm curious why we can't use a variation of this:

	stats.NewGaugesFuncWithMultiLabels(
		"VReplicationLagSeconds",
		"vreplication seconds behind primary per stream",
		[]string{"source_keyspace", "source_shard", "workflow", "counts"},
		func() map[string]int64 {
			st.mu.Lock()
			defer st.mu.Unlock()
			result := make(map[string]int64, len(st.controllers))
			for _, ct := range st.controllers {
				result[ct.source.Keyspace+"."+ct.source.Shard+"."+ct.workflow+"."+fmt.Sprintf("%v", ct.id)] = ct.blpStats.ReplicationLagSeconds.Load()
			}
			return result
		})

I don'ts how it's any different. Using the existing code would eliminate a lot of the code in this PR.

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hey @mattlord the difference is that with the current code, it does not change the format of the data exported to expvars. If we follow your recommendation it will change the shape of the expvars JSON.

I wasn't sure if we wanted to make that change, and, if we do that, whether we should treat it as a breaking change.

If you aren't concerned with changing the shape of the JSON, I'm all for your suggestion! Let me know.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Changing JSON should always be non-breaking. But I don't understand what you mean there. Can you share an example?

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Here are the local changes I made which is how I understood your suggestion:


diff --git a/go/vt/vttablet/tabletmanager/vreplication/stats.go b/go/vt/vttablet/tabletmanager/vreplication/stats.go
index 58c958ce35..b1a5f3250e 100644
--- a/go/vt/vttablet/tabletmanager/vreplication/stats.go
+++ b/go/vt/vttablet/tabletmanager/vreplication/stats.go
@@ -62,19 +62,18 @@ type vrStats struct {
 func (st *vrStats) register() {
        stats.NewGaugeFunc("VReplicationStreamCount", "Number of vreplication streams", st.numControllers)
        stats.NewGaugeFunc("VReplicationLagSecondsMax", "Max vreplication seconds behind primary", st.maxReplicationLagSeconds)
-       stats.NewStringMapFuncWithMultiLabels(
+       stats.NewGaugesFuncWithMultiLabels(
                "VReplicationStreamState",
                "State of vreplication workflow",
-               []string{"workflow", "counts"},
-               "state",
-               func() map[string]string {
+               []string{"workflow", "counts", "state"},
+               func() map[string]int64 {
                        st.mu.Lock()
                        defer st.mu.Unlock()
-                       result := make(map[string]string, len(st.controllers))
+                       result := make(map[string]int64, len(st.controllers))
                        for _, ct := range st.controllers {
                                state := ct.blpStats.State.Load()
                                if state != nil {
-                                       result[ct.workflow+"."+fmt.Sprintf("%v", ct.id)] = state.(string)
+                                       result[ct.workflow+"."+fmt.Sprintf("%v", ct.id)+"."+state.(string)] = 1
                                }
                        }
                        return result

Here is what the expvar stats look like with this change:

"VReplicationStreamState": {
  "commerce2customer.1.Running": 1
},

Which is different from the current JSON shape in main and this PR in its current form:

"VReplicationStreamState": {
  "commerce2customer.1": "Running"
}

Copy link
Copy Markdown
Member

@mattlord mattlord Mar 30, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, I see. A workflow can have N streams and the controller ID is for a given stream. That part makes sense either way. What do we want to do with this metric? Do we want to check the count of ones where the status is e.g. Stopped? If so, I feel like the first output might be easier for that in e.g. promql or in grafana. Maybe not though?

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hey @mattlord the Prometheus output is different from the expvar output format.

The change you recommended above outputs this expvar format:

"VReplicationStreamState": {
  "commerce2customer.1.Running": 1
},

Both this PR and main outputs this expvar format:

"VReplicationStreamState": {
  "commerce2customer.1": "Running"
}

Both this PR and the change you recommended will output the same Prometheus output:

# HELP vttablet_v_replication_stream_state State of vreplication workflow
# TYPE vttablet_v_replication_stream_state gauge
vttablet_v_replication_stream_state{counts="1",state="Running",workflow="commerce2customer"} 1

So with both this PR and the change you recommended, we'll be able to accomplish what we want with Prometheus just as well. The difference between your recommendation and the PR in its current form is that your recommendation will change the shape of expvars output, whereas this PR maintains its current form.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I feel like we might be getting hung up on the specifics of the function when I was merely asking why we couldn't use stats.NewGaugesFuncWithMultiLabels(). It's not clear to me why we can't get the same output you desire and produced in the PR as-is using that existing function. I apologize if I'm being dense and missing something obvious. 🙂

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@mattlord we totally can use stats.NewGaugesFuncWithMultiLabels(), and it will do exactly what I want in terms of giving me nice Prometheus output. However, it will change the shape of the existing expvars JSON, which we may be fine with or we may not want, your call. The shape of the expvars JSON will change because that's just the nature of NewGaugesFuncWithMultiLabels.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks! After chatting on Slack we cleared up my confusion.

@mattlord mattlord removed the NeedsDescriptionUpdate The description is not clear or comprehensive enough, and needs work label Mar 30, 2023
@maxenglander maxenglander requested a review from mattlord March 30, 2023 16:35
Copy link
Copy Markdown
Member

@mattlord mattlord left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

❤️ Thanks, @maxenglander !

Comment on lines +124 to +137
// StringMapFuncWithMultiLabels is a multidimensional string map publisher.
//
// Map keys are compound names made with joining multiple strings with '.',
// and are named by corresponding key labels.
//
// Map values are any string, and are named by the value label.
//
// Since the map is returned by the function, we assume it's in the right
// format (meaning each key is of the form 'aaa.bbb.ccc' with as many elements
// as there are in Labels).
//
// Backends which need to provide a numeric value can set a constant value of 1
// (or whatever is appropriate for the backend) for each key-value pair present
// in the map.
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks! After chatting on Slack we cleared up my confusion.

@mattlord mattlord removed the NeedsWebsiteDocsUpdate What it says label Mar 30, 2023
Copy link
Copy Markdown
Member

@rohit-nayak-ps rohit-nayak-ps left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good, thanks!

Signed-off-by: Max Englander <max@planetscale.com>
@mattlord mattlord merged commit 9387525 into vitessio:main Mar 31, 2023
@mattlord mattlord deleted the maxeng-vrep-prom-metrics-workflow-status branch March 31, 2023 01:53
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Component: Observability Pull requests that touch tracing/metrics/monitoring Component: VReplication Type: Enhancement Logical improvement (somewhere between a bug and feature)

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Feature Request: improved Prometheus metrics for vreplication workflow states

3 participants