Skip to content

Add vstream metrics to vtgate#13098

Merged
mattlord merged 3 commits intovitessio:mainfrom
twthorn:vstream-metrics-for-vtgate
May 19, 2023
Merged

Add vstream metrics to vtgate#13098
mattlord merged 3 commits intovitessio:mainfrom
twthorn:vstream-metrics-for-vtgate

Conversation

@twthorn
Copy link
Copy Markdown
Contributor

@twthorn twthorn commented May 16, 2023

Description

Add vstream metrics (number of vstreams, vstream lag) for vtgate.

PR to update docs: vitessio/website#1473

Related Issue(s)

Fixes #13099

Checklist

  • "Backport to:" labels have been added if this change should be back-ported
  • Tests were added or are not required
  • Did the new or modified tests pass consistently locally and on the CI
  • Documentation was added or is not required

Deployment Notes

Signed-off-by: twthorn <thomaswilliamthornton@gmail.com>
@vitess-bot vitess-bot bot added the NeedsDescriptionUpdate The description is not clear or comprehensive enough, and needs work label May 16, 2023
@vitess-bot
Copy link
Copy Markdown
Contributor

vitess-bot bot commented May 16, 2023

Review Checklist

Hello reviewers! 👋 Please follow this checklist when reviewing this Pull Request.

General

  • Ensure that the Pull Request has a descriptive title.
  • If this is a change that users need to know about, please apply the release notes (needs details) label so that merging is blocked unless the summary release notes document is included.
  • If a test is added or modified, there should be a documentation on top of the test to explain what the expected behavior is what the test does.

If a new flag is being introduced:

  • Is it really necessary to add this flag?
  • Flag names should be clear and intuitive (as far as possible)
  • Help text should be descriptive.
  • Flag names should use dashes (-) as word separators rather than underscores (_).

If a workflow is added or modified:

  • Each item in Jobs should be named in order to mark it as required.
  • If the workflow should be required, the maintainer team should be notified.

Bug fixes

  • There should be at least one unit or end-to-end test.
  • The Pull Request description should include a link to an issue that describes the bug.

Non-trivial changes

  • There should be some code comments as to why things are implemented the way they are.

New/Existing features

  • Should be documented, either by modifying the existing documentation or creating new documentation.
  • New features should have a link to a feature request issue or an RFC that documents the use cases, corner cases and test cases.

Backward compatibility

  • Protobuf changes should be wire-compatible.
  • Changes to _vt tables and RPCs need to be backward compatible.
  • vtctl command output order should be stable and awk-able.
  • RPC changes should be compatible with vitess-operator
  • If a flag is removed, then it should also be removed from VTop, if used there.

@vitess-bot vitess-bot bot added NeedsIssue A linked issue is missing for this Pull Request NeedsWebsiteDocsUpdate What it says labels May 16, 2023
@github-actions github-actions bot added this to the v17.0.0 milestone May 16, 2023
@frouioui frouioui added Type: Enhancement Logical improvement (somewhere between a bug and feature) Component: Cluster management and removed NeedsDescriptionUpdate The description is not clear or comprehensive enough, and needs work NeedsWebsiteDocsUpdate What it says NeedsIssue A linked issue is missing for this Pull Request labels May 17, 2023
Copy link
Copy Markdown
Member

@mattlord mattlord left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks, @twthorn ! ❤️

I only had a few minor comments/requests. I'll approve so that you can address those while we wait for the second reviewer.

}
}

func TestVStreamMetrics(t *testing.T) {
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍 👍

@mattlord
Copy link
Copy Markdown
Member

mattlord commented May 17, 2023

@twthorn one other thing... we do need to document these new metrics. Can you please open a website PR here: https://github.com/vitessio/website for that? We would add them to this file: https://vitess.io/docs/17.0/reference/vreplication/metrics/ And we could potentially mention them here: https://vitess.io/docs/17.0/reference/vreplication/vstream/

Once the PR is created please link it in the description here as well. Then the next reviewer will see that this related and required work is also done.

Thanks again!

P.S. Let me know if you need any help with the docs PR.

Signed-off-by: twthorn <thomaswilliamthornton@gmail.com>
@mattlord
Copy link
Copy Markdown
Member

@twthorn it looks like the new unit test may be flaky:

--- FAIL: TestVStreamsCreatedAndLagMetrics (0.00s)
    vstream_manager_test.go:379: 
        	Error Trace:	/home/runner/work/vitess/vitess/go/vt/vtgate/vstream_manager_test.go:379
        	Error:      	Not equal: 
        	            	expected: map[string]int64{"TestVStream.-20.PRIMARY":1, "TestVStream.20-40.PRIMARY":1}
        	            	actual  : map[string]int64{"TestVStream.-20.PRIMARY":4, "TestVStream.20-40.PRIMARY":3, "TestVStreamSkew-0.-20.PRIMARY":1, "TestVStreamSkew-0.20-40.PRIMARY":1, "TestVStreamSkew-1.-20.PRIMARY":1, "TestVStreamSkew-1.20-40.PRIMARY":1, "TestVStreamSkew-2.20-40.PRIMARY":1, "TestVStreamSkew-3.-20.PRIMARY":1}
        	            	
        	            	Diff:
        	            	--- Expected
        	            	+++ Actual
        	            	@@ -1,4 +1,10 @@
        	            	-(map[string]int64) (len=2) {
        	            	- (string) (len=23) "TestVStream.-20.PRIMARY": (int64) 1,
        	            	- (string) (len=25) "TestVStream.20-40.PRIMARY": (int64) 1
        	            	+(map[string]int64) (len=8) {
        	            	+ (string) (len=23) "TestVStream.-20.PRIMARY": (int64) 4,
        	            	+ (string) (len=25) "TestVStream.20-40.PRIMARY": (int64) 3,
        	            	+ (string) (len=29) "TestVStreamSkew-0.-20.PRIMARY": (int64) 1,
        	            	+ (string) (len=31) "TestVStreamSkew-0.20-40.PRIMARY": (int64) 1,
        	            	+ (string) (len=29) "TestVStreamSkew-1.-20.PRIMARY": (int64) 1,
        	            	+ (string) (len=31) "TestVStreamSkew-1.20-40.PRIMARY": (int64) 1,
        	            	+ (string) (len=31) "TestVStreamSkew-2.20-40.PRIMARY": (int64) 1,
        	            	+ (string) (len=29) "TestVStreamSkew-3.-20.PRIMARY": (int64) 1
        	            	 }
        	Test:       	TestVStreamsCreatedAndLagMetrics
        	Messages:   	vstreamsCreated matches
    vstream_manager_test.go:384: 
        	Error Trace:	/home/runner/work/vitess/vitess/go/vt/vtgate/vstream_manager_test.go:384
        	Error:      	Not equal: 
        	            	expected: map[string]int64{"TestVStream.-20.PRIMARY":5, "TestVStream.20-40.PRIMARY":7}
        	            	actual  : map[string]int64{"TestVStream.-20.PRIMARY":5, "TestVStream.20-40.PRIMARY":7, "TestVStreamSkew-0.-20.PRIMARY":2, "TestVStreamSkew-0.20-40.PRIMARY":3, "TestVStreamSkew-1.-20.PRIMARY":2, "TestVStreamSkew-1.20-40.PRIMARY":2, "TestVStreamSkew-2.20-40.PRIMARY":2, "TestVStreamSkew-3.-20.PRIMARY":2}
        	            	
        	            	Diff:
        	            	--- Expected
        	            	+++ Actual
        	            	@@ -1,4 +1,10 @@
        	            	-(map[string]int64) (len=2) {
        	            	+(map[string]int64) (len=8) {
        	            	  (string) (len=23) "TestVStream.-20.PRIMARY": (int64) 5,
        	            	- (string) (len=25) "TestVStream.20-40.PRIMARY": (int64) 7
        	            	+ (string) (len=25) "TestVStream.20-40.PRIMARY": (int64) 7,
        	            	+ (string) (len=29) "TestVStreamSkew-0.-20.PRIMARY": (int64) 2,
        	            	+ (string) (len=31) "TestVStreamSkew-0.20-40.PRIMARY": (int64) 3,
        	            	+ (string) (len=29) "TestVStreamSkew-1.-20.PRIMARY": (int64) 2,
        	            	+ (string) (len=31) "TestVStreamSkew-1.20-40.PRIMARY": (int64) 2,
        	            	+ (string) (len=31) "TestVStreamSkew-2.20-40.PRIMARY": (int64) 2,
        	            	+ (string) (len=29) "TestVStreamSkew-3.-20.PRIMARY": (int64) 2
        	            	 }
        	Test:       	TestVStreamsCreatedAndLagMetrics
        	Messages:   	vstreamsLag matches

It may have been a knock on failure from a previous test though. Does this pass 20 times in a row locally for you?

go test -count 1 -race -timeout 2m -run ^TestVStreamsCreatedAndLagMetrics$ vitess.io/vitess/go/vt/vtgate

That's an example of what I would normally do to try and avoid introducing a new flaky test. Along with re-running the relevant workflow in the CI 5+ times.

@mattlord
Copy link
Copy Markdown
Member

So running the test by itself is fine. But if we run all of them from the package then I can see the data races and failures locally:

go test -count 1 -race -timeout 5m vitess.io/vitess/go/vt/vtgate

The data races, I believe from my local testing, are fixed with this patch if you want to add it here:

index 746bc78333..fd380ddf3b 100644
--- a/go/vt/vtgate/vstream_manager_test.go
+++ b/go/vt/vtgate/vstream_manager_test.go
@@ -1208,7 +1208,8 @@ func startVStream(ctx context.Context, t *testing.T, vsm *vstreamManager, vgtid
 func verifyEvents(t *testing.T, ch <-chan *binlogdatapb.VStreamResponse, wants ...*binlogdatapb.VStreamResponse) {
        t.Helper()
        for i, want := range wants {
-               got := <-ch
+               val := <-ch
+               got := proto.Clone(val).(*binlogdatapb.VStreamResponse)
                require.NotNil(t, got)
                for _, event := range got.Events {
                        event.Timestamp = 0

Then that leaves us with the fact that we seem to be reading unprocessed messages left over in the vstream from the previous Skew test. We'll have to either drain the channel first, or accept/ignore additional values.

Signed-off-by: twthorn <thomaswilliamthornton@gmail.com>
@twthorn
Copy link
Copy Markdown
Contributor Author

twthorn commented May 18, 2023

@mattlord Thank you for the help! Updated with your suggestions

@twthorn twthorn requested a review from mattlord May 18, 2023 22:17
Copy link
Copy Markdown
Member

@mattlord mattlord left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice work on this @twthorn ! Thank you for the contribution ❤️

P.S. I confirmed locally that the unit tests with -race are good now!

@mattlord
Copy link
Copy Markdown
Member

I'm going to override the test requirements and merge, as Run endtoend tests on Cluster (tabletmanager_throttler_custom_config) Expected — Waiting for status to be reported is lost in the weeds (happens sometimes with Actions) and the only way to resolve it is with another push. Since it's unrelated to this PR I'm going to avoid that extra CI round.

@mattlord mattlord merged commit d7f86b3 into vitessio:main May 19, 2023
twthorn added a commit to slackhq/vitess that referenced this pull request May 19, 2023
* Add vstream metrics to vtgate

Signed-off-by: twthorn <thomaswilliamthornton@gmail.com>

* Update unit test name and use cell variable

Signed-off-by: twthorn <thomaswilliamthornton@gmail.com>

* Reset metrics for TestVStreamsCreatedAndLagMetrics, fix data race issue

Signed-off-by: twthorn <thomaswilliamthornton@gmail.com>

---------

Signed-off-by: twthorn <thomaswilliamthornton@gmail.com>
timvaillancourt pushed a commit to slackhq/vitess that referenced this pull request May 20, 2023
* Add vstream metrics to vtgate



* Update unit test name and use cell variable



* Reset metrics for TestVStreamsCreatedAndLagMetrics, fix data race issue



---------

Signed-off-by: twthorn <thomaswilliamthornton@gmail.com>
timvaillancourt pushed a commit to slackhq/vitess that referenced this pull request May 21, 2024
* Add vstream metrics to vtgate

Signed-off-by: twthorn <thomaswilliamthornton@gmail.com>

* Update unit test name and use cell variable

Signed-off-by: twthorn <thomaswilliamthornton@gmail.com>

* Reset metrics for TestVStreamsCreatedAndLagMetrics, fix data race issue

Signed-off-by: twthorn <thomaswilliamthornton@gmail.com>

---------

Signed-off-by: twthorn <thomaswilliamthornton@gmail.com>
timvaillancourt added a commit to slackhq/vitess that referenced this pull request May 22, 2024
* Add `VStreamerCount` stat to `vttablet` (vitessio#11978)

* Add `VStreamersActive` stat to `vttablet`

Signed-off-by: Tim Vaillancourt <tim@timvaillancourt.com>

* Improve desc

Signed-off-by: Tim Vaillancourt <tim@timvaillancourt.com>

* Add PR suggestions

Signed-off-by: Tim Vaillancourt <tim@timvaillancourt.com>

* Move to *stats.Gauge

Signed-off-by: Tim Vaillancourt <tim@timvaillancourt.com>

* Single defer

Signed-off-by: Tim Vaillancourt <tim@timvaillancourt.com>

Signed-off-by: Tim Vaillancourt <tim@timvaillancourt.com>

* Add `Uptime` metric (vitessio#12712)

* Add `Uptime` metric to `vtgate`+`vttablet`

Signed-off-by: Tim Vaillancourt <tim@timvaillancourt.com>

* move to go/vt/servenv/status.go

Signed-off-by: Tim Vaillancourt <tim@timvaillancourt.com>

* Use nanoseconds for uptime

Signed-off-by: Tim Vaillancourt <tim@timvaillancourt.com>

* Move Uptime metrics to servenv.go, remove dupe start time.Time

Signed-off-by: Tim Vaillancourt <tim@timvaillancourt.com>

* Use serverStart time.Time

Signed-off-by: Tim Vaillancourt <tim@timvaillancourt.com>

---------

Signed-off-by: Tim Vaillancourt <tim@timvaillancourt.com>

* Implement the RowsColumnTypeScanType interface in the go sql driver for Vitess in order to get the column types (vitessio#12007)

Signed-off-by: Johan Oskarsson <joskarsson@slack-corp.com>

Signed-off-by: Johan Oskarsson <joskarsson@slack-corp.com>
Co-authored-by: Johan Oskarsson <joskarsson@slack-corp.com>

* Add vstream metrics to vtgate (vitessio#13098)

* Add vstream metrics to vtgate

Signed-off-by: twthorn <thomaswilliamthornton@gmail.com>

* Update unit test name and use cell variable

Signed-off-by: twthorn <thomaswilliamthornton@gmail.com>

* Reset metrics for TestVStreamsCreatedAndLagMetrics, fix data race issue

Signed-off-by: twthorn <thomaswilliamthornton@gmail.com>

---------

Signed-off-by: twthorn <thomaswilliamthornton@gmail.com>

* add lock to flaky TestValidateVersionShard test

Signed-off-by: Tim Vaillancourt <tim@timvaillancourt.com>

* typo

Signed-off-by: Tim Vaillancourt <tim@timvaillancourt.com>

---------

Signed-off-by: Tim Vaillancourt <tim@timvaillancourt.com>
Signed-off-by: Johan Oskarsson <joskarsson@slack-corp.com>
Signed-off-by: twthorn <thomaswilliamthornton@gmail.com>
Co-authored-by: Johan Oskarsson <johan@oskarsson.nu>
Co-authored-by: Johan Oskarsson <joskarsson@slack-corp.com>
Co-authored-by: Thomas Thornton <thomaswilliamthornton@gmail.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Component: Cluster management Type: Enhancement Logical improvement (somewhere between a bug and feature)

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Feature Request: Add metrics to vtgate to determine lag and number of vstreams assigned

4 participants