Limit concurrent creation of healthcheck gRPC connections #15053
vmg merged 7 commits into vitessio:main from
Review Checklist
Hello reviewers! 👋 Please follow this checklist when reviewing this Pull Request.
General
Tests
Documentation
New flags
If a workflow is added or modified:
Backward compatibility
On 2nd thought, |
Codecov Report
Attention: Patch coverage is
Additional details and impacted files
@@            Coverage Diff             @@
##             main   #15053      +/-   ##
==========================================
- Coverage   67.41%   65.44%   -1.98%
==========================================
  Files        1560     1562       +2
  Lines      192752   193677     +925
==========================================
- Hits       129952   126751    -3201
- Misses      62800    66926    +4126
Should a VTGate be cell-bound then to connect to ~10k vttablets?
@harshit-gangal the Wherever possible our |
If other parts of the code could benefit from a similar limit, perhaps this should be moved to a helper/util library or baked into grpcclient? 🤔 cc @deepthi (and anyone else) for thoughts
go/vt/grpcclient/client.go
Outdated
Added to fix data race:
WARNING: DATA RACE
Read at 0x000003e57f70 by goroutine 71:
vitess.io/vitess/go/vt/grpcclient.RegisterGRPCDialOptions()
/Users/tim/github/vitess/go/vt/grpcclient/client.go:89 +0xdd
vitess.io/vitess/go/vt/discovery.(*tabletHealthCheck).connectionLocked()
/Users/tim/github/vitess/go/vt/discovery/tablet_health_check.go:164 +0x68
vitess.io/vitess/go/vt/discovery.(*tabletHealthCheck).Connection()
/Users/tim/github/vitess/go/vt/discovery/tablet_health_check.go:144 +0xb5
vitess.io/vitess/go/vt/discovery.(*tabletHealthCheck).stream()
/Users/tim/github/vitess/go/vt/discovery/tablet_health_check.go:128 +0x44
vitess.io/vitess/go/vt/discovery.(*tabletHealthCheck).checkConn()
/Users/tim/github/vitess/go/vt/discovery/tablet_health_check.go:295 +0x6ee
vitess.io/vitess/go/vt/discovery.(*HealthCheckImpl).AddTablet.func2()
/Users/tim/github/vitess/go/vt/discovery/healthcheck.go:426 +0x44
Previous write at 0x000003e57f70 by goroutine 69:
vitess.io/vitess/go/vt/grpcclient.RegisterGRPCDialOptions()
/Users/tim/github/vitess/go/vt/grpcclient/client.go:89 +0x184
vitess.io/vitess/go/vt/discovery.(*tabletHealthCheck).connectionLocked()
/Users/tim/github/vitess/go/vt/discovery/tablet_health_check.go:164 +0x68
vitess.io/vitess/go/vt/discovery.(*tabletHealthCheck).Connection()
/Users/tim/github/vitess/go/vt/discovery/tablet_health_check.go:144 +0xb5
vitess.io/vitess/go/vt/discovery.(*tabletHealthCheck).stream()
/Users/tim/github/vitess/go/vt/discovery/tablet_health_check.go:128 +0x44
vitess.io/vitess/go/vt/discovery.(*tabletHealthCheck).checkConn()
/Users/tim/github/vitess/go/vt/discovery/tablet_health_check.go:295 +0x6ee
vitess.io/vitess/go/vt/discovery.(*HealthCheckImpl).AddTablet.func2()
/Users/tim/github/vitess/go/vt/discovery/healthcheck.go:426 +0x44
Goroutine 71 (running) created at:
vitess.io/vitess/go/vt/discovery.(*HealthCheckImpl).AddTablet()
/Users/tim/github/vitess/go/vt/discovery/healthcheck.go:426 +0xad5
vitess.io/vitess/go/vt/discovery.TestHealthCheckErrorOnPrimaryAfterExternalReparent()
/Users/tim/github/vitess/go/vt/discovery/healthcheck_test.go:358 +0x971
testing.tRunner()
/usr/local/Cellar/go/1.21.7/libexec/src/testing/testing.go:1595 +0x261
testing.(*T).Run.func1()
/usr/local/Cellar/go/1.21.7/libexec/src/testing/testing.go:1648 +0x44
Goroutine 69 (running) created at:
vitess.io/vitess/go/vt/discovery.(*HealthCheckImpl).AddTablet()
/Users/tim/github/vitess/go/vt/discovery/healthcheck.go:426 +0xad5
vitess.io/vitess/go/vt/discovery.TestHealthCheckErrorOnPrimaryAfterExternalReparent()
/Users/tim/github/vitess/go/vt/discovery/healthcheck_test.go:351 +0x5cc
testing.tRunner()
/usr/local/Cellar/go/1.21.7/libexec/src/testing/testing.go:1595 +0x261
testing.(*T).Run.func1()
/usr/local/Cellar/go/1.21.7/libexec/src/testing/testing.go:1648 +0x44
Signed-off-by: Tim Vaillancourt <tim@timvaillancourt.com>
This PR is slowing down the parallel vttablet connections being opened but still has the 10k limit on the thread count. I found that |
harshit-gangal
left a comment
Changes look fine to me. An issue and adding release notes would be good.
@harshit-gangal thanks! vitessio/website#1696 adds this to the 3 x binaries that changed |
@harshit-gangal we considered raising the max threads limit but tried to avoid that in case it allowed a different problem to spiral out of control. It also doesn't scale as easily with growth unless it's cranked to a very high value. We've been using the Related, we've noticed a blip of |
vmg
left a comment
LGTM. There is actually no way in the gRPC options to limit the number of concurrent dials, so the custom semaphore seems fine.
@timvaillancourt I don't think you need to update the website directly: you're supposed to add this to the release notes in this repository, as part of the PR, and the website will get updated once we release the new Vitess version. |
Thanks @vmg, I'll add to the release notes in this PR 👍 I think the |
@harshit-gangal / @vmg I believe the doc changes are ready 🙇 |
Thanks Tim! LGTM. I'm 99% sure the flags section in the website is auto-generated now. @frouioui will know for sure. Need +1 to merge @harshit-gangal |
@timvaillancourt now it needs a conflict resolution. |
Description
This PR adds a concurrency limit for the number of gRPC connections created by the healthcheck handler of vtgate, in order to avoid hitting `runtime: program exceeds 10000-thread limit` panics.
Related: in #15030 I was able to remove 1 x syscall/stream (for gRPC client creds); however, this is not enough to avoid hitting panics. Now we're seeing raw network syscalls as the main thread-usage offenders:
Hot syscall # 1 (DNS due to new gRPC connection to tablet)
Hot syscall # 2 (same - gRPC connecting to tablet)
Hot syscall # 3
I lost this one unfortunately, but it was a `syscall.Open` for the client TCP connection open.
Related Issue(s)
Load `--grpc_auth_static_client_creds` file once #15030
Checklist
Deployment Notes