Tablet throttler: check replicas via HTTP rather than via MySQL#9334
Tablet throttler: check replicas via HTTP rather than via MySQL#9334shlomi-noach merged 4 commits intovitessio:mainfrom
Conversation
Signed-off-by: Shlomi Noach <2607934+shlomi-noach@users.noreply.github.com>
Signed-off-by: Shlomi Noach <2607934+shlomi-noach@users.noreply.github.com>
Signed-off-by: Shlomi Noach <2607934+shlomi-noach@users.noreply.github.com>
| mySQLThrottleMetric.ClusterName = clusterName | ||
| mySQLThrottleMetric.Key = probe.Key | ||
|
|
||
| tabletCheckSelfURL := fmt.Sprintf("http://%s:%d/throttler/check-self?app=vitess", probe.TabletHost, probe.TabletPort) |
There was a problem hiding this comment.
Can a tablet serve https? Can it expect TLS? I'm sensing this http:// will break sometime.
There was a problem hiding this comment.
It could support https but it doesn't today. I don't see ListenAndServeTLS or ServeTLS used anywhere today in the codebase (except for orchestrator). It's a valid point though — at some point I think it's worth doing an "https everywhere" project that at least adds support for using it everywhere and makes the protocol used a variable when we form it like this, which we do in many other places already:
$ grep -R "http://" go/vt | grep -Ev "apache|colorbar|test.go|rice|Ref|stack|blogs|Advert|foo|orc|refman|google"
go/vt/vtgate/executor.go: resp, err := client.Get(fmt.Sprintf("http://%s/throttler/check?app=vtgate", tabletHostPort))
go/vt/vtgate/status.go: <td><a href="http://{{$status.Addr}}">{{$status.Name}}</a></td>
go/vt/vttablet/onlineddl/executor.go: curl --max-time 10 -s 'http://localhost:%d/schema-migration/report-status?uuid=%s&status=%s&dryrun='"$GH_OST_DRY_RUN"'&progress='"$GH_OST_PROGRESS"'&eta='"$GH_OST_ETA_SECONDS"'&rowscopied='"$GH_OST_COPIED_ROWS"
go/vt/vttablet/onlineddl/executor.go: fmt.Sprintf(`--throttle-http=http://localhost:%d/throttler/check?app=online-ddl:gh-ost:%s&p=low`, *servenv.Port, onlineDDL.UUID),
go/vt/vttablet/onlineddl/executor.go: get("http://localhost:{{VTTABLET_PORT}}/schema-migration/report-status?uuid={{MIGRATION_UUID}}&status={{OnlineDDLStatusRunning}}&dryrun={{DRYRUN}}");
go/vt/vttablet/onlineddl/executor.go: get("http://localhost:{{VTTABLET_PORT}}/schema-migration/report-status?uuid={{MIGRATION_UUID}}&status={{OnlineDDLStatusComplete}}&dryrun={{DRYRUN}}");
go/vt/vttablet/onlineddl/executor.go: get("http://localhost:{{VTTABLET_PORT}}/schema-migration/report-status?uuid={{MIGRATION_UUID}}&status={{OnlineDDLStatusFailed}}&dryrun={{DRYRUN}}");
go/vt/vttablet/onlineddl/executor.go: if (head("http://localhost:{{VTTABLET_PORT}}/throttler/check?app=online-ddl:pt-osc:{{MIGRATION_UUID}}&p=low")) {
go/vt/vttablet/endtoend/framework/server.go: ServerAddress = fmt.Sprintf("http://%s", ln.Addr().String())
go/vt/wrangler/keyspace.go: webURL = fmt.Sprintf("http://%v:%d/", ts.Tablet.Hostname, webPort)
go/vt/wrangler/version.go: resp, err := http.Get("http://" + tabletAddr + "/debug/vars")
go/vt/vtctld/api.go: tablet.URL = "http://" + netutil.JoinHostPort(t.Hostname, t.PortMap["vt"])
go/vt/discovery/legacy_healthcheck.go:// http://{{.GetTabletHostPort}} -> http://host.dc.domain:22
go/vt/discovery/healthcheck.go: TabletURLTemplateString = flag.String("tablet_url_template", "http://{{.GetTabletHostPort}}", "format string describing debug tablet url formatting. See the Go code for getTabletDebugURL() how to customize this.")
go/vt/discovery/tablet_health.go:// http://{{.GetTabletHostPort}} -> http://host.dc.domain:22
go/vt/vitessdriver/doc.go:For more information, visit http://www.vitess.io.
go/vt/vttest/vtprocess.go: url := fmt.Sprintf("http://%s/debug/vars", addr)
go/vt/vtadmin/README.md:# "http://127.0.0.1:14200".
|
Assuming this PR is accepted and goes well, we will follow up with another PR to remove existing direct-mysql-access code, and probably refactor a bit as cleanup. |
| mySQLThrottleMetric.Key = probe.Key | ||
|
|
||
| tabletCheckSelfURL := fmt.Sprintf("http://%s:%d/throttler/check-self?app=vitess", probe.TabletHost, probe.TabletPort) | ||
| resp, err := throttler.httpClient.Get(tabletCheckSelfURL) |
There was a problem hiding this comment.
Should we have a timeout? Maybe interval * 0.5 or something?
There was a problem hiding this comment.
| } | ||
| mySQLThrottleMetric.Value = checkResult.Value | ||
|
|
||
| if checkResult.StatusCode == http.StatusInternalServerError { |
There was a problem hiding this comment.
Would we ever get here? I would think we would have gone into one of the earlier err blocks. Doesn't hurt, but if we do handle http codes we could handle any of the err codes, 4XX and 5XX or even anything not 2XX.
There was a problem hiding this comment.
Probably 5XX would be handled earlier and still I'd want to surface that. As for the other error codes, these are app-level codes that are generated by the underlying check. There is no need to propagate the error to the upper level, it's enough that we propagate the Value. In fact, it hurts if we report back an error, where there is no error per-se, just an app-level decision that "this check is a no-go". It's the upper layer, the collector's responsibility to collect all values and decide if the total state is "go" or "no-go".
The introduction of this mid HTTP layer complicates things because we now only have implicit/secondary understanding of the nature of errors on the MySQL side.
As it stands, the current logic passes all the tests correctly, which means it complies with existing behavior.
mattlord
left a comment
There was a problem hiding this comment.
Had a couple of minor questions but nothing blocking in any way.
Thank you for working on this!
deepthi
left a comment
There was a problem hiding this comment.
I was wondering about the performance impact, but convinced myself that it should be fine.
Now you made me worried. There is indeed a difference between a persistent MySQL connection and a non-persistent HTTP connection. I am also concerned about the impact. I'm going to increase poll interval from |
Signed-off-by: Shlomi Noach <2607934+shlomi-noach@users.noreply.github.com>
Description
In this PR we override the existing throttler behavior for shard health checks.
A shard is considered healthy if all replicas are lagging within a certain threshold.
Today, the primary tablet acquires the identities of all shard's tablets, filters the relevant tablets, then records the hostname.port of the MySQL servers of those tablets. It continuously polls these MySQL servers directly for replication lag.
In this PR, the primary tablet no longer accesses those MySQL servers directly (the code to do that is still here, but never used). Instead, it accesses the tablet's HTTP interface for
/throttler/check-self. On each replica tablet, that check already reflects the replication lag status of the tablet. So now the primary tablet acquires lag information indirectly via HTTP requests.Related Issue(s)
Checklist