Skip to content

Tablet throttler: check replicas via HTTP rather than via MySQL#9334

Merged
shlomi-noach merged 4 commits intovitessio:mainfrom
planetscale:tablet-throttler-http-checks
Dec 8, 2021
Merged

Tablet throttler: check replicas via HTTP rather than via MySQL#9334
shlomi-noach merged 4 commits intovitessio:mainfrom
planetscale:tablet-throttler-http-checks

Conversation

@shlomi-noach
Copy link
Contributor

Description

In this PR we override the existing throttler behavior for shard health checks.
A shard is considered healthy if all replicas are lagging within a certain threshold.

Today, the primary tablet acquires the identities of all shard's tablets, filters the relevant tablets, then records the hostname.port of the MySQL servers of those tablets. It continuously polls these MySQL servers directly for replication lag.

In this PR, the primary tablet no longer accesses those MySQL servers directly (the code to do that is still here, but never used). Instead, it accesses the tablet's HTTP interface for /throttler/check-self. On each replica tablet, that check already reflects the replication lag status of the tablet. So now the primary tablet acquires lag information indirectly via HTTP requests.

Related Issue(s)

Checklist

  • Should this PR be backported?
  • Tests were added or are not required
  • Documentation was added or is not required

Signed-off-by: Shlomi Noach <2607934+shlomi-noach@users.noreply.github.com>
Signed-off-by: Shlomi Noach <2607934+shlomi-noach@users.noreply.github.com>
Signed-off-by: Shlomi Noach <2607934+shlomi-noach@users.noreply.github.com>
mySQLThrottleMetric.ClusterName = clusterName
mySQLThrottleMetric.Key = probe.Key

tabletCheckSelfURL := fmt.Sprintf("http://%s:%d/throttler/check-self?app=vitess", probe.TabletHost, probe.TabletPort)
Copy link
Contributor Author

@shlomi-noach shlomi-noach Dec 7, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can a tablet serve https? Can it expect TLS? I'm sensing this http:// will break sometime.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It could support https but it doesn't today. I don't see ListenAndServeTLS or ServeTLS used anywhere today in the codebase (except for orchestrator). It's a valid point though — at some point I think it's worth doing an "https everywhere" project that at least adds support for using it everywhere and makes the protocol used a variable when we form it like this, which we do in many other places already:

$ grep -R "http://" go/vt | grep -Ev "apache|colorbar|test.go|rice|Ref|stack|blogs|Advert|foo|orc|refman|google"
go/vt/vtgate/executor.go:	resp, err := client.Get(fmt.Sprintf("http://%s/throttler/check?app=vtgate", tabletHostPort))
go/vt/vtgate/status.go:    <td><a href="http://{{$status.Addr}}">{{$status.Name}}</a></td>
go/vt/vttablet/onlineddl/executor.go:	curl --max-time 10 -s 'http://localhost:%d/schema-migration/report-status?uuid=%s&status=%s&dryrun='"$GH_OST_DRY_RUN"'&progress='"$GH_OST_PROGRESS"'&eta='"$GH_OST_ETA_SECONDS"'&rowscopied='"$GH_OST_COPIED_ROWS"
go/vt/vttablet/onlineddl/executor.go:			fmt.Sprintf(`--throttle-http=http://localhost:%d/throttler/check?app=online-ddl:gh-ost:%s&p=low`, *servenv.Port, onlineDDL.UUID),
go/vt/vttablet/onlineddl/executor.go:	  get("http://localhost:{{VTTABLET_PORT}}/schema-migration/report-status?uuid={{MIGRATION_UUID}}&status={{OnlineDDLStatusRunning}}&dryrun={{DRYRUN}}");
go/vt/vttablet/onlineddl/executor.go:	    get("http://localhost:{{VTTABLET_PORT}}/schema-migration/report-status?uuid={{MIGRATION_UUID}}&status={{OnlineDDLStatusComplete}}&dryrun={{DRYRUN}}");
go/vt/vttablet/onlineddl/executor.go:	    get("http://localhost:{{VTTABLET_PORT}}/schema-migration/report-status?uuid={{MIGRATION_UUID}}&status={{OnlineDDLStatusFailed}}&dryrun={{DRYRUN}}");
go/vt/vttablet/onlineddl/executor.go:			if (head("http://localhost:{{VTTABLET_PORT}}/throttler/check?app=online-ddl:pt-osc:{{MIGRATION_UUID}}&p=low")) {
go/vt/vttablet/endtoend/framework/server.go:	ServerAddress = fmt.Sprintf("http://%s", ln.Addr().String())
go/vt/wrangler/keyspace.go:		webURL = fmt.Sprintf("http://%v:%d/", ts.Tablet.Hostname, webPort)
go/vt/wrangler/version.go:	resp, err := http.Get("http://" + tabletAddr + "/debug/vars")
go/vt/vtctld/api.go:		tablet.URL = "http://" + netutil.JoinHostPort(t.Hostname, t.PortMap["vt"])
go/vt/discovery/legacy_healthcheck.go:// http://{{.GetTabletHostPort}} -> http://host.dc.domain:22
go/vt/discovery/healthcheck.go:	TabletURLTemplateString = flag.String("tablet_url_template", "http://{{.GetTabletHostPort}}", "format string describing debug tablet url formatting. See the Go code for getTabletDebugURL() how to customize this.")
go/vt/discovery/tablet_health.go:// http://{{.GetTabletHostPort}} -> http://host.dc.domain:22
go/vt/vitessdriver/doc.go:For more information, visit http://www.vitess.io.
go/vt/vttest/vtprocess.go:	url := fmt.Sprintf("http://%s/debug/vars", addr)
go/vt/vtadmin/README.md:# "http://127.0.0.1:14200".

@shlomi-noach
Copy link
Contributor Author

Assuming this PR is accepted and goes well, we will follow up with another PR to remove existing direct-mysql-access code, and probably refactor a bit as cleanup.

mySQLThrottleMetric.Key = probe.Key

tabletCheckSelfURL := fmt.Sprintf("http://%s:%d/throttler/check-self?app=vitess", probe.TabletHost, probe.TabletPort)
resp, err := throttler.httpClient.Get(tabletCheckSelfURL)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we have a timeout? Maybe interval * 0.5 or something?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

}
mySQLThrottleMetric.Value = checkResult.Value

if checkResult.StatusCode == http.StatusInternalServerError {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would we ever get here? I would think we would have gone into one of the earlier err blocks. Doesn't hurt, but if we do handle http codes we could handle any of the err codes, 4XX and 5XX or even anything not 2XX.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Probably 5XX would be handled earlier and still I'd want to surface that. As for the other error codes, these are app-level codes that are generated by the underlying check. There is no need to propagate the error to the upper level, it's enough that we propagate the Value. In fact, it hurts if we report back an error, where there is no error per-se, just an app-level decision that "this check is a no-go". It's the upper layer, the collector's responsibility to collect all values and decide if the total state is "go" or "no-go".
The introduction of this mid HTTP layer complicates things because we now only have implicit/secondary understanding of the nature of errors on the MySQL side.
As it stands, the current logic passes all the tests correctly, which means it complies with existing behavior.

Copy link
Member

@mattlord mattlord left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Had a couple of minor questions but nothing blocking in any way.

Thank you for working on this!

Copy link
Collaborator

@deepthi deepthi left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was wondering about the performance impact, but convinced myself that it should be fine.

@shlomi-noach
Copy link
Contributor Author

I was wondering about the performance impact, but convinced myself that it should be fine.

Now you made me worried. There is indeed a difference between a persistent MySQL connection and a non-persistent HTTP connection. I am also concerned about the impact. I'm going to increase poll interval from 100ms to 250ms ie reduce frequency by x2.5. Pushing that commit then merging.

Signed-off-by: Shlomi Noach <2607934+shlomi-noach@users.noreply.github.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Component: Cluster management Type: Enhancement Logical improvement (somewhere between a bug and feature)

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants