Skip to content

vttablet: heartbeat always enabled#6665

Closed
shlomi-noach wants to merge 14 commits intovitessio:masterfrom
planetscale:enable-vttablet-heartbeat
Closed

vttablet: heartbeat always enabled#6665
shlomi-noach wants to merge 14 commits intovitessio:masterfrom
planetscale:enable-vttablet-heartbeat

Conversation

@shlomi-noach
Copy link
Copy Markdown
Contributor

@shlomi-noach shlomi-noach commented Sep 3, 2020

Closes #6662

  • -heartbeat_enable flag now always assumed to be enabled
  • -heartbeat_interval is now always set to some value in the range (0..1] seconds, default 250ms

-enable_heartbeat flag now always assumed to be enabled
-heartbeat_interval is now always set to some value in the range (0..1] seconds, default 500ms

Signed-off-by: Shlomi Noach <2607934+shlomi-noach@users.noreply.github.com>
@shlomi-noach shlomi-noach requested a review from sougou as a code owner September 3, 2020 05:37
Signed-off-by: Shlomi Noach <2607934+shlomi-noach@users.noreply.github.com>
Signed-off-by: Shlomi Noach <2607934+shlomi-noach@users.noreply.github.com>
@shlomi-noach shlomi-noach mentioned this pull request Sep 9, 2020
7 tasks
@shlomi-noach
Copy link
Copy Markdown
Contributor Author

#6668 depends on this.

@shlomi-noach
Copy link
Copy Markdown
Contributor Author

ping, request for review 🙏

#6668 depends on this.

Signed-off-by: Shlomi Noach <2607934+shlomi-noach@users.noreply.github.com>
…eartbeat

Signed-off-by: Shlomi Noach <2607934+shlomi-noach@users.noreply.github.com>
@shlomi-noach
Copy link
Copy Markdown
Contributor Author

shlomi-noach commented Sep 29, 2020

I believe I understand why endtoend tests are failing, but am going to need assistance, because this touches implicit expectations in Tracker code.

The problem is that TestSchemaVersioning disables tracking upon test startup, then enables it again on end:

tsv.SetTracking(false)

Quote:

// Let's disable the already running tracker to prevent it from
// picking events from the previous test, and then re-enable it at the end.

However, we enabled heartbeat to always run, and heartbeat supersedes the tracker. Since heartbeat still runs, the test gets events from the previous test.

I could, potentially, add a Enable() and Disable() -like functionality to the heartbeat writer, such that tests can control that, too. I'm just not sure if there's deeper implications to the fact that Tracker cannot control its own workflow; or is it just an issue with the test itself. I'm not familiar with the Tracker code, therefore seeking assistance.

Signed-off-by: Shlomi Noach <2607934+shlomi-noach@users.noreply.github.com>
Signed-off-by: Shlomi Noach <2607934+shlomi-noach@users.noreply.github.com>
@shlomi-noach
Copy link
Copy Markdown
Contributor Author

57d5fe3 addresses the above comment about test failures.

Signed-off-by: Shlomi Noach <2607934+shlomi-noach@users.noreply.github.com>
@dweitzman
Copy link
Copy Markdown
Collaborator

Just checking in, since I haven't been following the recent code changes: it used to be the case that the same command line flag controlled both whether to write heartbeats and whether to disable a tablet as unhealthy if no heartbeat lag could be determined, which created a messy rollout problem where you couldn't have -enable_heartbeat on a replica unless it was already on for the master.

Has that been resolved by recent refactorings so that -enable_heartbeat doesn't immediately force on the heartbeat-based tablet health reporter?

@deepthi
Copy link
Copy Markdown
Collaborator

deepthi commented Sep 29, 2020

Just checking in, since I haven't been following the recent code changes: it used to be the case that the same command line flag controlled both whether to write heartbeats and whether to disable a tablet as unhealthy if no heartbeat lag could be determined, which created a messy rollout problem where you couldn't have -enable_heartbeat on a replica unless it was already on for the master.

Has that been resolved by recent refactorings so that -enable_heartbeat doesn't immediately force on the heartbeat-based tablet health reporter?

That is an excellent point. I believe it is still the case that restarting a replica with -enable_heartbeat will immediately make it non-serving unless the master has it set. We'll need to come up with a way to solve this.

Signed-off-by: Shlomi Noach <2607934+shlomi-noach@users.noreply.github.com>
Signed-off-by: Shlomi Noach <2607934+shlomi-noach@users.noreply.github.com>
Signed-off-by: Shlomi Noach <2607934+shlomi-noach@users.noreply.github.com>
Signed-off-by: Shlomi Noach <2607934+shlomi-noach@users.noreply.github.com>
Signed-off-by: Shlomi Noach <2607934+shlomi-noach@users.noreply.github.com>
Signed-off-by: Shlomi Noach <2607934+shlomi-noach@users.noreply.github.com>
@shlomi-noach
Copy link
Copy Markdown
Contributor Author

I was thinking maybe 00100d4, as a temporary experiment, would fix the failing tests -- seemed it could be related. It did not. Tests fail on same issue. Reverted that commit.

The tests fail like so:

2020-09-30T12:21:37.3916243Z --- FAIL: TestRecovery (112.01s)
2020-09-30T12:21:37.3918764Z     recovery.go:278: 
2020-09-30T12:21:37.3922735Z         	Error Trace:	recovery.go:278
2020-09-30T12:21:37.3923626Z         	Error:      	Received unexpected error:
2020-09-30T12:21:37.3925213Z         	            	wait for recovery_ks1.0.replica failed
2020-09-30T12:21:37.3925981Z         	Test:       	TestRecovery
2020-09-30T12:21:37.3927848Z     recovery.go:280: 
2020-09-30T12:21:37.3928611Z         	Error Trace:	recovery.go:280
2020-09-30T12:21:37.3929728Z         	Error:      	Received unexpected error:
2020-09-30T12:21:37.3930582Z         	            	wait for recovery_ks2.0.replica failed
2020-09-30T12:21:37.3931069Z         	Test:       	TestRecovery
2020-09-30T12:21:37.3931649Z     recovery_util.go:48: 
2020-09-30T12:21:37.3932052Z         	Error Trace:	recovery_util.go:48
2020-09-30T12:21:37.3932688Z         	            				recovery.go:292
2020-09-30T12:21:37.3933280Z         	Error:      	Expected nil, but got: Code: UNAVAILABLE
2020-09-30T12:21:37.3934814Z         	            	vtgate: http://fv-az151.internal.cloudapp.net:16419/: target: recovery_ks1.0.replica: no valid tablet
2020-09-30T12:21:37.3935509Z         	Test:       	TestRecovery
2020-09-30T12:21:37.3935756Z FAIL
2020-09-30T12:21:37.5415886Z FAIL	vitess.io/vitess/go/test/endtoend/recovery/unshardedrecovery	125.254s
2020-09-30T12:21:37.5494021Z FAIL

It seems like vtgate's /debug/vars does not list the restored replica under HealthcheckConnections. I was thinking maybe the aforementioned healthcheck, combined with @dweitzman 's comment, are the source of the problem.

I have no further ideas at this time. Been at this for a couple days now with no progress.

@deepthi
Copy link
Copy Markdown
Collaborator

deepthi commented Sep 30, 2020

That is an excellent point. I believe it is still the case that restarting a replica with -enable_heartbeat will immediately make it non-serving unless the master has it set. We'll need to come up with a way to solve this.

Here's how we could solve the rollout issue with enabling heartbeats.

  • change lag/serving computation in tabletserver (ReplicationTracker) to use both methods of computing lag if both flags are set and report the min
  • deployments that are currently using the polling method (-enable_replication_reporter) can leave that on and add -enable_heartbeat
  • once heartbeat has been enabled on all tablets, a second rollout can disable the polling method

As far as throttler goes (#6668), we can have it implicitly enable heartbeat rather than changing the default.

@shlomi-noach
Copy link
Copy Markdown
Contributor Author

The throttling PR, #6668, now implicitly enables heartbeat, without changing the config. The tests pass and behavior seems correct.

This PR is now a non-blocker.

@deepthi
Copy link
Copy Markdown
Collaborator

deepthi commented Oct 5, 2020

@shlomi-noach I think we decided not to make this the default. Can we close this PR and mark the issue as "Won't Fix"?

@shlomi-noach
Copy link
Copy Markdown
Contributor Author

This will not be implemented at this time. See #6665 (comment)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Suggestion: always enable heartbeat on vttablet

3 participants