Auto-set 5s on-demand heartbeat if --enable_heartbeat is disabled #15099
shlomi-noach wants to merge 4 commits into vitessio:main from
Signed-off-by: Shlomi Noach <2607934+shlomi-noach@users.noreply.github.com>
Codecov Report
Additional details and impacted files
@@            Coverage Diff             @@
##             main   #15099      +/-   ##
==========================================
+ Coverage   47.29%   47.72%   +0.42%
==========================================
  Files        1137     1155      +18
  Lines      238684   240275    +1591
==========================================
+ Hits       112895   114673    +1778
+ Misses     117168   116999     -169
+ Partials     8621     8603      -18
View full report in Codecov by Sentry.
The description seems geared towards making heartbeats available in the examples. With that context in mind, I think changing a default value (and the behavior) of Vitess to meet a need of the examples adds risk. A dynamic default also adds complexity that becomes difficult to reason about when someone encounters behavior they can't explain.
I'm not sure the added risk and complexity are worth it to satisfy the examples, if we can satisfy that need just as well by setting an explicit flag in the examples config.
@maxenglander I see your point and it is valid. Still, some further thoughts:
Yes and no. As I went back and forth with the configuration, I realized the heartbeat configuration emerged organically but ended up in a bit of a confusing state. Like the fact it takes either of two flags to make heartbeats possible, or the fact that one flag (
A different approach could be that, upon activation, the throttler could programmatically enforce the activation of heartbeats.
  fs.DurationVar(&heartbeatInterval, "heartbeat_interval", 1*time.Second, "How frequently to read and write replication heartbeat.")
- fs.DurationVar(&heartbeatOnDemandDuration, "heartbeat_on_demand_duration", 0, "If non-zero, heartbeats are only written upon consumer request, and only run for up to given duration following the request. Frequent requests can keep the heartbeat running consistently; when requests are infrequent heartbeat may completely stop between requests")
+ fs.DurationVar(&heartbeatOnDemandDuration, "heartbeat_on_demand_duration", 0, "If non-zero, heartbeats are only written upon consumer request, and only run for up to given duration following the request. Frequent requests can keep the heartbeat running consistently. Automatically set to 5s when --heartbeat_enable is unset.")
Should we make the two flags mutually exclusive in pflag? That might simplify the logic.
That's a great idea, irrespective of this PR! I only wonder if it's going to break someone's existing setup?
Yeah, it might... 😢 We could potentially add it to the 19.0.0 changelog as a breaking change though.
Maybe we should take a documentation driven approach here. How would we document the new/proposed configuration here? https://vitess.io/docs/19.0/reference/features/tablet-throttler/#configuration
- IMO we should remove all of the old tablet level config parts/noise there in the v19 docs
- That seems a little misleading now, as it says "Enabling the lag throttler also automatically enables heartbeat injection." and it does not note that you must set --heartbeat_enable=false (which I think was true before, given the noted issues reported with the examples) and --heartbeat_on_demand_duration to a non-zero value, although it does describe what it is and recommend values for it
In the meantime I'll open a new PR adding the flag in
Followup in #15204.
Meanwhile converting this to
Please see this followup issue: #15303
Closing for now. Follow #15303 for related work.
Description
This work started with #14978 and #14980. And while the discussion was about the examples suite, it applies to all environments.
After #14980 was merged, we realized examples could no longer facilitate the tablet throttler with its default setup. That's because the throttler, by default, relies on replication lag calculated from _vt.heartbeat.
And while the throttler can be enabled/disabled dynamically, the heartbeat configuration is set on startup.
Following #14980 we asked ourselves: should the throttler be available for operation in examples? And we agreed that it should, same as any other component in Vitess. To that effect, the throttler should be able to read valid heartbeat information.
There are multiple ways to go about it, and some considerations are:
- examples are expected to run on local laptops or otherwise small hosts.
- examples should not generate any excessive load.
- … /debug/status page reported lag due to stale heartbeats, even though there was no actual lag.
The solution in this PR is simple: if --heartbeat_enable is set, great, heartbeats are on, and there is nothing else to do. But if it is unset, and heartbeat_on_demand_duration is unset as well, then we set heartbeat_on_demand_duration to 5s. Meaning, there has to be some sort of heartbeat.
As a reminder, the rules for heartbeat_on_demand_duration are:
- … Open().
- … 5s, upon request.
- … vreplication) actively asks it for throttling feedback.
In short, with this new 5s on-demand heartbeat default value, the throttler will be able to work correctly in examples and in any other Vitess setup that is otherwise not configured for heartbeats. Heartbeats will only be generated while the throttler is both enabled and actively engaged with some vreplication workflow, Online DDL operation, etc.
And, back to examples: because heartbeat_enable is unset, heartbeatReader is not enabled, and as a result /debug/status reports replication lag based on MySQL replication as opposed to _vt.heartbeat. So this PR still respects #14978.
Related Issue(s)
Backporting
Like #15099, this should be backported to 18.0 and 17.0, and for the same reasons.
Checklist
Deployment Notes