Throttler: Expose Tablet's Config & Leverage to Deflake Tests#12737
Throttler: Expose Tablet's Config & Leverage to Deflake Tests#12737mattlord merged 19 commits intovitessio:mainfrom
Conversation
For example, we stop replication, wait a few seconds, then expect there to be lag. But vtorc could repair replication during that wait and then the lag is gone. Signed-off-by: Matt Lord <mattalord@gmail.com>
Review ChecklistHello reviewers! 👋 Please follow this checklist when reviewing this Pull Request. General
If a new flag is being introduced:
If a workflow is added or modified:
Bug fixes
Non-trivial changes
New/Existing features
Backward compatibility
|
Signed-off-by: Matt Lord <mattalord@gmail.com>
becac96 to
4ebc2fe
Compare
Signed-off-by: Matt Lord <mattalord@gmail.com>
4ebc2fe to
d303444
Compare
go/test/endtoend/tabletmanager/throttler_topo/throttler_test.go
Outdated
Show resolved
Hide resolved
e20fc72 to
73420fa
Compare
Signed-off-by: Matt Lord <mattalord@gmail.com>
73420fa to
c623dc9
Compare
Signed-off-by: Matt Lord <mattalord@gmail.com>
Signed-off-by: Matt Lord <mattalord@gmail.com>
Signed-off-by: Matt Lord <mattalord@gmail.com>
de6ddce to
f7303ba
Compare
Which seemed to revolve around NOT sleeping long enough after starting all the sleep queries. Signed-off-by: Matt Lord <mattalord@gmail.com>
f7303ba to
3a793f8
Compare
Signed-off-by: Matt Lord <mattalord@gmail.com>
shlomi-noach
left a comment
There was a problem hiding this comment.
There's been some changes since my earlier review, so I dismissed it and am submitting this new review. A couple of those changes are a bit unclear to me and I'm grateful if you could answer my inline questions.
| t.Run("enabling throttler with low threshold", func(t *testing.T) { | ||
| _, err := onlineddl.UpdateThrottlerTopoConfig(clusterInstance, true, false, unreasonablyLowThreshold.Seconds(), "", false) | ||
| t.Run("enabling throttler with very low threshold", func(t *testing.T) { | ||
| _, err := throttler.UpdateThrottlerTopoConfig(clusterInstance, true, false, unreasonablyLowThreshold.Seconds(), useDefaultQuery, false) |
| // Wait for the throttler to be enabled everywhere with the new config. | ||
| for _, tablet := range clusterInstance.Keyspaces[0].Shards[0].Vttablets { | ||
| throttler.WaitForThrottlerStatusEnabled(t, tablet, true, &throttler.Config{Query: throttler.DefaultQuery, Threshold: unreasonablyLowThreshold.Seconds()}, throttlerEnabledTimeout) | ||
| } |
| // Wait for the throttler to be disabled everywhere. | ||
| for _, tablet := range clusterInstance.Keyspaces[0].Shards[0].Vttablets { | ||
| throttler.WaitForThrottlerStatusEnabled(t, tablet, false, nil, throttlerEnabledTimeout) | ||
| } |
go/test/endtoend/tabletmanager/throttler_topo/throttler_test.go
Outdated
Show resolved
Hide resolved
Signed-off-by: Matt Lord <mattalord@gmail.com>
And adjust timing Signed-off-by: Matt Lord <mattalord@gmail.com>
Signed-off-by: Matt Lord <mattalord@gmail.com>
| fs.DurationVar(&throttleThreshold, "throttle_threshold", throttleThreshold, "Replication lag threshold for default lag throttling") | ||
| fs.StringVar(&throttleMetricQuery, "throttle_metrics_query", throttleMetricQuery, "Override default heartbeat/lag metric. Use either `SELECT` (must return single row, single value) or `SHOW GLOBAL ... LIKE ...` queries. Set -throttle_metrics_threshold respectively.") | ||
| fs.Float64Var(&throttleMetricThreshold, "throttle_metrics_threshold", throttleMetricThreshold, "Override default throttle threshold, respective to -throttle_metrics_query") | ||
| fs.Float64Var(&throttleMetricThreshold, "throttle_metrics_threshold", throttleMetricThreshold, "Override default throttle threshold, respective to --throttle_metrics_query") |
There was a problem hiding this comment.
not actually relevant to the changes in this PR - what does respective to --throttle_metrics_query mean?
maybe a question more for @shlomi-noach
There was a problem hiding this comment.
It comes hand-in-hand with --throttle_metrics_query. If you define a --throttle_metrics_query, then you should also say what's the --throttle_metrics_threshold at which the throttle will engage. Not sure how to phrase it otherwise.
This is because enabling heartbeats with --heartbeat_enable also results in the replication reporter being enabled: https://github.com/vitessio/vitess/blob/3d9ef871e42bd20a60ec95997c97ecf0694c1e78/go/vt/vttablet/tabletserver/tabletenv/config.go#L235-L237 Signed-off-by: Matt Lord <mattalord@gmail.com>
Signed-off-by: Matt Lord <mattalord@gmail.com>
Signed-off-by: Matt Lord <mattalord@gmail.com>
Signed-off-by: Matt Lord <mattalord@gmail.com>
|
I was unable to backport this Pull Request to the following branches: |
…io#12737) * Flakes: effectively disable vtorc for deterministic behavior For example, we stop replication, wait a few seconds, then expect there to be lag. But vtorc could repair replication during that wait and then the lag is gone. Signed-off-by: Matt Lord <mattalord@gmail.com> * Wait for the throttler to be up and running everywhere Signed-off-by: Matt Lord <mattalord@gmail.com> * Expose tablet's throttler config and leverage to deflake tests Signed-off-by: Matt Lord <mattalord@gmail.com> * Apply various corrections Signed-off-by: Matt Lord <mattalord@gmail.com> * Be more explicit about VTOrc behavior changes Signed-off-by: Matt Lord <mattalord@gmail.com> * Note received throttler response when it is unexpected Signed-off-by: Matt Lord <mattalord@gmail.com> * Fixes from local testing Signed-off-by: Matt Lord <mattalord@gmail.com> * Nits from self review Signed-off-by: Matt Lord <mattalord@gmail.com> * Use assert.Equalf on failed assertions Signed-off-by: Matt Lord <mattalord@gmail.com> * Ummm, duh. Signed-off-by: Matt Lord <mattalord@gmail.com> * Try to get rid of last bit of flakiness Which seemed to revolve around NOT sleeping long enough after starting all the sleep queries. Signed-off-by: Matt Lord <mattalord@gmail.com> * Nits from self review Signed-off-by: Matt Lord <mattalord@gmail.com> * Address review comments Signed-off-by: Matt Lord <mattalord@gmail.com> * Adjust test for behavior and comment it And adjust timing Signed-off-by: Matt Lord <mattalord@gmail.com> * Align both stale hearbeat checks Signed-off-by: Matt Lord <mattalord@gmail.com> * Remove no longer needed flag This is because enabling heartbeats with --heartbeat_enable also results in the replication reporter being enabled: https://github.com/vitessio/vitess/blob/3d9ef871e42bd20a60ec95997c97ecf0694c1e78/go/vt/vttablet/tabletserver/tabletenv/config.go#L235-L237 Signed-off-by: Matt Lord <mattalord@gmail.com> * Correct comment Signed-off-by: Matt Lord <mattalord@gmail.com> * Correct comment part II: electric boogaloo Signed-off-by: Matt Lord <mattalord@gmail.com> * Revert one other minor unnecessary change. Signed-off-by: Matt Lord <mattalord@gmail.com> --------- Signed-off-by: Matt Lord <mattalord@gmail.com>
…e Tests (#12791) * Throttler: Expose Tablet's Config & Leverage to Deflake Tests (#12737) * Flakes: effectively disable vtorc for deterministic behavior For example, we stop replication, wait a few seconds, then expect there to be lag. But vtorc could repair replication during that wait and then the lag is gone. Signed-off-by: Matt Lord <mattalord@gmail.com> * Wait for the throttler to be up and running everywhere Signed-off-by: Matt Lord <mattalord@gmail.com> * Expose tablet's throttler config and leverage to deflake tests Signed-off-by: Matt Lord <mattalord@gmail.com> * Apply various corrections Signed-off-by: Matt Lord <mattalord@gmail.com> * Be more explicit about VTOrc behavior changes Signed-off-by: Matt Lord <mattalord@gmail.com> * Note received throttler response when it is unexpected Signed-off-by: Matt Lord <mattalord@gmail.com> * Fixes from local testing Signed-off-by: Matt Lord <mattalord@gmail.com> * Nits from self review Signed-off-by: Matt Lord <mattalord@gmail.com> * Use assert.Equalf on failed assertions Signed-off-by: Matt Lord <mattalord@gmail.com> * Ummm, duh. Signed-off-by: Matt Lord <mattalord@gmail.com> * Try to get rid of last bit of flakiness Which seemed to revolve around NOT sleeping long enough after starting all the sleep queries. Signed-off-by: Matt Lord <mattalord@gmail.com> * Nits from self review Signed-off-by: Matt Lord <mattalord@gmail.com> * Address review comments Signed-off-by: Matt Lord <mattalord@gmail.com> * Adjust test for behavior and comment it And adjust timing Signed-off-by: Matt Lord <mattalord@gmail.com> * Align both stale hearbeat checks Signed-off-by: Matt Lord <mattalord@gmail.com> * Remove no longer needed flag This is because enabling heartbeats with --heartbeat_enable also results in the replication reporter being enabled: https://github.com/vitessio/vitess/blob/3d9ef871e42bd20a60ec95997c97ecf0694c1e78/go/vt/vttablet/tabletserver/tabletenv/config.go#L235-L237 Signed-off-by: Matt Lord <mattalord@gmail.com> * Correct comment Signed-off-by: Matt Lord <mattalord@gmail.com> * Correct comment part II: electric boogaloo Signed-off-by: Matt Lord <mattalord@gmail.com> * Revert one other minor unnecessary change. Signed-off-by: Matt Lord <mattalord@gmail.com> --------- Signed-off-by: Matt Lord <mattalord@gmail.com> * Post cherry-pick fixup Signed-off-by: Matt Lord <mattalord@gmail.com> --------- Signed-off-by: Matt Lord <mattalord@gmail.com>
Description
The primary thing that this PR does is expose the tablet's current throttler config — query and threshold — via its
/throttler/statushttp endpoint. This allows operators and test creators to confirm and wait for specific configurations to be live on all tablets.The secondary thing that this PR does is disable
vtorcrecoveries in the throttler tests when needed (may be other reasons as yet unknown or that unintentionally arise later) because we stop replication, wait a few seconds, then expect there to be lag. Butvtorccould repair replication during that wait and then the lag is gone. We had previously explicitly disabled the legacy tablet repair by using the--disable_active_reparentsflag, now in this PR we also disable anyvtorcrepairs.Finally, a minor thing that this PR does is enable the tablet throttler configuration via the topo in the local examples. This does not alter the default behavior, but it allows users to test out this new feature without having to modify any files or restart any processes. This will help with local testing of the new feature as we move it towards GA (currently experimental).
The relevant
throttler_topoworkflow now passed 15 times in a row: https://github.com/vitessio/vitess/actions/runs/4557359892?pr=12737The failures now seem due to some odd/unexpected behavior of the throttler itself during test runs. We'll track any bugs down over time -- which will now be much easier with the updated tests.
Related Issue(s)
Checklist