Simplify replication lag selection logic #5251
7eb7dac to 015f1ef
@mpawliszyn @rafael @demmer @tirsen If you don't have objections, we can review this more carefully and get this submitted.
@brirams was actually just looking at this part of the codebase. He can give good input on this. I think I'm also on board with making this simpler. One high-level comment: if we decide to move forward, we should make this new approach an opt-in behavior via a flag, so it's easier to roll back if we run into regressions.
I agree with the overall sentiment around simplifying this logic, and I think this PR does just that. However, my concern is that it also changes the existing contract of
This is definitely the behavior we want. In fact, at the two-hour replication lag mark, the vttablet has a default setting that would make it non-serving: https://github.com/vitessio/vitess/blob/master/go/vt/vttablet/tabletmanager/healthcheck.go#L219. @sayap: will you be able to introduce the flag to allow for fallback?
761ff06 to 05441fb
I have added a flag to opt out of the new simplified logic, for users who rely on the existing (legacy) algorithm.
go/vt/discovery/replicationlag.go
Outdated
I think this should initially default to true. Otherwise, behavior would change for existing users without them opting in.
We can then go through a process of:
- Opt in to new behavior.
- New behavior becomes default.
- Deprecate legacy behavior.
It now defaults to true.
To avoid surprise, the simplified logic is now purely based on the 3 config values, i.e. low lag / high lag / min tablets, without any outlier calculation or special case.
Add flag -legacy_replication_lag_algorithm to toggle between the legacy algorithm and the simplified logic, defaulting to true (i.e. the former) for now.
Signed-off-by: Yap Sok Ann <sokann@gmail.com>
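For readers following the thread, here is a minimal sketch of how a boolean flag with this name and default could gate the two code paths. Only the flag name and its default of true come from the commit message; the help text and the `chooseAlgorithm` helper are hypothetical and are not taken from the PR's diff.

```go
package main

import (
	"flag"
	"fmt"
)

// The flag name and the default (true) come from the commit message.
// The help text is an assumption made for this sketch.
var legacyReplicationLagAlgorithm = flag.Bool(
	"legacy_replication_lag_algorithm",
	true,
	"use the legacy algorithm when selecting tablets by replication lag",
)

// chooseAlgorithm is a hypothetical helper that reports which code path
// would run for a given flag value.
func chooseAlgorithm(legacy bool) string {
	if legacy {
		return "legacy"
	}
	return "simplified"
}

func main() {
	flag.Parse()
	fmt.Println("replication lag algorithm:", chooseAlgorithm(*legacyReplicationLagAlgorithm))
}
```

With a default of true, existing users keep the legacy behavior unless they pass `-legacy_replication_lag_algorithm=false`, which matches the rollout steps listed in the review comment.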
05441fb to da9bfed
@rafael This looks ready to go. Can you do a final once-over?
To avoid surprise, the logic is now purely based on the 3 config values, i.e. low lag / high lag / min tablets, without any outlier calculation or special case.
This makes the logic match the unhappy description, "will serve traffic only if there are no fully healthy tablets", shown on the vttablet status page.