
Simplify replication lag selection logic #5251

Merged
rafael merged 1 commit into vitessio:master from bukalapak:replication-lag-selection-logic
Oct 24, 2019


Contributor

@sayap sayap commented Sep 30, 2019

To avoid surprise, the logic is now purely based on the 3 config values, i.e. low lag / high lag / min tablets, without any outlier calculation or special case.

This gets the logic to match with the unhappy description of "will serve traffic only if there are no fully healthy tablets", shown in the vttablet status page.

@sayap sayap requested a review from sougou as a code owner September 30, 2019 06:07
@sayap sayap force-pushed the replication-lag-selection-logic branch 2 times, most recently from 7eb7dac to 015f1ef Compare October 1, 2019 03:59
Contributor

sougou commented Oct 5, 2019

@mpawliszyn @rafael @demmer @tirsen
@sayap has deployed this in their production environment and found this approach more stable. I personally found it much easier to understand and reason about, so I asked him to submit it back to Vitess.

If you don't have objections, we can review this more carefully and get this submitted.

Member

rafael commented Oct 7, 2019

@brirams was actually just looking at this part of the codebase. He can give good input on this.

I think I'm also on board with making this simpler. One high-level comment: if we decide to move forward, we should make this new approach an opt-in behavior via a flag, so it's easier to roll back if we run into regressions.

Collaborator

brirams commented Oct 8, 2019

I agree with the overall sentiment around simplifying this logic and I think this PR does just that. It makes it a lot clearer what you're controlling when setting the two parameters in question (discovery_low_replication_lag and discovery_high_replication_lag_minimum_serving) by strictly evicting tablets that are above that high watermark.

However, my concern is that it also changes the existing contract of FilterByReplicationLag by potentially returning an empty set of tablets when replication lag is too high. In the old way, you'd always have something to route queries to, even when replication lag was too high. This is more a question for @sougou, but is that the type of behavior we want? Whatever the answer, I agree with @rafael that we should be able to opt in to this, maybe by providing a flag that switches between the old and new ways of filtering, and maybe consider deprecating the unsupported one.

Contributor

sougou commented Oct 11, 2019

> However, my concern is that it also changes the existing contract of FilterByReplicationLag by potentially returning an empty set of tablets when replication lag is too high. In the old way, you'd always have something to route queries to, even when replication lag was too high. This is more a question for @sougou, but is that the type of behavior we want?

This is definitely the behavior we want. In fact, at the two-hour replication lag mark, the vttablet has a default setting that would make it non-serving: https://github.com/vitessio/vitess/blob/master/go/vt/vttablet/tabletmanager/healthcheck.go#L219.

@sayap: will you be able to introduce the flag to allow for fallback?

@sayap sayap force-pushed the replication-lag-selection-logic branch 4 times, most recently from 761ff06 to 05441fb Compare October 22, 2019 12:51
Contributor Author

sayap commented Oct 22, 2019

I have added a flag to opt out of the new simplified logic, for users who rely on the existing (legacy) algorithm.

Member

I think this should initially default to true. Otherwise behavior would change for existing users without them opting in.

We can then go through a process of:

  1. Opt in to new behavior.
  2. New behavior becomes default.
  3. Deprecate legacy behavior.

Contributor Author

It now defaults to true.

To avoid surprise, the simplified logic is now purely based on the 3
config values, i.e. low lag / high lag / min tablets, without any
outlier calculation or special case.

Add flag -legacy_replication_lag_algorithm to toggle between the legacy
algorithm and the simplified logic. It defaults to true (i.e. the former)
for now.

Signed-off-by: Yap Sok Ann <sokann@gmail.com>
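
For readers unfamiliar with this kind of toggle, the following is a minimal sketch of a boolean flag that defaults to the legacy path, using Go's standard `flag` package. Only the flag name `legacy_replication_lag_algorithm` and its default of true come from the commit message above; the help text, `parseLegacyFlag`, and `algorithmName` are hypothetical and do not reflect how Vitess wires the flag internally.

```go
package main

import (
	"flag"
	"fmt"
)

// parseLegacyFlag registers the flag on a private FlagSet so the sketch
// can be exercised without touching the process-wide command line.
// The help text is an assumption, not the wording used in Vitess.
func parseLegacyFlag(args []string) (bool, error) {
	fs := flag.NewFlagSet("sketch", flag.ContinueOnError)
	legacy := fs.Bool("legacy_replication_lag_algorithm", true,
		"use the legacy replication lag algorithm")
	if err := fs.Parse(args); err != nil {
		return false, err
	}
	return *legacy, nil
}

// algorithmName is a hypothetical helper that reports which path is chosen.
func algorithmName(legacy bool) string {
	if legacy {
		return "legacy algorithm"
	}
	return "simplified logic"
}

func main() {
	// First run uses the default; second run opts in to the simplified logic.
	for _, args := range [][]string{nil, {"-legacy_replication_lag_algorithm=false"}} {
		legacy, err := parseLegacyFlag(args)
		if err != nil {
			fmt.Println("flag error:", err)
			continue
		}
		fmt.Printf("%v -> %s\n", args, algorithmName(legacy))
	}
}
```

With the default left at true, existing deployments keep the legacy behavior until they explicitly pass `-legacy_replication_lag_algorithm=false`.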
@sayap sayap force-pushed the replication-lag-selection-logic branch from 05441fb to da9bfed Compare October 23, 2019 04:52
Contributor

sougou commented Oct 24, 2019

@rafael This looks ready to go. Can you do a final once-over?

Member

@rafael rafael left a comment

LGTM +1
