upstream: handle health check fail after removal by mattklein123 · Pull Request #6765 · envoyproxy/envoy

mattklein123 · 2019-05-01T03:16:13Z

When using active health checking, hosts are not removed from
dynamic clusters while they are still passing health checks.
Previously, removal was only handled when the health check failure
came first; if the sequence was reversed (removal followed by health
check failure), the host might not be removed for a very long time.
This change handles the second case so that any time a host is both
removed AND failing active health checking, in any order, it will be
removed.
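The rule described above can be illustrated with a small standalone model. This is only a sketch under stated assumptions: `ClusterModel`, `HostState`, and the method names are hypothetical and are not Envoy's actual classes or the code in this PR. It shows that a host is dropped once it is both removed by discovery and failing active health checking, whichever event arrives first.

```cpp
#include <string>
#include <unordered_map>

// Hypothetical, simplified model for illustration; not Envoy's real types.
struct HostState {
  bool pending_removal = false;       // discovery (e.g. EDS) no longer lists the host
  bool failing_health_check = false;  // the latest active health check failed
};

class ClusterModel {
public:
  void addHost(const std::string& address) { hosts_[address] = HostState{}; }

  // Called when a discovery update no longer contains the host.
  void onDiscoveryRemoval(const std::string& address) {
    auto it = hosts_.find(address);
    if (it == hosts_.end()) {
      return;
    }
    it->second.pending_removal = true;
    maybeRemove(it);
  }

  // Called with the result of each active health check.
  void onHealthCheckResult(const std::string& address, bool healthy) {
    auto it = hosts_.find(address);
    if (it == hosts_.end()) {
      return;
    }
    it->second.failing_health_check = !healthy;
    maybeRemove(it);
  }

  bool contains(const std::string& address) const { return hosts_.count(address) != 0; }

private:
  using HostMap = std::unordered_map<std::string, HostState>;

  // Both conditions are re-evaluated after every event, so order does not matter.
  void maybeRemove(HostMap::iterator it) {
    if (it->second.pending_removal && it->second.failing_health_check) {
      hosts_.erase(it);
    }
  }

  HostMap hosts_;
};
```

In this model a host that is removed by discovery while still healthy stays in the cluster, and it is erased as soon as a later health check fails. The reverse order (failure first, then removal) reaches the same outcome through the same check.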

This has been an issue "forever" but is more obvious when using
streaming EDS or very long polling DNS.

Fixes #6625

Signed-off-by: Matt Klein <mklein@lyft.com>

Risk Level: Medium/High. Scary stuff.
Testing: New unit tests.
Docs Changes: N/A
Release Notes: N/A


mattklein123 · 2019-05-01T03:17:04Z

@snowp I'm going to take a fresh pass on this tomorrow and add some more tests and see if I can figure out a better solution for the all_hosts_ issue we discussed offline. Can you take a first look and let me know if you have any initial comments or things I should actively look to test? Thank you.

mattklein123 · 2019-05-01T03:17:12Z

cc @lita


snowp

Seems good to me modulo the strict DNS issue

source/common/upstream/upstream_impl.cc


mattklein123 · 2019-05-01T18:24:29Z

@snowp updated to support only EDS and add better tests. I think this is a better solution for now. PTAL.

snowp

LGTM, just one minor comment

test/common/upstream/eds_test.cc


mattklein123 · 2019-05-01T20:11:05Z

@snowp thanks updated

snowp

LGTM


This fixes a regression from #6765 due to not handling recursive deletion inside of a failure callback. Fixes #6806 Signed-off-by: Matt Klein <mklein@lyft.com>

If we inline delete a host during a failure callback we need to account for the connection being cleaned up prior to handling 'connection: close' headers. Signed-off-by: Matt Klein <mklein@lyft.com>
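The hazard mentioned in these follow-up commits, deleting a host from inside its own failure callback, can be sketched in isolation. The example below is an assumption-laden illustration, not the fix that landed in Envoy: `TrackedHost`, `FailureNotifier`, and `reportFailure` are hypothetical names. It shows one common way to stay safe, which is to hold a local strong reference to the host before invoking the callback and to avoid reusing any iterator afterwards.

```cpp
#include <functional>
#include <map>
#include <memory>
#include <string>

// Hypothetical types for illustration only; not Envoy's real classes.
struct TrackedHost {
  std::string address;
  bool healthy = true;
};
using TrackedHostPtr = std::shared_ptr<TrackedHost>;

class FailureNotifier {
public:
  std::map<std::string, TrackedHostPtr> hosts;
  std::function<void(const TrackedHostPtr&)> on_failure;

  // Marks the host unhealthy and runs the failure callback.
  // Returns true if the host is still tracked after the callback returns.
  bool reportFailure(const std::string& address) {
    auto it = hosts.find(address);
    if (it == hosts.end()) {
      return false;
    }
    // Local strong reference: the host object outlives the callback even if
    // the callback erases the map entry (the "inline delete" case).
    TrackedHostPtr host = it->second;
    host->healthy = false;
    if (on_failure) {
      on_failure(host);
    }
    // `it` may be invalid now, so look the host up again instead of reusing it.
    return hosts.count(host->address) != 0;
  }
};
```

The same idea applies to any per-host resource touched after the callback: either keep it alive with its own reference, or re-check that it still exists before using it.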

mattklein123 assigned snowp May 1, 2019

mattklein123 added 2 commits April 30, 2019 20:48

Merge remote-tracking branch 'origin/master' into fix_eds_race

b1b2aac

Signed-off-by: Matt Klein <mklein@lyft.com>

fix asan

b3f545c

Signed-off-by: Matt Klein <mklein@lyft.com>

snowp suggested changes May 1, 2019

source/common/upstream/upstream_impl.cc (outdated, resolved)

source/common/upstream/upstream_impl.cc (outdated, resolved)

mattklein123 added 4 commits May 1, 2019 08:59

Merge remote-tracking branch 'origin/master' into fix_eds_race

8379133

Signed-off-by: Matt Klein <mklein@lyft.com>

comments

770f03e

Signed-off-by: Matt Klein <mklein@lyft.com>

Merge remote-tracking branch 'origin/master' into fix_eds_race

7940a04

Signed-off-by: Matt Klein <mklein@lyft.com>

more

86c4952

Signed-off-by: Matt Klein <mklein@lyft.com>

snowp suggested changes May 1, 2019

test/common/upstream/eds_test.cc (resolved)

comment

bcd812a

Signed-off-by: Matt Klein <mklein@lyft.com>

snowp approved these changes May 1, 2019


mattklein123 merged commit 41eefff into master May 1, 2019

mattklein123 deleted the fix_eds_race branch May 1, 2019 22:00

mattklein123 mentioned this pull request May 5, 2019

health check: handle host deletion during failure callback #6813

Merged

bcelenza mentioned this pull request Jun 9, 2020

Question: EDS + active health check endpoint removal behavior #11527

Closed

mattklein123 merged 8 commits into master from fix_eds_race
