upstream: handle health check fail after removal #6765

Merged
mattklein123 merged 8 commits into master from fix_eds_race
May 1, 2019

@mattklein123
Member

When using active health checking, hosts are not removed from
dynamic clusters if they are still passing health checks. This
creates a situation in which hosts might not be removed for a
very long time if the sequence is reversed: removal from discovery
first, followed by a health check failure. This change handles that
second case so that any time a host is both removed AND failing
active health checking, in any order, it will be removed.

This has been an issue "forever" but is more obvious when using
streaming EDS or DNS with a very long polling interval.

Fixes #6625

Signed-off-by: Matt Klein <mklein@lyft.com>

Risk Level: Medium/High. Scary stuff.
Testing: New unit tests.
Docs Changes: N/A
Release Notes: N/A
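
The "removed AND failing, in any order" rule described above can be illustrated with a small model. This is only a sketch in Python, not Envoy's C++ implementation: the `Host` and `Cluster` classes, the `pending_removal` and `failing_active_hc` flags, and the method names are all hypothetical names chosen for the example. It also assumes a newly added host starts out as passing.

```python
from dataclasses import dataclass, field


@dataclass
class Host:
    address: str
    failing_active_hc: bool = False  # hypothetical flag: last active health check failed
    pending_removal: bool = False    # hypothetical flag: discovery no longer lists the host


@dataclass
class Cluster:
    hosts: dict = field(default_factory=dict)

    def on_discovery_update(self, addresses):
        """Apply a new endpoint list from service discovery (for example EDS)."""
        wanted = set(addresses)
        for address in wanted:
            if address in self.hosts:
                # The host is listed again, so it is no longer pending removal.
                self.hosts[address].pending_removal = False
            else:
                self.hosts[address] = Host(address)
        # Iterate over a copy because _maybe_remove may delete entries.
        for host in list(self.hosts.values()):
            if host.address not in wanted:
                host.pending_removal = True
                self._maybe_remove(host)

    def on_health_check(self, address, passed):
        """Record an active health check result for a host."""
        host = self.hosts.get(address)
        if host is None:
            return
        host.failing_active_hc = not passed
        self._maybe_remove(host)

    def _maybe_remove(self, host):
        # The same check runs on both event paths, so the order of
        # "removed from discovery" and "failed health check" does not matter.
        if host.pending_removal and host.failing_active_hc:
            del self.hosts[host.address]
```

In this model, a host that fails a health check and is later dropped from discovery is deleted on the discovery update, while a host that is dropped from discovery first stays in `hosts` until its next failed health check. The second path is the one this change adds.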

@mattklein123
Member Author

mattklein123 commented May 1, 2019

@snowp I'm going to take a fresh pass on this tomorrow and add some more tests and see if I can figure out a better solution for the all_hosts_ issue we discussed offline. Can you take a first look and let me know if you have any initial comments or things I should actively look to test? Thank you.

@mattklein123
Member Author

cc @lita

Signed-off-by: Matt Klein <mklein@lyft.com>
Signed-off-by: Matt Klein <mklein@lyft.com>
Contributor

@snowp snowp left a comment

Seems good to me modulo the strict DNS issue

Signed-off-by: Matt Klein <mklein@lyft.com>
Signed-off-by: Matt Klein <mklein@lyft.com>
Signed-off-by: Matt Klein <mklein@lyft.com>
Signed-off-by: Matt Klein <mklein@lyft.com>
@mattklein123
Member Author

@snowp updated to support only EDS and add better tests. I think this is a better solution for now. PTAL.

Contributor

@snowp snowp left a comment

LGTM, just one minor comment

Signed-off-by: Matt Klein <mklein@lyft.com>
@mattklein123
Member Author

@snowp thanks updated

Contributor

@snowp snowp left a comment

LGTM

@mattklein123 mattklein123 merged commit 41eefff into master May 1, 2019
@mattklein123 mattklein123 deleted the fix_eds_race branch May 1, 2019 22:00
jeffpiazza-google pushed a commit to jeffpiazza-google/envoy that referenced this pull request May 3, 2019
Fixes envoyproxy#6625

Signed-off-by: Matt Klein <mklein@lyft.com>
Signed-off-by: Jeff Piazza <jeffpiazza@google.com>
mattklein123 added a commit that referenced this pull request May 5, 2019
This fixes a regression from #6765 due to not handling recursive
deletion inside of a failure callback.

Fixes #6806

Signed-off-by: Matt Klein <mklein@lyft.com>
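
The commit message does not say how the regression was fixed, so the following is only an illustration of the hazard class it names: a failure callback that deletes the very host whose failure is being reported. One common way to make that safe is to defer the deletion until the callbacks have returned. The `HealthChecker` class and all of its members are hypothetical names for this Python sketch and are not taken from Envoy.

```python
class HealthChecker:
    """Toy health checker that tolerates host removal from inside a failure callback."""

    def __init__(self):
        self.sessions = {}            # address -> per-host health check state
        self.failure_callbacks = []   # callables invoked with the failing address
        self._in_callback = False
        self._deferred = []           # removals requested while callbacks are running

    def add_host(self, address):
        self.sessions[address] = {"failures": 0}

    def remove_host(self, address):
        if self._in_callback:
            # A callback asked for removal while the session is still in use:
            # remember the request and apply it once the callbacks are done.
            self._deferred.append(address)
            return
        self.sessions.pop(address, None)

    def report_failure(self, address):
        session = self.sessions[address]
        session["failures"] += 1
        self._in_callback = True
        try:
            for callback in self.failure_callbacks:
                callback(address)
        finally:
            self._in_callback = False
        # The session is no longer in use here, so deferred removals are safe.
        deferred, self._deferred = self._deferred, []
        for deferred_address in deferred:
            self.remove_host(deferred_address)
```

With this structure, a callback that calls `remove_host` for the failing address still sees the session while it runs, and the entry disappears only after `report_failure` has finished invoking callbacks.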
mattklein123 added a commit that referenced this pull request May 6, 2019
This fixes a regression from #6765 due to not handling recursive
deletion inside of a failure callback.

Fixes #6806

Signed-off-by: Matt Klein <mklein@lyft.com>
mattklein123 added a commit that referenced this pull request May 14, 2019
If we inline delete a host during a failure callback we need to
account for the connection being cleaned up prior to handling
'connection: close' headers.

Signed-off-by: Matt Klein <mklein@lyft.com>
mattklein123 added a commit that referenced this pull request May 15, 2019
If we inline delete a host during a failure callback we need to
account for the connection being cleaned up prior to handling
'connection: close' headers.

Signed-off-by: Matt Klein <mklein@lyft.com>

Development

Successfully merging this pull request may close these issues.

cluster membership race when streaming eds data and active health checking

2 participants