Skip to content

Added watchdog support for a Multi-Kill threshold.#12108

Merged
mattklein123 merged 9 commits intoenvoyproxy:masterfrom
KBaichoo:watchdog-multikill-threshold
Jul 21, 2020
Merged

Added watchdog support for a Multi-Kill threshold.#12108
mattklein123 merged 9 commits intoenvoyproxy:masterfrom
KBaichoo:watchdog-multikill-threshold

Conversation

@KBaichoo
Copy link
Contributor

Signed-off-by: Kevin Baichoo kbaichoo@google.com

Commit Message: MultiKill threshold support in WatchDog.
Additional Description: WatchDog will now kill if max(2, registered_threads * multi_kill_threshold) threads have gone above the multikill_timeout.
Risk Level: Low, backwards compatible by default
Testing: Implemented Unit tests
Docs Changes: No
Release Notes: None (this is fully backward compatible by default.)
Fixes #11389

Signed-off-by: Kevin Baichoo <kbaichoo@google.com>
@repokitteh-read-only
Copy link

CC @envoyproxy/api-shepherds: Your approval is needed for changes made to api/envoy/.
CC @envoyproxy/api-watchers: FYI only for changes made to api/envoy/.

🐱

Caused by: #12108 was opened by KBaichoo.

see: more, trace.

@KBaichoo
Copy link
Contributor Author

/ review @antoniovicente

Signed-off-by: Kevin Baichoo <kbaichoo@google.com>
time_source_(api.timeSource()), miss_timeout_(config.wdMissTimeout()),
megamiss_timeout_(config.wdMegaMissTimeout()), kill_timeout_(config.wdKillTimeout()),
multi_kill_timeout_(config.wdMultiKillTimeout()),
multi_kill_threshold_(config.wdMultiKillThreshold() / 100.0),
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's unclear why you need to divide by 100 when converting one kind of threshold to the other. Could be clarified a bit by using a name like multi_kill_fraction_ or multi_kill_ratio_ for the member variable, implying it is a value in the range [0.0,1.0] instead of a percentage.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

Thread::LockGuard guard(wd_lock_);

// Compute the multikill threshold
const ssize_t multikill_threshold =
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

multi_kill_threshold and multi_kill_threshold_ mean very different things in this statement.

Consider something like: required_for_multi_kill

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.


// Compute the multikill threshold
const ssize_t multikill_threshold =
std::max(2, static_cast<int>(multi_kill_threshold_ * watched_dogs_.size()));
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should you round up?

Also, I think you may want to use size_t instead of ssize_t, also use size_t in the static_cast.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yep will round up. Done.

/**
* @return double the percentage of threads that needs to meet the MultiKillTimeout before we
* kill the process. If it is zero, then it'll fallback to the default behavior of
* killing the process if two threads hit the multikill timeout.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

comment nit: The number of threads required for multi-kill is always at least 2. This behavior applies beyond just the 0 default case. You may want to refer to the max(2, registered_threads * percentage) computation.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Tried to clarify it a bit more.

google.protobuf.Duration multikill_timeout = 4;

// Sets threshold for multikill timeout in terms of the percentage of
// Watchdogs being nonresponsive for at least the multikill_timeout.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: watchdogs -> threads

percentage of nonresponsive threads required for the multikill_timeout.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

KBaichoo added 3 commits July 16, 2020 18:36
Signed-off-by: Kevin Baichoo <kbaichoo@google.com>
…l-threshold

Signed-off-by: Kevin Baichoo <kbaichoo@google.com>
Signed-off-by: Kevin Baichoo <kbaichoo@google.com>
antoniovicente
antoniovicente previously approved these changes Jul 17, 2020
Copy link
Contributor

@antoniovicente antoniovicente left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the change, and sorry for the slight delays in review.

Signed-off-by: Kevin Baichoo <kbaichoo@google.com>
antoniovicente
antoniovicente previously approved these changes Jul 17, 2020
@KBaichoo
Copy link
Contributor Author

/ retest

@repokitteh-read-only
Copy link

🐴 hold your horses - no failures detected, yet.

🐱

Caused by: a #12108 (comment) was created by @KBaichoo.

see: more, trace.

KBaichoo added 2 commits July 20, 2020 13:09
Signed-off-by: Kevin Baichoo <kbaichoo@google.com>
Signed-off-by: Kevin Baichoo <kbaichoo@google.com>
@KBaichoo
Copy link
Contributor Author

PTAL @envoyproxy/api-shepherds

@mattklein123 mattklein123 self-assigned this Jul 20, 2020
Signed-off-by: Kevin Baichoo <kbaichoo@google.com>
@htuch
Copy link
Member

htuch commented Jul 20, 2020

/lgtm api

@KBaichoo
Copy link
Contributor Author

PTAL @envoyproxy/maintainers

Copy link
Member

@mattklein123 mattklein123 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks!

@mattklein123 mattklein123 merged commit 7f78581 into envoyproxy:master Jul 21, 2020
KBaichoo added a commit to KBaichoo/envoy that referenced this pull request Jul 30, 2020
WatchDog will now kill if max(2, registered_threads * multi_kill_threshold) threads have gone above the multikill_timeout.

Signed-off-by: Kevin Baichoo <kbaichoo@google.com>
scheler pushed a commit to scheler/envoy that referenced this pull request Aug 4, 2020
WatchDog will now kill if max(2, registered_threads * multi_kill_threshold) threads have gone above the multikill_timeout.

Signed-off-by: Kevin Baichoo <kbaichoo@google.com>
Signed-off-by: scheler <santosh.cheler@appdynamics.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[watchdog] Config options for random delay before kill and percent of threads stuck before multi-kill

4 participants