Added watchdog support for a Multi-Kill threshold. #12108
mattklein123 merged 9 commits into envoyproxy:master
Signed-off-by: Kevin Baichoo <kbaichoo@google.com>
/review @antoniovicente
source/server/guarddog_impl.cc
    time_source_(api.timeSource()), miss_timeout_(config.wdMissTimeout()),
    megamiss_timeout_(config.wdMegaMissTimeout()), kill_timeout_(config.wdKillTimeout()),
    multi_kill_timeout_(config.wdMultiKillTimeout()),
    multi_kill_threshold_(config.wdMultiKillThreshold() / 100.0),
It's unclear why you need to divide by 100 when converting one kind of threshold to the other. Could be clarified a bit by using a name like multi_kill_fraction_ or multi_kill_ratio_ for the member variable, implying it is a value in the range [0.0,1.0] instead of a percentage.
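To illustrate the naming suggestion, here is a minimal sketch rather than the code from this PR. `percentToFraction` and `MultiKillSettings` are hypothetical names introduced only for this example; the point is that the conversion happens once, in a named helper, and the member name states that it holds a value in [0.0, 1.0].

```cpp
#include <cassert>

// Hypothetical helper (not part of the PR): converts a percentage in
// [0, 100] into a fraction in [0.0, 1.0].
inline double percentToFraction(double percent) {
  assert(percent >= 0.0 && percent <= 100.0);
  return percent / 100.0;
}

// Hypothetical stand-in for the member under review. The name says
// "fraction", so readers do not have to guess why the value was divided by 100.
struct MultiKillSettings {
  explicit MultiKillSettings(double threshold_percent)
      : multi_kill_fraction_(percentToFraction(threshold_percent)) {}

  const double multi_kill_fraction_; // Always in [0.0, 1.0].
};
```

With this shape, the initializer list would read `multi_kill_fraction_(percentToFraction(config.wdMultiKillThreshold()))`, assuming `wdMultiKillThreshold()` returns a percentage as the `/ 100.0` in the diff implies.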
source/server/guarddog_impl.cc
    Thread::LockGuard guard(wd_lock_);

    // Compute the multikill threshold
    const ssize_t multikill_threshold =
multikill_threshold and multi_kill_threshold_ mean very different things in this statement.
Consider a name like required_for_multi_kill for the local variable.
source/server/guarddog_impl.cc
    // Compute the multikill threshold
    const ssize_t multikill_threshold =
        std::max(2, static_cast<int>(multi_kill_threshold_ * watched_dogs_.size()));
Should you round up?
Also, I think you may want to use size_t instead of ssize_t, also use size_t in the static_cast.
Yep will round up. Done.
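For reference, the rounded-up computation with `size_t` could look roughly like the sketch below. This is an illustration of the review suggestion, not the merged code; `requiredForMultiKill` is a hypothetical free function, and it assumes the threshold has already been converted to a fraction in [0.0, 1.0].

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>

// Hypothetical helper (not part of the PR): number of unresponsive threads
// required before a multi-kill fires. Rounds up and never returns less than 2.
inline std::size_t requiredForMultiKill(double multi_kill_fraction,
                                        std::size_t watched_threads) {
  assert(multi_kill_fraction >= 0.0 && multi_kill_fraction <= 1.0);
  // std::ceil rounds up, so 2.5 threads becomes 3 instead of truncating to 2.
  const auto scaled = static_cast<std::size_t>(
      std::ceil(multi_kill_fraction * static_cast<double>(watched_threads)));
  // Explicit template argument avoids mixing int and size_t in std::max.
  return std::max<std::size_t>(2, scaled);
}
```

For example, with 5 watched threads and a fraction of 0.5, truncation would give 2 while rounding up gives 3.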
include/envoy/server/configuration.h
    /**
     * @return double the percentage of threads that needs to meet the MultiKillTimeout before we
     * kill the process. If it is zero, then it'll fallback to the default behavior of
     * killing the process if two threads hit the multikill timeout.
comment nit: The number of threads required for multi-kill is always at least 2. This behavior applies beyond just the 0 default case. You may want to refer to the max(2, registered_threads * percentage) computation.
Tried to clarify it a bit more.
    google.protobuf.Duration multikill_timeout = 4;

    // Sets threshold for multikill timeout in terms of the percentage of
    // Watchdogs being nonresponsive for at least the multikill_timeout.
nit: watchdogs -> threads
percentage of nonresponsive threads required for the multikill_timeout.
…l-threshold Signed-off-by: Kevin Baichoo <kbaichoo@google.com>
antoniovicente left a comment
Thanks for the change, and sorry for the slight delays in review.
/retest
🐴 hold your horses - no failures detected, yet.
PTAL @envoyproxy/api-shepherds
/lgtm api
PTAL @envoyproxy/maintainers
WatchDog will now kill if max(2, registered_threads * multi_kill_threshold) threads have gone above the multikill_timeout. Signed-off-by: Kevin Baichoo <kbaichoo@google.com>
WatchDog will now kill if max(2, registered_threads * multi_kill_threshold) threads have gone above the multikill_timeout. Signed-off-by: Kevin Baichoo <kbaichoo@google.com> Signed-off-by: scheler <santosh.cheler@appdynamics.com>
Signed-off-by: Kevin Baichoo <kbaichoo@google.com>
Commit Message: MultiKill threshold support in WatchDog.
Additional Description: WatchDog will now kill if max(2, registered_threads * multi_kill_threshold) threads have gone above the multikill_timeout.
Risk Level: Low, backwards compatible by default
Testing: Implemented Unit tests
Docs Changes: No
Release Notes: None (this is fully backward compatible by default.)
Fixes #11389
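As a rough illustration of the behavior described above (a sketch under stated assumptions, not the GuardDog implementation): `shouldMultiKill` is a hypothetical function, the per-thread "time since last touch" values are passed in directly instead of being read from watchdogs, and the threshold is assumed to be a fraction in [0.0, 1.0] that is rounded up as discussed in review.

```cpp
#include <algorithm>
#include <chrono>
#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical helper (not part of the PR): returns true when at least
// max(2, ceil(fraction * registered_threads)) threads have gone above the
// multikill timeout.
inline bool shouldMultiKill(
    const std::vector<std::chrono::milliseconds>& time_since_last_touch,
    std::chrono::milliseconds multikill_timeout, double multi_kill_fraction) {
  // Count threads whose last touch is older than the multikill timeout.
  const auto unresponsive = static_cast<std::size_t>(std::count_if(
      time_since_last_touch.begin(), time_since_last_touch.end(),
      [&](std::chrono::milliseconds delta) { return delta > multikill_timeout; }));

  // Required count: the fraction of registered threads, rounded up, at least 2.
  const auto scaled = static_cast<std::size_t>(std::ceil(
      multi_kill_fraction * static_cast<double>(time_since_last_touch.size())));
  const std::size_t required = std::max<std::size_t>(2, scaled);

  return unresponsive >= required;
}
```

With four registered threads of which two are past the timeout, a fraction of 0.0 or 0.5 triggers the kill (2 required), while 0.75 does not (3 required). A threshold of zero therefore keeps the previous two-thread behavior, which is why the change is backwards compatible by default.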