Added Fatal Action extension point.#13676
Conversation
Signed-off-by: Kevin Baichoo <kbaichoo@google.com>
|
/assign @akonradi |
| core.v3.TypedExtensionConfig config = 1; | ||
|
|
||
| // Whether the action is async-signal-safe. | ||
| bool safe = 2; |
There was a problem hiding this comment.
Please name this more descriptively. "safe" can mean a lot of different things without context.
There was a problem hiding this comment.
Do we even need to put this in the config? Can this be implemented in code, as an interface method on the extension virtual base class?
There was a problem hiding this comment.
Good idea, I'll remove it and just leave it on the interface of the base class.
I think it's a safe assumption to assume clients want safe actions to run before unsafe actions and just go off what the extension implementation class says rather than what this config says.
| /** | ||
| * Whether the action is async-signal-safe. | ||
| */ | ||
| virtual bool isSafe() const PURE; |
There was a problem hiding this comment.
Same, please make the name more descriptive.
test/common/signal/signals_test.cc
Outdated
| auto safe_actions = | ||
| std::make_unique<std::list<const Server::Configuration::FatalActionPtr>>(); | ||
| auto unsafe_actions = | ||
| std::make_unique<std::list<const Server::Configuration::FatalActionPtr>>(); | ||
| FatalErrorHandler::registerFatalActions(std::move(safe_actions), std::move(unsafe_actions), | ||
| nullptr); |
There was a problem hiding this comment.
Can this be done outside the EXPECT_DEATH?
There was a problem hiding this comment.
The issue is that we're dealing with static memory.
I'll set up a test fixture that will have a SetUpTestSuite() class that will register the statics once for these tests.
There was a problem hiding this comment.
SetupTestSuite() and static memory with mocks was a bit bad -- ended up putting a "reset module" function for test use only in fatal_error_handler.cc and narrowed interfaces to prevent the need for mocks.
test/common/signal/signals_test.cc
Outdated
| EXPECT_EQ(raw_safe_action->getNumTimesRan(), 1); | ||
| EXPECT_EQ(raw_unsafe_action->getNumTimesRan(), 1); | ||
| } | ||
|
|
There was a problem hiding this comment.
Is there a way to make runSafeActions return false? If so, please test it. If not, is having a return value meaningful?
There was a problem hiding this comment.
I'll rig up an edge scenario that triggers this -- the value is meaningful if multiple threads are racing to exec that fatal handlers.
Signed-off-by: Kevin Baichoo <kbaichoo@google.com>
source/common/signal/fatal_action.cc
Outdated
| FatalActionManager::FatalActionManager(FatalActionPtrList& safe_actions, | ||
| FatalActionPtrList& unsafe_actions, Server::Instance* server) |
There was a problem hiding this comment.
Just pass these arguments by value instead of reference. They contain unique_ptrs so they aren't copyable, so you'll need to wrap in std::move() at the call site.
There was a problem hiding this comment.
Done, the former ended up working 👍
| if (fatal_action_manager.compare_exchange_strong(unset_manager, mananger.get(), | ||
| std::memory_order_acq_rel)) { |
There was a problem hiding this comment.
How would two concurrent calls progress beyond the register_actions check? I suppose since the access to registered_actions isn't atomic, they could race on the read/write, but that sounds bad, and like a separate bug that needs to be fixed. Alternatively, remove register_actions entirely and rely on this compare-and-swap for correctness.
Signed-off-by: Kevin Baichoo <kbaichoo@google.com>
Signed-off-by: Kevin Baichoo <kbaichoo@google.com>
Signed-off-by: Kevin Baichoo <kbaichoo@google.com>
Signed-off-by: Kevin Baichoo <kbaichoo@google.com>
Signed-off-by: Kevin Baichoo <kbaichoo@google.com>
Signed-off-by: Kevin Baichoo <kbaichoo@google.com>
…ijacked depending on compile options. Signed-off-by: Kevin Baichoo <kbaichoo@google.com>
|
/retest |
|
Retrying Azure Pipelines: |
akonradi
left a comment
There was a problem hiding this comment.
Looking good, mostly nits.
| FatalErrorHandler::registerFatalActions(std::move(safe_actions_), std::move(unsafe_actions_), | ||
| Thread::threadFactoryForTest()); |
There was a problem hiding this comment.
I would expect this double registration to indicate a fatal bug in the code since it doesn't do anything, and that's unexpected. If this call fails, that should be observable at the call site, either by crashing or returning an error value.
There was a problem hiding this comment.
It no longer silently fails; it calls ENVOY_BUG.
…ore granularity when reporting failures, modified signalaction to take advantage of it. Signed-off-by: Kevin Baichoo <kbaichoo@google.com>
…atters. Signed-off-by: Kevin Baichoo <kbaichoo@google.com>
Signed-off-by: Kevin Baichoo <kbaichoo@google.com>
Signed-off-by: Kevin Baichoo <kbaichoo@google.com>
Signed-off-by: Kevin Baichoo <kbaichoo@google.com>
Signed-off-by: Kevin Baichoo <kbaichoo@google.com>
Signed-off-by: Kevin Baichoo <kbaichoo@google.com>
|
PTAL @envoyproxy/api-shepherds |
| // Fatal actions to run while crashing. | ||
| // We will run all safe actions before we run unsafe actions. | ||
| message FatalAction { | ||
| // Extension specific configuration for the action. |
There was a problem hiding this comment.
@envoyproxy/api-shepherds per discussion earlier today, should we have some declaration of the interface that extensions must conform to? The problem with that in general is that this might be different in Envoy/gRPC, but in this specific case, this is a bootstrap and Envoy-only extension point, so I think that would be pretty useful.
There was a problem hiding this comment.
Discussed this a bit out of band with Harvey.
The comments for the extension now point to the interface it's expected to confirm to, and where extensions should live.
Thoughts @envoyproxy/api-shepherds ? Thanks!
Signed-off-by: Kevin Baichoo <kbaichoo@google.com>
Signed-off-by: Kevin Baichoo <kbaichoo@google.com>
|
Thanks for the review! I updated https://www.envoyproxy.io/docs/envoy/latest/extending/extending and the "source/extensions layout" section of https://github.com/envoyproxy/envoy/blob/master/REPO_LAYOUT.md. |
mattklein123
left a comment
There was a problem hiding this comment.
Thanks a few more comments from another pass. Neat!
/wait
| if (fatal_action_manager.compare_exchange_strong(unset_manager, mananger.get(), | ||
| std::memory_order_acq_rel)) { |
There was a problem hiding this comment.
This all runs on the main thread during server init. I'm confused why we need anything special here?
|
|
||
| // Helper function to run fatal actions. | ||
| void runFatalActions(const FatalAction::FatalActionPtrList& actions) { | ||
| FailureFunctionList* list = fatal_error_handlers.exchange(nullptr, std::memory_order_relaxed); |
There was a problem hiding this comment.
Can you add more comments in this function about the atomic operations that we are doing and why we do them?
| return FatalAction::Status::RunningOnAnotherThread; | ||
| } | ||
|
|
||
| FatalAction::Status runUnsafeActions() { |
There was a problem hiding this comment.
This is basically the same function as the one above. Can you have a functor or pass whether to run safe or unsafe to a common function?
There was a problem hiding this comment.
Done, they share a common implementation in an anon. namespace of the module called. Depending on a internal type, we'll run either the safe or unsafe actions.
| default: | ||
| // All the cases runSafeActions() returns have been covered. | ||
| NOT_REACHED_GCOVR_EXCL_LINE; |
There was a problem hiding this comment.
Is this actually needed to compile? Otherwise I would remove.
There was a problem hiding this comment.
Yes, the compiler complains about the enum case that's not handled SafeActionsNotYetRan which isn't returned by runSafeStatus.
I think perhaps exhaustively listing them, instead of having a default like here, would be better as that way new enums added to the class will force the author to update this switch statement vs having it silently compile.
Thoughts?
| } else if (failing_tid == -1) { | ||
| return FatalAction::Status::SafeActionsNotYetRan; |
There was a problem hiding this comment.
How can this happen? More comments?
There was a problem hiding this comment.
It shouldn't happen -- a lot of the module is hardened to try to prevent incorrect uses by giving guide rails to enforce the state machine of the module -- for example only one fatal action manager can be register, we run safe actions before unsafe actions, only one thread should run these actions, among others.
I added a comment about it, lmk your thoughts
| } | ||
|
|
||
| void clearFatalActionsOnTerminate() { | ||
| auto* raw_ptr = fatal_action_manager.exchange(nullptr, std::memory_order_relaxed); |
There was a problem hiding this comment.
Is relaxes here and elsewhere correct? I think you want sequential consistency for this? This is not perf critical so I would recommend just using sequential consistency everywhere to be on the safe side.
There was a problem hiding this comment.
I put sequential consistency, as you suggested, in the places that are either:
- Test only
- Expected to be called once
I went over the file and updated the memory order for all calls using relaxed semantics, there's only one place where I think it's entirely safe. I commented the places I updated explaining my reasoning.
Signed-off-by: Kevin Baichoo <kbaichoo@google.com>
KBaichoo
left a comment
There was a problem hiding this comment.
Thanks for the review :)
| } else if (failing_tid == -1) { | ||
| return FatalAction::Status::SafeActionsNotYetRan; |
There was a problem hiding this comment.
It shouldn't happen -- a lot of the module is hardened to try to prevent incorrect uses by giving guide rails to enforce the state machine of the module -- for example only one fatal action manager can be register, we run safe actions before unsafe actions, only one thread should run these actions, among others.
I added a comment about it, lmk your thoughts
| if (fatal_action_manager.compare_exchange_strong(unset_manager, mananger.get(), | ||
| std::memory_order_acq_rel)) { |
There was a problem hiding this comment.
It's the argument I discuss below about preventing incorrect use of the module.
As it's currently tied into the system only in main thread server init, this shouldn't be a problem (but it's a bug if we try to register the manager multiple times.)
| default: | ||
| // All the cases runSafeActions() returns have been covered. | ||
| NOT_REACHED_GCOVR_EXCL_LINE; |
There was a problem hiding this comment.
Yes, the compiler complains about the enum case that's not handled SafeActionsNotYetRan which isn't returned by runSafeStatus.
I think perhaps exhaustively listing them, instead of having a default like here, would be better as that way new enums added to the class will force the author to update this switch statement vs having it silently compile.
Thoughts?
| return FatalAction::Status::RunningOnAnotherThread; | ||
| } | ||
|
|
||
| FatalAction::Status runUnsafeActions() { |
There was a problem hiding this comment.
Done, they share a common implementation in an anon. namespace of the module called. Depending on a internal type, we'll run either the safe or unsafe actions.
|
|
||
| // Helper function to run fatal actions. | ||
| void runFatalActions(const FatalAction::FatalActionPtrList& actions) { | ||
| FailureFunctionList* list = fatal_error_handlers.exchange(nullptr, std::memory_order_relaxed); |
| } | ||
|
|
||
| void clearFatalActionsOnTerminate() { | ||
| auto* raw_ptr = fatal_action_manager.exchange(nullptr, std::memory_order_relaxed); |
There was a problem hiding this comment.
I put sequential consistency, as you suggested, in the places that are either:
- Test only
- Expected to be called once
I went over the file and updated the memory order for all calls using relaxed semantics, there's only one place where I think it's entirely safe. I commented the places I updated explaining my reasoning.
|
re: module hardening, IMO it just makes the code harder to read and it's unclear how or when anyone is going to use this in a way that it wasn't intended. Can we just replace with ASSERTs and revisit later if people need to use the code in a different way? re: memory ordering, is there a good reason to not just use the default option for all atomic operations? (Sequential consistency.) Again, it's hard to reason about, memory ordering is incredibly tricky to get correct, and this code is not perf critical at all, so unclear why we would trade potential bugs and harder to read code for slightly better perf? Otherwise LGTM, thanks. /wait |
… performance focused, replaced cumbersome hardening with asserts. Signed-off-by: Kevin Baichoo <kbaichoo@google.com>
|
/retest |
|
Retrying Azure Pipelines: |
|
/retest |
|
Retrying Azure Pipelines: |
mattklein123
left a comment
There was a problem hiding this comment.
Thanks LGTM with small final change.
/wait
| // Restore the fatal_error_handlers pointer so subsequent calls using the list | ||
| // can succeed. | ||
| fatal_error_handlers.store(list, std::memory_order_release); | ||
| fatal_error_handlers.store(list, std::memory_order_seq_cst); |
There was a problem hiding this comment.
nit: IIRC std::memory_order_seq_cst is the default param for all atomic ops. Can you just remove it everywhere? It will make the code easier to read.
Signed-off-by: Kevin Baichoo <kbaichoo@google.com>
Signed-off-by: Kevin Baichoo <kbaichoo@google.com>
Signed-off-by: Kevin Baichoo <kbaichoo@google.com> Signed-off-by: Qin Qin <qqin@google.com>
Signed-off-by: Kevin Baichoo kbaichoo@google.com
Commit Message: Add a Fatal Action extension point allowing users to run extensions on the current scoped tracked object that is related to the crash.
Additional Description:
Risk Level: medium
Testing: unit tests
Docs Changes: Included.
Release Notes: Included
Platform Specific Features:
Fixes #13091