Added Fatal Action extension point. #13676

Merged
mattklein123 merged 25 commits into envoyproxy:master from KBaichoo:fh-extension-pt
Nov 18, 2020
@KBaichoo
Contributor

@KBaichoo KBaichoo commented Oct 21, 2020

Signed-off-by: Kevin Baichoo <kbaichoo@google.com>

Commit Message: Add a Fatal Action extension point allowing users to run extensions on the current scoped tracked object that is related to the crash.
Additional Description:
Risk Level: medium
Testing: unit tests
Docs Changes: Included.
Release Notes: Included
Platform Specific Features:
Fixes #13091

Signed-off-by: Kevin Baichoo <kbaichoo@google.com>
@repokitteh-read-only

CC @envoyproxy/api-shepherds: Your approval is needed for changes made to api/envoy/.
CC @envoyproxy/api-watchers: FYI only for changes made to api/envoy/.

@KBaichoo
Contributor Author

/assign @akonradi

core.v3.TypedExtensionConfig config = 1;

// Whether the action is async-signal-safe.
bool safe = 2;
Contributor

Please name this more descriptively. "safe" can mean a lot of different things without context.

Member

Do we even need to put this in the config? Can this be implemented in code, as an interface method on the extension virtual base class?

Contributor Author

Good idea, I'll remove it and just leave it on the interface of the base class.

I think it's safe to assume clients want safe actions to run before unsafe actions, so we can go by what the extension implementation class reports rather than what this config says.
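The idea in this thread, letting each extension report its own async-signal-safety and running safe actions first, can be sketched roughly as follows. This is a minimal illustration, not the PR's actual code: `isAsyncSignalSafe()`, `NamedAction`, and `runAll()` are hypothetical names chosen for the example.

```cpp
#include <cassert>
#include <list>
#include <memory>
#include <string>
#include <utility>
#include <vector>

// Hypothetical stand-in for the extension interface discussed in this thread.
class FatalAction {
public:
  virtual ~FatalAction() = default;
  virtual void run(std::vector<std::string>& log) = 0;
  // The implementation, not the config, reports whether it is async-signal-safe.
  virtual bool isAsyncSignalSafe() const = 0;
};

using FatalActionPtr = std::unique_ptr<FatalAction>;
using FatalActionPtrList = std::list<FatalActionPtr>;

// Illustrative action that records its name when it runs.
class NamedAction : public FatalAction {
public:
  NamedAction(std::string name, bool safe) : name_(std::move(name)), safe_(safe) {}
  void run(std::vector<std::string>& log) override { log.push_back(name_); }
  bool isAsyncSignalSafe() const override { return safe_; }

private:
  std::string name_;
  bool safe_;
};

// Partitions actions by what each implementation reports, then runs the
// safe ones before the unsafe ones. Returns the order in which they ran.
std::vector<std::string> runAll(FatalActionPtrList actions) {
  FatalActionPtrList safe_actions;
  FatalActionPtrList unsafe_actions;
  for (auto& action : actions) {
    assert(action != nullptr);
    if (action->isAsyncSignalSafe()) {
      safe_actions.push_back(std::move(action));
    } else {
      unsafe_actions.push_back(std::move(action));
    }
  }
  std::vector<std::string> log;
  for (auto& action : safe_actions) {
    action->run(log);
  }
  for (auto& action : unsafe_actions) {
    action->run(log);
  }
  return log;
}
```

With this shape the config no longer needs a `safe` field; the ordering follows from the implementation alone.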

/**
* Whether the action is async-signal-safe.
*/
virtual bool isSafe() const PURE;
Contributor

Same, please make the name more descriptive.

Contributor Author

Done.

Comment on lines +226 to +231
auto safe_actions =
std::make_unique<std::list<const Server::Configuration::FatalActionPtr>>();
auto unsafe_actions =
std::make_unique<std::list<const Server::Configuration::FatalActionPtr>>();
FatalErrorHandler::registerFatalActions(std::move(safe_actions), std::move(unsafe_actions),
nullptr);
Contributor

Can this be done outside the EXPECT_DEATH?

Contributor Author

The issue is that we're dealing with static memory.

I'll set up a test fixture with a SetUpTestSuite() method that registers the statics once for these tests.

Contributor Author

SetUpTestSuite() combined with static memory and mocks did not work well -- I ended up adding a test-only "reset module" function to fatal_error_handler.cc and narrowing the interfaces so that mocks are no longer needed.

EXPECT_EQ(raw_safe_action->getNumTimesRan(), 1);
EXPECT_EQ(raw_unsafe_action->getNumTimesRan(), 1);
}

Contributor

Is there a way to make runSafeActions return false? If so, please test it. If not, is having a return value meaningful?

Contributor Author

@KBaichoo KBaichoo Oct 22, 2020

I'll rig up an edge scenario that triggers this -- the value is meaningful if multiple threads are racing to execute the fatal handlers.

Contributor Author

Done.
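To illustrate why the return value matters when threads race, here is a minimal sketch in which the first crashing thread claims the right to run the actions through a compare-and-swap on a thread-id slot. `RunningOnAnotherThread` and the `-1` sentinel appear in the diff; `Success`, `AlreadyRanOnThisThread`, and `tryRunSafeActions()` are assumptions made for the example.

```cpp
#include <atomic>
#include <cstdint>

// Only RunningOnAnotherThread is taken from the PR; the other values are
// hypothetical names for this sketch.
enum class Status { Success, AlreadyRanOnThisThread, RunningOnAnotherThread };

// -1 means no thread has claimed the fatal actions yet.
std::atomic<std::int64_t> failing_tid{-1};

// The caller passes its own thread id so the sketch stays deterministic.
Status tryRunSafeActions(std::int64_t my_tid) {
  std::int64_t expected = -1;
  if (failing_tid.compare_exchange_strong(expected, my_tid)) {
    // This thread won the race; the safe actions would run here.
    return Status::Success;
  }
  // On failure, `expected` now holds the id of the thread that claimed the slot.
  return expected == my_tid ? Status::AlreadyRanOnThisThread
                            : Status::RunningOnAnotherThread;
}
```

A caller that receives `RunningOnAnotherThread` knows it should not run the actions a second time.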

Signed-off-by: Kevin Baichoo <kbaichoo@google.com>
Comment on lines +6 to +7
FatalActionManager::FatalActionManager(FatalActionPtrList& safe_actions,
FatalActionPtrList& unsafe_actions, Server::Instance* server)
Contributor

Just pass these arguments by value instead of reference. They contain unique_ptrs so they aren't copyable, so you'll need to wrap in std::move() at the call site.

Contributor Author

Done, the former ended up working 👍
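The pass-by-value suggestion can be sketched as below. The `Action` and `Manager` types are simplified, hypothetical stand-ins rather than the PR's classes; the point is only that a list of `unique_ptr`s taken by value forces an explicit `std::move()` at the call site.

```cpp
#include <cstddef>
#include <list>
#include <memory>
#include <utility>

// Simplified stand-in for a fatal action (hypothetical).
struct Action {
  int id;
};
using ActionPtrList = std::list<std::unique_ptr<Action>>;

class Manager {
public:
  // Taken by value: the lists are move-only, so callers must std::move()
  // them in, which makes the ownership transfer visible at the call site.
  Manager(ActionPtrList safe_actions, ActionPtrList unsafe_actions)
      : safe_actions_(std::move(safe_actions)), unsafe_actions_(std::move(unsafe_actions)) {}

  std::size_t safeCount() const { return safe_actions_.size(); }
  std::size_t unsafeCount() const { return unsafe_actions_.size(); }

private:
  ActionPtrList safe_actions_;
  ActionPtrList unsafe_actions_;
};
```

Passing a named list without `std::move()` would fail to compile, because the list's copy constructor cannot copy its `unique_ptr` elements.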

Comment on lines +113 to +114
if (fatal_action_manager.compare_exchange_strong(unset_manager, mananger.get(),
std::memory_order_acq_rel)) {
Contributor

How would two concurrent calls progress beyond the register_actions check? I suppose since the access to registered_actions isn't atomic, they could race on the read/write, but that sounds bad, and like a separate bug that needs to be fixed. Alternatively, remove register_actions entirely and rely on this compare-and-swap for correctness.
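Relying on the compare-and-swap alone for correctness, as suggested here, might look like the following sketch. `registerManager()` and `resetManagerForTest()` are hypothetical helpers and `FatalActionManager` is reduced to a placeholder struct; only the `compare_exchange_strong` pattern mirrors the quoted diff.

```cpp
#include <atomic>
#include <memory>

// Placeholder for the real manager type.
struct FatalActionManager {
  int id;
};

// Module-level slot, analogous to the fatal_action_manager atomic in the diff.
std::atomic<FatalActionManager*> fatal_action_manager{nullptr};

// Returns true if this call installed the manager, false if one was already
// registered. No separate "registered" flag is needed: the swap decides.
bool registerManager(std::unique_ptr<FatalActionManager> manager) {
  FatalActionManager* unset_manager = nullptr;
  if (fatal_action_manager.compare_exchange_strong(unset_manager, manager.get())) {
    manager.release(); // Ownership now lives in the atomic slot.
    return true;
  }
  return false; // `manager` is destroyed here; the first registration is kept.
}

// Test-only style helper that clears the slot and frees the manager.
void resetManagerForTest() { delete fatal_action_manager.exchange(nullptr); }
```

Because the boolean result is observable at the call site, a second registration can be treated as a bug by the caller instead of failing silently.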

Signed-off-by: Kevin Baichoo <kbaichoo@google.com>
Signed-off-by: Kevin Baichoo <kbaichoo@google.com>
Signed-off-by: Kevin Baichoo <kbaichoo@google.com>
Signed-off-by: Kevin Baichoo <kbaichoo@google.com>
Signed-off-by: Kevin Baichoo <kbaichoo@google.com>
Signed-off-by: Kevin Baichoo <kbaichoo@google.com>
…ijacked depending on compile options.

Signed-off-by: Kevin Baichoo <kbaichoo@google.com>
@KBaichoo
Contributor Author

/retest

@repokitteh-read-only

Retrying Azure Pipelines:
Retried failed jobs in: envoy-presubmit

Contributor

@akonradi akonradi left a comment

Looking good, mostly nits.

Comment on lines +74 to +75
FatalErrorHandler::registerFatalActions(std::move(safe_actions_), std::move(unsafe_actions_),
Thread::threadFactoryForTest());
Contributor

I would expect this double registration to indicate a fatal bug in the code since it doesn't do anything, and that's unexpected. If this call fails, that should be observable at the call site, either by crashing or returning an error value.

Contributor Author

It no longer silently fails; it calls ENVOY_BUG.

…ore granularity when reporting failures, modified signalaction to take advantage of it.

Signed-off-by: Kevin Baichoo <kbaichoo@google.com>
…atters.

Signed-off-by: Kevin Baichoo <kbaichoo@google.com>
Signed-off-by: Kevin Baichoo <kbaichoo@google.com>
akonradi previously approved these changes Nov 5, 2020
Contributor

@akonradi akonradi left a comment

LGTM modulo nits

Signed-off-by: Kevin Baichoo <kbaichoo@google.com>
Signed-off-by: Kevin Baichoo <kbaichoo@google.com>
Signed-off-by: Kevin Baichoo <kbaichoo@google.com>
Signed-off-by: Kevin Baichoo <kbaichoo@google.com>
@KBaichoo
Contributor Author

KBaichoo commented Nov 5, 2020

PTAL @envoyproxy/api-shepherds

// Fatal actions to run while crashing.
// We will run all safe actions before we run unsafe actions.
message FatalAction {
// Extension specific configuration for the action.
Member

@envoyproxy/api-shepherds per discussion earlier today, should we have some declaration of the interface that extensions must conform to? The problem with that in general is that this might be different in Envoy/gRPC, but in this specific case, this is a bootstrap and Envoy-only extension point, so I think that would be pretty useful.

Contributor Author

Discussed this a bit out of band with Harvey.

The comments for the extension now point to the interface it's expected to conform to, and where extensions should live.

Thoughts @envoyproxy/api-shepherds ? Thanks!

Signed-off-by: Kevin Baichoo <kbaichoo@google.com>
Signed-off-by: Kevin Baichoo <kbaichoo@google.com>
Member

@mattklein123 mattklein123 left a comment

Thanks, a few more comments from another pass. Neat!

/wait

Comment on lines +113 to +114
if (fatal_action_manager.compare_exchange_strong(unset_manager, mananger.get(),
std::memory_order_acq_rel)) {
Member

This all runs on the main thread during server init. I'm confused why we need anything special here?


// Helper function to run fatal actions.
void runFatalActions(const FatalAction::FatalActionPtrList& actions) {
FailureFunctionList* list = fatal_error_handlers.exchange(nullptr, std::memory_order_relaxed);
Member

Can you add more comments in this function about the atomic operations that we are doing and why we do them?

Contributor Author

Done.

return FatalAction::Status::RunningOnAnotherThread;
}

FatalAction::Status runUnsafeActions() {
Member

This is basically the same function as the one above. Can you have a functor or pass whether to run safe or unsafe to a common function?

Contributor Author

Done, they share a common implementation in an anonymous namespace of the module. Depending on an internal type, we'll run either the safe or unsafe actions.
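A shared helper of the kind described here could be shaped roughly like this. It is a sketch under assumptions: `FatalActionType`, `Manager`, and `runActions()` are hypothetical, and `std::function` is used only to keep the example short (it is not a claim about what is safe to call from a signal handler).

```cpp
#include <functional>
#include <list>

// Hypothetical internal type selecting which list to run.
enum class FatalActionType { Safe, Unsafe };

// Simplified manager holding both lists of callbacks.
struct Manager {
  std::list<std::function<void()>> safe_actions;
  std::list<std::function<void()>> unsafe_actions;
};

namespace {

// Common implementation: picks the list from the type and runs every action.
// Returns how many actions ran.
int runActions(Manager& manager, FatalActionType type) {
  auto& actions =
      (type == FatalActionType::Safe) ? manager.safe_actions : manager.unsafe_actions;
  int ran = 0;
  for (auto& action : actions) {
    action();
    ++ran;
  }
  return ran;
}

} // namespace

// The two public entry points become thin wrappers over the shared helper.
int runSafeActions(Manager& manager) { return runActions(manager, FatalActionType::Safe); }
int runUnsafeActions(Manager& manager) { return runActions(manager, FatalActionType::Unsafe); }
```

Keeping the helper in an anonymous namespace limits it to the translation unit, so only the two wrappers are visible to callers.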

Comment on lines +53 to +55
default:
// All the cases runSafeActions() returns have been covered.
NOT_REACHED_GCOVR_EXCL_LINE;
Member

Is this actually needed to compile? Otherwise I would remove.

Contributor Author

Yes, the compiler complains that the enum case SafeActionsNotYetRan is not handled, even though runSafeActions() never returns it.

I think listing the cases exhaustively, instead of having a default as here, might be better: a new enumerator added to the class would then force the author to update this switch statement rather than having it compile silently.

Thoughts?
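The trade-off raised here can be shown with a small sketch. With no `default:` label, GCC and Clang (with `-Wswitch`, part of `-Wall`) warn when an enumerator is missing from the switch. `RunningOnAnotherThread` and `SafeActionsNotYetRan` come from the diff; `Success` and `describe()` are assumptions for the example.

```cpp
#include <cstdlib>
#include <string>

// Success is a hypothetical value; the other two appear in the quoted diff.
enum class Status { Success, RunningOnAnotherThread, SafeActionsNotYetRan };

std::string describe(Status status) {
  // No default label: adding a new enumerator without a case here produces
  // a -Wswitch warning instead of compiling silently.
  switch (status) {
  case Status::Success:
    return "ran";
  case Status::RunningOnAnotherThread:
    return "running on another thread";
  case Status::SafeActionsNotYetRan:
    return "safe actions not yet run";
  }
  // Only reachable if `status` holds a value outside the enumerators.
  std::abort();
}
```

The cost is one explicit case for a status the caller never expects; the benefit is that the compiler flags the switch whenever the enum grows.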

Comment on lines +159 to +160
} else if (failing_tid == -1) {
return FatalAction::Status::SafeActionsNotYetRan;
Member

How can this happen? More comments?

Contributor Author

It shouldn't happen -- much of the module is hardened against incorrect use, with guard rails that enforce the module's state machine: for example, only one fatal action manager can be registered, safe actions run before unsafe actions, and only one thread should run these actions.

I added a comment about it, lmk your thoughts

}

void clearFatalActionsOnTerminate() {
auto* raw_ptr = fatal_action_manager.exchange(nullptr, std::memory_order_relaxed);
Member

Is relaxed ordering here and elsewhere correct? I think you want sequential consistency for this. This is not perf critical, so I would recommend just using sequential consistency everywhere to be on the safe side.

Contributor Author

I used sequential consistency, as you suggested, in the places that are either:

  1. Test only
  2. Expected to be called once

I went over the file and updated the memory order for all calls that used relaxed semantics; there's only one place where I think relaxed is entirely safe. I commented the places I updated, explaining my reasoning.

Signed-off-by: Kevin Baichoo <kbaichoo@google.com>
Contributor Author

@KBaichoo KBaichoo left a comment

Thanks for the review :)

Comment on lines +113 to +114
if (fatal_action_manager.compare_exchange_strong(unset_manager, mananger.get(),
std::memory_order_acq_rel)) {
Contributor Author

It's the argument I discuss below about preventing incorrect use of the module.

As it's currently tied into the system only in main thread server init, this shouldn't be a problem (but it's a bug if we try to register the manager multiple times.)

@mattklein123
Member

re: module hardening, IMO it just makes the code harder to read and it's unclear how or when anyone is going to use this in a way that it wasn't intended. Can we just replace with ASSERTs and revisit later if people need to use the code in a different way?

re: memory ordering, is there a good reason to not just use the default option for all atomic operations? (Sequential consistency.) Again, it's hard to reason about, memory ordering is incredibly tricky to get correct, and this code is not perf critical at all, so unclear why we would trade potential bugs and harder to read code for slightly better perf?

Otherwise LGTM, thanks.

/wait

… performance focused, replaced cumbersome hardening with asserts.

Signed-off-by: Kevin Baichoo <kbaichoo@google.com>
@KBaichoo
Contributor Author

/retest

@repokitteh-read-only

Retrying Azure Pipelines:
Check envoy-presubmit isn't fully completed, but will still attempt retrying.
Retried failed jobs in: envoy-presubmit

@KBaichoo
Contributor Author

/retest

@repokitteh-read-only

Retrying Azure Pipelines:
Retried failed jobs in: envoy-presubmit

Member

@mattklein123 mattklein123 left a comment

Thanks, LGTM with a small final change.

/wait

  // Restore the fatal_error_handlers pointer so subsequent calls using the list
  // can succeed.
- fatal_error_handlers.store(list, std::memory_order_release);
+ fatal_error_handlers.store(list, std::memory_order_seq_cst);
Member

nit: IIRC std::memory_order_seq_cst is the default param for all atomic ops. Can you just remove it everywhere? It will make the code easier to read.

Contributor Author

Whoops. Done.
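For reference, the point of this nit is that `std::atomic` operations default to `std::memory_order_seq_cst`, so the explicit argument can be dropped without changing behavior. A minimal sketch, where `slot` and `takeAndRestore()` are hypothetical names rather than the PR's code:

```cpp
#include <atomic>

// Hypothetical atomic pointer slot, similar in spirit to fatal_error_handlers.
std::atomic<int*> slot{nullptr};

// Takes the pointer out of the slot, then restores it so later callers can
// still use it. Both calls use the default ordering.
int* takeAndRestore() {
  int* taken = slot.exchange(nullptr); // same as exchange(nullptr, std::memory_order_seq_cst)
  slot.store(taken);                   // same as store(taken, std::memory_order_seq_cst)
  return taken;
}
```

Omitting the argument keeps call sites short and avoids suggesting that a weaker ordering was considered and chosen.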

Signed-off-by: Kevin Baichoo <kbaichoo@google.com>
Member

@mattklein123 mattklein123 left a comment

Awesome, thanks!

@mattklein123 mattklein123 merged commit 6e5227e into envoyproxy:master Nov 18, 2020
andreyprezotto pushed a commit to andreyprezotto/envoy that referenced this pull request Nov 24, 2020
Signed-off-by: Kevin Baichoo <kbaichoo@google.com>
qqustc pushed a commit to qqustc/envoy that referenced this pull request Nov 24, 2020
Signed-off-by: Kevin Baichoo <kbaichoo@google.com>
Signed-off-by: Qin Qin <qqin@google.com>
Development

Successfully merging this pull request may close these issues.

Add an extension point for Fatal Handlers