Added Fatal Action extension point. by KBaichoo · Pull Request #13676 · envoyproxy/envoy

KBaichoo · 2020-10-21T15:18:13Z

Commit Message: Add a Fatal Action extension point allowing users to run extensions on the current scoped tracked object that is related to the crash.
Additional Description:
Risk Level: medium
Testing: unit tests
Docs Changes: Included.
Release Notes: Included
Platform Specific Features:
Fixes #13091

Signed-off-by: Kevin Baichoo <kbaichoo@google.com>

repokitteh-read-only · 2020-10-21T15:18:20Z

CC @envoyproxy/api-shepherds: Your approval is needed for changes made to api/envoy/.
CC @envoyproxy/api-watchers: FYI only for changes made to api/envoy/.

🐱

Caused by: #13676 was opened by KBaichoo.

KBaichoo · 2020-10-21T15:22:47Z

/assign @akonradi

akonradi · 2020-10-21T15:25:53Z

api/envoy/config/bootstrap/v3/bootstrap.proto

+  core.v3.TypedExtensionConfig config = 1;
+
+  // Whether the action is async-signal-safe.
+  bool safe = 2;


Please name this more descriptively. "safe" can mean a lot of different things without context.

Do we even need to put this in the config? Can this be implemented in code, as an interface method on the extension virtual base class?

Good idea, I'll remove it and just leave it on the interface of the base class.

I think it's safe to assume that clients want safe actions to run before unsafe actions, so we can go by what the extension implementation class reports rather than what this config says.
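
A minimal sketch of the approach agreed on in this thread: the action itself reports whether it is async-signal-safe, and the caller uses that to split actions into two groups. The class shape, the method name `isAsyncSignalSafe()`, the helper `partitionActions()`, and `CountingAction` are illustrative assumptions, not the code from this PR.

```cpp
#include <memory>
#include <utility>
#include <vector>

// Hypothetical stand-in for the extension base class; not Envoy's real header.
class FatalAction {
public:
  virtual ~FatalAction() = default;
  // Runs the action while the process is crashing.
  virtual void run() = 0;
  // The implementation, not the config, declares whether it is async-signal-safe.
  virtual bool isAsyncSignalSafe() const = 0;
};

using FatalActionPtr = std::unique_ptr<FatalAction>;

struct PartitionedActions {
  std::vector<FatalActionPtr> safe;
  std::vector<FatalActionPtr> unsafe;
};

// Splits the actions so that the safe ones can be run before the unsafe ones.
PartitionedActions partitionActions(std::vector<FatalActionPtr> actions) {
  PartitionedActions out;
  for (auto& action : actions) {
    auto& bucket = action->isAsyncSignalSafe() ? out.safe : out.unsafe;
    bucket.push_back(std::move(action));
  }
  return out;
}

// Example action used only for illustration.
class CountingAction : public FatalAction {
public:
  explicit CountingAction(bool safe) : safe_(safe) {}
  void run() override { ++runs_; }
  bool isAsyncSignalSafe() const override { return safe_; }
  int runs() const { return runs_; }

private:
  bool safe_;
  int runs_{0};
};
```

With this shape, the bootstrap config only needs to carry the typed extension config; the ordering decision is derived from the implementation.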

akonradi · 2020-10-21T15:26:28Z

include/envoy/server/fatal_action_config.h

+  /**
+   *  Whether the action is async-signal-safe.
+   */
+  virtual bool isSafe() const PURE;


Same, please make the name more descriptive.

source/common/signal/fatal_action.h

akonradi · 2020-10-21T17:35:59Z

test/common/signal/signals_test.cc

+        auto safe_actions =
+            std::make_unique<std::list<const Server::Configuration::FatalActionPtr>>();
+        auto unsafe_actions =
+            std::make_unique<std::list<const Server::Configuration::FatalActionPtr>>();
+        FatalErrorHandler::registerFatalActions(std::move(safe_actions), std::move(unsafe_actions),
+                                                nullptr);


Can this be done outside the EXPECT_DEATH?

The issue is that we're dealing with static memory.

I'll set up a test fixture with a SetUpTestSuite() method that registers the statics once for these tests.

SetUpTestSuite() combined with static memory and mocks turned out badly. I ended up adding a test-only "reset module" function to fatal_error_handler.cc and narrowing the interfaces so that mocks aren't needed.

akonradi · 2020-10-21T17:38:08Z

test/common/signal/signals_test.cc

+  EXPECT_EQ(raw_safe_action->getNumTimesRan(), 1);
+  EXPECT_EQ(raw_unsafe_action->getNumTimesRan(), 1);
+}
+


Is there a way to make runSafeActions return false? If so, please test it. If not, is having a return value meaningful?

I'll rig up an edge scenario that triggers this. The value is meaningful if multiple threads are racing to execute the fatal handlers.
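
One way such a race could be resolved, as a sketch only: the first thread to claim an atomic thread-id slot runs the actions, and any other thread gets a distinct return value. The `failing_tid` name and the `-1` sentinel mirror hunks quoted later in this review; the `Status::Success` enumerator and passing the thread id as a parameter are assumptions made to keep the example self-contained.

```cpp
#include <atomic>
#include <cstdint>

enum class Status { Success, RunningOnAnotherThread };

// -1 means no thread has started running fatal actions yet.
std::atomic<int64_t> failing_tid{-1};

// The first caller wins the compare-and-swap and runs the actions. A caller
// with a different id is told that another thread already owns the run.
Status runSafeActions(int64_t my_tid) {
  int64_t expected = -1;
  if (failing_tid.compare_exchange_strong(expected, my_tid)) {
    // ... the safe actions would run here ...
    return Status::Success;
  }
  // On failure, compare_exchange_strong wrote the current owner into `expected`.
  if (expected == my_tid) {
    // This thread already claimed the slot earlier; nothing is re-run.
    return Status::Success;
  }
  return Status::RunningOnAnotherThread;
}
```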

test/mocks/server/fatal_action_factory.cc

test/mocks/server/fatal_action_factory.h

test/server/test_data/server/fatal_actions.yaml

source/common/signal/fatal_action.cc

akonradi · 2020-10-22T20:39:07Z

source/common/signal/fatal_action.cc

+FatalActionManager::FatalActionManager(FatalActionPtrList& safe_actions,
+                                       FatalActionPtrList& unsafe_actions, Server::Instance* server)


Just pass these arguments by value instead of reference. They contain unique_ptrs so they aren't copyable, so you'll need to wrap in std::move() at the call site.

Done, the former ended up working 👍
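
A short sketch of the pass-by-value suggestion, assuming simplified stand-in types (`Action`, `Manager`) rather than the PR's classes. Because the lists hold `std::unique_ptr`s they cannot be copied, so the call site has to hand them over with `std::move()`.

```cpp
#include <cstddef>
#include <list>
#include <memory>
#include <utility>

struct Action {
  int id;
};
using ActionPtr = std::unique_ptr<Action>;
using ActionPtrList = std::list<ActionPtr>;

class Manager {
public:
  // Taking the lists by value makes the ownership transfer explicit at the
  // call site: a caller that forgets std::move() gets a compile error.
  Manager(ActionPtrList safe_actions, ActionPtrList unsafe_actions)
      : safe_actions_(std::move(safe_actions)), unsafe_actions_(std::move(unsafe_actions)) {}

  std::size_t safeCount() const { return safe_actions_.size(); }
  std::size_t unsafeCount() const { return unsafe_actions_.size(); }

private:
  ActionPtrList safe_actions_;
  ActionPtrList unsafe_actions_;
};
```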

akonradi · 2020-10-22T20:46:01Z

source/common/signal/fatal_error_handler.cc

+  if (fatal_action_manager.compare_exchange_strong(unset_manager, mananger.get(),
+                                                   std::memory_order_acq_rel)) {


How would two concurrent calls progress beyond the register_actions check? I suppose since the access to registered_actions isn't atomic, they could race on the read/write, but that sounds bad, and like a separate bug that needs to be fixed. Alternatively, remove register_actions entirely and rely on this compare-and-swap for correctness.
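
A sketch of the second alternative, relying on the compare-and-swap alone and surfacing the result to the caller. The empty `FatalActionManager` struct and the `registerManager()` name are placeholders for illustration, not the PR's definitions.

```cpp
#include <atomic>
#include <memory>

// Placeholder type; the real manager holds the action lists.
struct FatalActionManager {};

std::atomic<FatalActionManager*> fatal_action_manager{nullptr};

// Returns true if this call installed the manager, false if one was already
// registered. No separate "registered" flag is needed: the pointer is the flag.
bool registerManager(std::unique_ptr<FatalActionManager> manager) {
  FatalActionManager* unset_manager = nullptr;
  if (fatal_action_manager.compare_exchange_strong(unset_manager, manager.get())) {
    // Ownership now lives in the global pointer.
    (void)manager.release();
    return true;
  }
  // `manager` is destroyed on return; the existing registration is untouched.
  return false;
}
```

Returning a value (or asserting) keeps a failed second registration visible at the call site instead of silently doing nothing.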

source/common/signal/fatal_error_handler.cc

KBaichoo · 2020-10-28T20:49:16Z

/retest

repokitteh-read-only · 2020-10-28T20:49:21Z

Retrying Azure Pipelines:
Retried failed jobs in: envoy-presubmit

🐱

Caused by: a #13676 (comment) was created by @KBaichoo.

akonradi

Looking good, mostly nits.

source/common/signal/fatal_error_handler.cc

source/server/server.cc

test/common/signal/fatal_action_test.cc

akonradi · 2020-10-28T20:44:42Z

test/common/signal/fatal_action_test.cc

+  FatalErrorHandler::registerFatalActions(std::move(safe_actions_), std::move(unsafe_actions_),
+                                          Thread::threadFactoryForTest());


I would expect this double registration to indicate a fatal bug in the code since it doesn't do anything, and that's unexpected. If this call fails, that should be observable at the call site, either by crashing or returning an error value.

It no longer silently fails; it calls ENVOY_BUG.

test/common/signal/signals_test.cc

akonradi

LGTM modulo nits

source/common/signal/fatal_error_handler.h

source/common/event/dispatcher_impl.h

test/common/signal/signals_test.cc

KBaichoo · 2020-11-05T20:08:58Z

PTAL @envoyproxy/api-shepherds

htuch · 2020-11-09T21:45:05Z

api/envoy/config/bootstrap/v3/bootstrap.proto

+// Fatal actions to run while crashing.
+// We will run all safe actions before we run unsafe actions.
+message FatalAction {
+  // Extension specific configuration for the action.


@envoyproxy/api-shepherds per discussion earlier today, should we have some declaration of the interface that extensions must conform to? The problem with that in general is that this might be different in Envoy/gRPC, but in this specific case, this is a bootstrap and Envoy-only extension point, so I think that would be pretty useful.

Discussed this a bit out of band with Harvey.

The comments for the extension now point to the interface it's expected to conform to, and to where extensions should live.

Thoughts @envoyproxy/api-shepherds ? Thanks!

KBaichoo · 2020-11-16T22:16:29Z

Thanks for the review!

I updated https://www.envoyproxy.io/docs/envoy/latest/extending/extending and the "source/extensions layout" section of https://github.com/envoyproxy/envoy/blob/master/REPO_LAYOUT.md.

mattklein123

Thanks, a few more comments from another pass. Neat!

/wait

source/common/signal/fatal_error_handler.cc

mattklein123 · 2020-11-17T00:41:20Z

source/common/signal/fatal_error_handler.cc

+  if (fatal_action_manager.compare_exchange_strong(unset_manager, mananger.get(),
+                                                   std::memory_order_acq_rel)) {


This all runs on the main thread during server init. I'm confused why we need anything special here?

mattklein123 · 2020-11-17T00:42:25Z

source/common/signal/fatal_error_handler.cc

+
+// Helper function to run fatal actions.
+void runFatalActions(const FatalAction::FatalActionPtrList& actions) {
+  FailureFunctionList* list = fatal_error_handlers.exchange(nullptr, std::memory_order_relaxed);


Can you add more comments in this function about the atomic operations that we are doing and why we do them?
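
To illustrate the kind of commentary being requested, here is a sketch of the exchange-then-restore pattern visible in the hunk, with each atomic step annotated. `HandlerList`, `handlers`, and `runHandlers()` are simplified stand-ins and not the PR's code.

```cpp
#include <atomic>
#include <functional>
#include <list>

using HandlerList = std::list<std::function<void()>>;

std::atomic<HandlerList*> handlers{nullptr};

// Returns false if the list was unavailable (never set, or currently taken
// by another caller).
bool runHandlers() {
  // Swap the pointer out so that no other thread can reach the list while
  // this thread walks it; concurrent callers see nullptr and back off.
  HandlerList* list = handlers.exchange(nullptr);
  if (list == nullptr) {
    return false;
  }
  for (const auto& handler : *list) {
    handler();
  }
  // Put the pointer back so that later calls can use the list again.
  handlers.store(list);
  return true;
}
```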

mattklein123 · 2020-11-17T00:43:29Z

source/common/signal/fatal_error_handler.cc

+  return FatalAction::Status::RunningOnAnotherThread;
+}
+
+FatalAction::Status runUnsafeActions() {


This is basically the same function as the one above. Can you have a functor or pass whether to run safe or unsafe to a common function?

Done, they share a common implementation in an anonymous namespace of the module. Depending on an internal type, we'll run either the safe or unsafe actions.
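
A sketch of what such a shared implementation might look like, assuming an internal `ActionType` selector and a helper named `runActionsInternal()` (both hypothetical), with plain callables standing in for the action objects. It also enforces that safe actions run before unsafe ones.

```cpp
#include <functional>
#include <vector>

enum class Status { Success, SafeActionsNotYetRan };

namespace {

// Internal selector; not part of the public interface.
enum class ActionType { Safe, Unsafe };

std::vector<std::function<void()>> safe_actions;
std::vector<std::function<void()>> unsafe_actions;
bool safe_actions_ran = false;

// Single implementation shared by both public entry points.
Status runActionsInternal(ActionType type) {
  if (type == ActionType::Unsafe && !safe_actions_ran) {
    // Unsafe actions must not run before the safe ones.
    return Status::SafeActionsNotYetRan;
  }
  const auto& actions = (type == ActionType::Safe) ? safe_actions : unsafe_actions;
  for (const auto& action : actions) {
    action();
  }
  if (type == ActionType::Safe) {
    safe_actions_ran = true;
  }
  return Status::Success;
}

} // namespace

Status runSafeActions() { return runActionsInternal(ActionType::Safe); }
Status runUnsafeActions() { return runActionsInternal(ActionType::Unsafe); }
```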

mattklein123 · 2020-11-17T00:46:11Z

source/common/signal/signal_action.cc

+  default:
+    // All the cases runSafeActions() returns have been covered.
+    NOT_REACHED_GCOVR_EXCL_LINE;


Is this actually needed to compile? Otherwise I would remove.

Yes, the compiler complains that the enum case SafeActionsNotYetRan is not handled, even though runSafeActions() never returns it.

I think perhaps exhaustively listing them, instead of having a default like here, would be better as that way new enums added to the class will force the author to update this switch statement vs having it silently compile.

Thoughts?
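
For illustration, a sketch of the exhaustive-switch style proposed here. The enumerator set and the `describe()` function are assumptions for the example. With no `default:` label, GCC and Clang's `-Wswitch` (enabled by `-Wall`) warns when a newly added enumerator has no matching case.

```cpp
#include <string>

enum class Status { Success, RunningOnAnotherThread, SafeActionsNotYetRan };

std::string describe(Status status) {
  // Every enumerator is listed and there is no default label, so adding a
  // new Status value without updating this switch triggers -Wswitch.
  switch (status) {
  case Status::Success:
    return "ran";
  case Status::RunningOnAnotherThread:
    return "running on another thread";
  case Status::SafeActionsNotYetRan:
    return "safe actions not yet ran";
  }
  // Only reachable if `status` holds a value outside the enumerators.
  return "unknown";
}
```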

mattklein123 · 2020-11-17T00:48:20Z

source/common/signal/fatal_error_handler.cc

+  } else if (failing_tid == -1) {
+    return FatalAction::Status::SafeActionsNotYetRan;


How can this happen? More comments?

It shouldn't happen. Much of the module is hardened against incorrect use, with guardrails that enforce the module's state machine: for example, only one fatal action manager can be registered, safe actions run before unsafe actions, and only one thread should run these actions, among other things.

I added a comment about it, lmk your thoughts

mattklein123 · 2020-11-17T00:49:22Z

source/common/signal/fatal_error_handler.cc

+}
+
+void clearFatalActionsOnTerminate() {
+  auto* raw_ptr = fatal_action_manager.exchange(nullptr, std::memory_order_relaxed);


Is relaxed here and elsewhere correct? I think you want sequential consistency for this? This is not perf critical so I would recommend just using sequential consistency everywhere to be on the safe side.

I used sequential consistency, as you suggested, in the places that are either:

- Test only
- Expected to be called once

I went over the file and updated the memory order for all calls using relaxed semantics; there's only one place where I think relaxed is entirely safe. I commented the places I updated, explaining my reasoning.

KBaichoo

Thanks for the review :)

KBaichoo · 2020-11-17T14:13:30Z

source/common/signal/fatal_error_handler.cc

+  if (fatal_action_manager.compare_exchange_strong(unset_manager, mananger.get(),
+                                                   std::memory_order_acq_rel)) {


It's the argument I discuss below about preventing incorrect use of the module.

Since it's currently only wired into the system during main-thread server init, this shouldn't be a problem (but it's a bug if we try to register the manager multiple times).

mattklein123 · 2020-11-17T23:53:29Z

re: module hardening, IMO it just makes the code harder to read and it's unclear how or when anyone is going to use this in a way that it wasn't intended. Can we just replace with ASSERTs and revisit later if people need to use the code in a different way?

re: memory ordering, is there a good reason to not just use the default option for all atomic operations? (Sequential consistency.) Again, it's hard to reason about, memory ordering is incredibly tricky to get correct, and this code is not perf critical at all, so unclear why we would trade potential bugs and harder to read code for slightly better perf?

Otherwise LGTM, thanks.

/wait

KBaichoo · 2020-11-18T16:34:40Z

/retest

repokitteh-read-only · 2020-11-18T16:34:44Z

Retrying Azure Pipelines:
Check envoy-presubmit isn't fully completed, but will still attempt retrying.
Retried failed jobs in: envoy-presubmit

🐱

Caused by: a #13676 (comment) was created by @KBaichoo.

KBaichoo · 2020-11-18T17:33:36Z

/retest

repokitteh-read-only · 2020-11-18T17:33:41Z

Retrying Azure Pipelines:
Retried failed jobs in: envoy-presubmit

🐱

Caused by: a #13676 (comment) was created by @KBaichoo.

mattklein123

Thanks, LGTM with a small final change.

/wait

mattklein123 · 2020-11-18T18:41:12Z

source/common/signal/fatal_error_handler.cc

  // Restore the fatal_error_handlers pointer so subsequent calls using the list
  // can succeed.
-  fatal_error_handlers.store(list, std::memory_order_release);
+  fatal_error_handlers.store(list, std::memory_order_seq_cst);


nit: IIRC std::memory_order_seq_cst is the default param for all atomic ops. Can you just remove it everywhere? It will make the code easier to read.

Whoops. Done.
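
A tiny sketch of the point made in this nit: `std::atomic` member functions default their memory-order argument to `std::memory_order_seq_cst`, so the calls below behave the same as if it were spelled out. The `state` variable and `cycle()` function are purely illustrative.

```cpp
#include <atomic>

std::atomic<int> state{0};

// Every operation here uses std::memory_order_seq_cst because no order
// argument is passed.
int cycle() {
  state.store(1);
  int expected = 1;
  state.compare_exchange_strong(expected, 2); // succeeds: state was 1
  int previous = state.exchange(3);           // previous is 2
  return previous * 10 + state.load();        // 2 * 10 + 3
}
```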

mattklein123

Awesome, thanks!

Added Fatal Action extension point.

1570d2c

Signed-off-by: Kevin Baichoo <kbaichoo@google.com>

repokitteh-read-only bot added the api label Oct 21, 2020

repokitteh-read-only bot assigned akonradi Oct 21, 2020

akonradi suggested changes Oct 21, 2020

Various changes: cleaning up interfaces, among others.

2ac60e6

Signed-off-by: Kevin Baichoo <kbaichoo@google.com>

akonradi reviewed Oct 22, 2020

KBaichoo added 6 commits October 26, 2020 13:26

Cleaned up tests, and interfaces.

32fb212

Signed-off-by: Kevin Baichoo <kbaichoo@google.com>

added release notes.

341d969

Signed-off-by: Kevin Baichoo <kbaichoo@google.com>

Fixed failing tests.

91030d9

Signed-off-by: Kevin Baichoo <kbaichoo@google.com>

Added Fatal Actions to extending envoy documentation.

502e609

Signed-off-by: Kevin Baichoo <kbaichoo@google.com>

Clang-tidy and test fixes.

ee15888

Signed-off-by: Kevin Baichoo <kbaichoo@google.com>

Fixed asan issue.

564668d

Signed-off-by: Kevin Baichoo <kbaichoo@google.com>

KBaichoo mentioned this pull request Oct 28, 2020

utilities: Implemented an ostream that writes to a user provided buffer #13797

Merged

Modified death message as it relies on signal handlers that can get h…

c4e105c

…ijacked depending on compile options. Signed-off-by: Kevin Baichoo <kbaichoo@google.com>

akonradi reviewed Oct 28, 2020

KBaichoo added 3 commits October 30, 2020 15:28

Changed the FatalAction functions to return a status to account for m…

4170d6d

…ore granularity when reporting failures, modified signalaction to take advantage of it. Signed-off-by: Kevin Baichoo <kbaichoo@google.com>

Fixed Asan hijacking signals for tests where the death test message m…

c5dfe7a

…atters. Signed-off-by: Kevin Baichoo <kbaichoo@google.com>

Merge remote-tracking branch 'upstream/master' into fh-extension-pt

5faddf0

Signed-off-by: Kevin Baichoo <kbaichoo@google.com>

akonradi previously approved these changes Nov 5, 2020

Minor nits.

9bc90ba

Signed-off-by: Kevin Baichoo <kbaichoo@google.com>

KBaichoo dismissed akonradi’s stale review via 9bc90ba November 5, 2020 15:51

KBaichoo added 3 commits November 5, 2020 15:53

Merge remote-tracking branch 'upstream/master' into fh-extension-pt

a35f53a

Signed-off-by: Kevin Baichoo <kbaichoo@google.com>

Made test size larger to see if that helps with windows timeout.

4f96adf

Signed-off-by: Kevin Baichoo <kbaichoo@google.com>

Removed annotations to debug test.

d8a605d

Signed-off-by: Kevin Baichoo <kbaichoo@google.com>

htuch reviewed Nov 9, 2020

repokitteh-read-only bot added the waiting label Nov 16, 2020

KBaichoo added 2 commits November 16, 2020 18:52

Merge remote-tracking branch 'upstream/master' into fh-extension-pt

bff6e4f

Signed-off-by: Kevin Baichoo <kbaichoo@google.com>

Minor comments, updated docs.

b023b5c

Signed-off-by: Kevin Baichoo <kbaichoo@google.com>

repokitteh-read-only bot removed the waiting label Nov 16, 2020

mattklein123 requested changes Nov 17, 2020

repokitteh-read-only bot added the waiting label Nov 17, 2020

Cleaned up code, updated relaxed memory order usages.

bcaee64

Signed-off-by: Kevin Baichoo <kbaichoo@google.com>

repokitteh-read-only bot removed the waiting label Nov 17, 2020

KBaichoo commented Nov 17, 2020

repokitteh-read-only bot added the waiting label Nov 17, 2020

Moved atomic ops to use sequential consistency since the module isn't…

0dce2af

… performance focused, replaced cumbersome hardening with asserts. Signed-off-by: Kevin Baichoo <kbaichoo@google.com>

repokitteh-read-only bot removed the waiting label Nov 18, 2020

mattklein123 requested changes Nov 18, 2020

repokitteh-read-only bot added the waiting label Nov 18, 2020

Removed redundant seq_cst since atomic ops have it as the default param.

9fcfd5f

Signed-off-by: Kevin Baichoo <kbaichoo@google.com>

repokitteh-read-only bot removed the waiting label Nov 18, 2020

mattklein123 approved these changes Nov 18, 2020

repokitteh-read-only bot removed the api label Nov 18, 2020

mattklein123 merged commit 6e5227e into envoyproxy:master Nov 18, 2020

andreyprezotto pushed a commit to andreyprezotto/envoy that referenced this pull request Nov 24, 2020

Added Fatal Action extension point. (envoyproxy#13676)

5dd7bd8

Signed-off-by: Kevin Baichoo <kbaichoo@google.com>

qqustc pushed a commit to qqustc/envoy that referenced this pull request Nov 24, 2020

Added Fatal Action extension point. (envoyproxy#13676)

53f0a19

Signed-off-by: Kevin Baichoo <kbaichoo@google.com> Signed-off-by: Qin Qin <qqin@google.com>

rgs1 mentioned this pull request Jan 4, 2021

access log: add support for command formatter extensions #14512

Merged
