[server] add unused ENVOY_BUG implementation by asraa · Pull Request #11503 · envoyproxy/envoy

asraa · 2020-06-08T19:28:31Z

Commit Message: Adds an implementation for ENVOY_BUG macro that (in contrast with ASSERT) is compiled in release mode. If a failure is met in release mode, it is logged at a critical level and a stat is incremented with exponential back-off. In debug mode, this aborts like ASSERT.

This is meant for conditions that are cheap to check and that shouldn't be met in normal circumstances. It lets users monitor production logs and stats for ENVOY_BUG failures.

Risk Level: Low
Testing: Added server test testing stats
Release notes: Added note about new server stat, update stats doc

Signed-off-by: Asra Ali asraa@google.com
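
As an illustration of the behavior described in the commit message, here is a minimal sketch of the decision a failed check makes in each build mode. The names `BuildMode`, `BugAction`, and `decideBugAction` are hypothetical and are not part of Envoy; the power-of-two rule is taken from the review discussion of the back-off.

```cpp
#include <cstdint>

// Hypothetical names throughout; this is not Envoy's actual API.
enum class BuildMode { Debug, Release };
enum class BugAction { Abort, LogAndCount, Suppress };

// Decide what a failed ENVOY_BUG-style check should do.
// `failures_so_far` is the number of earlier failures of this same check.
inline BugAction decideBugAction(BuildMode mode, uint64_t failures_so_far) {
  if (mode == BuildMode::Debug) {
    return BugAction::Abort; // Debug builds behave like ASSERT.
  }
  const uint64_t n = failures_so_far + 1; // 1-based index of this failure.
  // A power of two has exactly one bit set, so n & (n - 1) is zero.
  return (n & (n - 1)) == 0 ? BugAction::LogAndCount : BugAction::Suppress;
}
```

Under these assumptions, a release build reports the 1st, 2nd, 4th, 8th, ... failure and stays quiet for the rest, while a debug build aborts on the first one.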

asraa · 2020-06-08T20:13:21Z

@curiouserrandy @antoniovicente

source/common/common/assert.h

test/common/common/assert_test.cc

source/common/common/assert.h

antoniovicente

Looks good otherwise. Thanks so much for making these changes.

source/common/common/assert.cc

antoniovicente · 2020-06-09T14:27:44Z

test/common/common/assert_test.cc

+  EXPECT_DEATH({ ENVOY_BUG(false); }, ".*envoy bug failure: 0.*");
+  EXPECT_DEATH({ ENVOY_BUG(false, ""); }, ".*envoy bug failure: 0.*");
+  EXPECT_DEATH({ ENVOY_BUG(false, "With some logs"); },
+               ".*envoy bug failure: 0. Details: With some logs.*");


informational only, no action needed:

There's something interesting about ENVOY_BUG in debug modes that we may want to comment about here:
Exponential backoff doesn't seem to be triggering for some reason. Fortunately, that's a good thing: we want ENVOY_BUG to always act like an ASSERT in debug modes so we can more easily detect when it happens and ensure that our test cases do not become order dependent. I think this difference in behavior is related to the use of EXPECT_DEATH, which does a fork; count is at 0 when each of the ENVOY_BUGs above executes.

Possible suggestion: it would be good to have macros that make it easier to test for code triggering ENVOY_BUG. I guess we can use EXPECT_DEBUG_DEATH until a more specialized macro exists.

source/server/server.h

test/server/server_test.cc

ggreenway · 2020-06-09T19:44:21Z

This sounds roughly equivalent to ENVOY_LOG_DEBUG_ASSERT_IN_RELEASE, which I added a long time ago, except with a different macro so that it doesn't apply to all ASSERTs. Can anything be shared between the two, or should ENVOY_LOG_DEBUG_ASSERT_IN_RELEASE be deprecated?

asraa · 2020-06-09T20:12:55Z

> Can anything be shared between the two, or should ENVOY_LOG_DEBUG_ASSERT_IN_RELEASE be deprecated?

Nice question. For re-use, what about when ENVOY_LOG_DEBUG_ASSERT_IN_RELEASE is defined, all ASSERTs are treated as ENVOY_BUGs (log failure, increment counter)? Does that align with the use of ENVOY_LOG_DEBUG_ASSERT_IN_RELEASE? This would change ASSERT behavior, since ENVOY_BUG has exponential back-off behavior for counter increment and logging. But I think that makes sense for ENVOY_LOG_DEBUG_ASSERT_IN_RELEASE mode as well.

That would also unify the stats.

/cc @curiouserrandy

antoniovicente · 2020-06-09T20:19:58Z

> This sounds roughly equivalent to ENVOY_LOG_DEBUG_ASSERT_IN_RELEASE, which I added a long time ago, except with a different macro so that it doesn't apply to all ASSERTs. Can anything be shared between the two, or should ENVOY_LOG_DEBUG_ASSERT_IN_RELEASE be deprecated?

One issue with ENVOY_LOG_DEBUG_ASSERT_IN_RELEASE is the performance cost of enabling this behavior for all ASSERTs. The two modes satisfy slightly different needs.

source/common/common/assert.cc

ggreenway

I like the idea of checking more conditions in a non-fatal way in release builds.

In my opinion, we should move every ASSERT that isn't in a performance critical path to this mechanism. Would it make sense to make ENVOY_BUG into ASSERT, and then have EXPENSIVE_ASSERT or similar for the cases we identify that are performance sensitive?

ggreenway · 2020-06-09T20:57:08Z

source/common/common/assert.cc

+  static bool shouldLogAndInvoke() {
+    ++count_;
+    // Check if count_ is power of two by its bitwise representation.
+    if ((count_ & (count_ - 1)) == 0) {


You should latch count_ after incrementing (I think there's a method on std::atomic to increment and return the value). Otherwise this could change between when you set it and where you check it here, and since you use it twice, it could change between the two accesses.

Good point... Something like this should work:

auto counter_value = ++count_;
if ((counter_value & (counter_value - 1)) == 0) {
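
To make the suggestion concrete, here is a minimal self-contained sketch of the latched check. It is written as a free function taking the counter by reference, which is an assumption for illustration; the PR's version is a static member operating on `count_`.

```cpp
#include <atomic>
#include <cstdint>

// Hypothetical free-function form of the check discussed above.
inline bool shouldLogAndInvoke(std::atomic<uint64_t>& count) {
  // Pre-increment on std::atomic returns the incremented value, so the
  // counter is read exactly once and cannot change between the two uses.
  const uint64_t counter_value = ++count;
  // True when counter_value is a power of two: 1, 2, 4, 8, ...
  return (counter_value & (counter_value - 1)) == 0;
}
```

Because the local copy is used for both operands of the bitwise test, a concurrent increment from another thread cannot make the two reads disagree.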

Could you take a look at this portion again? Is guarding counters_ by a mutex ok?

I think use of a map and mutex for counters is risky even though we expect these events to be rare. You don't get the benefit of having the exponential backoff reduce the effective cost of operations inside a mutex to 0.

Should you consider having a static member defined inside the macro scope to hold the per-line counter?

Alternative that may provide a centralized list of counters that have been hit, provide a way to reset the counters and avoid mutex acquisition outside first hit:

Add a "AtomicCounter* GetCounter(file + line)" static method that gets or creates a counter by name under a mutex.

Under the body of "if (CONDITION) {" have "static AtomicCounter* counter = GetCounter(file, line);" to create the counter on first access.

Add a function to set all counters known to the map back to 0 between tests.
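
The three steps above can be sketched roughly as follows. `CounterRegistry`, `getCounter`, and `resetAll` are hypothetical names, and `std::unordered_map` stands in for whatever map type the real code would use. The reset zeroes counters in place instead of erasing entries, so a pointer cached in a function-local static at a call site stays valid.

```cpp
#include <atomic>
#include <cstdint>
#include <memory>
#include <mutex>
#include <string>
#include <unordered_map>

using AtomicCounter = std::atomic<uint64_t>;

// Hypothetical registry of per-call-site counters.
class CounterRegistry {
public:
  // Gets or creates the counter for a call site, under the mutex. A call site
  // would cache the result: static AtomicCounter* c = getCounter("file:line");
  static AtomicCounter* getCounter(const std::string& file_and_line) {
    std::lock_guard<std::mutex> lock(mutex());
    std::unique_ptr<AtomicCounter>& slot = counters()[file_and_line];
    if (slot == nullptr) {
      slot.reset(new AtomicCounter(0));
    }
    return slot.get();
  }

  // Sets every known counter back to 0 without erasing any entry.
  static void resetAll() {
    std::lock_guard<std::mutex> lock(mutex());
    for (auto& entry : counters()) {
      entry.second->store(0);
    }
  }

private:
  static std::mutex& mutex() {
    static std::mutex m;
    return m;
  }
  static std::unordered_map<std::string, std::unique_ptr<AtomicCounter>>& counters() {
    static std::unordered_map<std::string, std::unique_ptr<AtomicCounter>> map;
    return map;
  }
};
```

With this shape the mutex is taken only on the first hit of each call site and during a reset; later hits touch just the cached atomic.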

I have a working version like the above but without static init for the counter, because resetting the counters via a map-clearing method has no effect when I have static initialization like "static AtomicCounterPtr ptr_for_file_line".

Is it possible to avoid the map lookup (the most expensive operation under the mutex) on each iteration while still allowing a working map reset? The former is solved by statics; the latter nullifies the statics solution.

> because resetting the counters via a map clearing method doesn't have an effect when I have static initialization like "static AtomicCounterPtr ptr_for_file_line".

Reset would need to iterate through the map and set all counters to zero, not erase anything in the map.

I agree with @ahedberg though that if we get to the point that we're contending on this mutex, we have big problems. I don't think having a static is worth it.

Use of a mutex and map makes me uncomfortable. Resetting counters to 0 between tests does not seem like a realistic goal.

I'd be ok with either (static, or map+mutex). I don't feel that strongly about it. I agree that the need for reset can be removed by adding a flag (used only by tests) to not do backoff and take the action on every failure of the condition.

If we do the static std::atomic, I'd like to keep the macro-generated code as small as possible, which can be done by passing the atomic as an argument to shouldLogAndInvokeEnvoyBugForEnvoyBugMacroUseOnly instead of the file/line number.
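
A rough sketch of that shape, assuming hypothetical names (`SKETCH_ENVOY_BUG`, `sketchShouldLogAndInvoke`, `sketchReportCount`) and a plain counter in place of real logging and stats:

```cpp
#include <atomic>
#include <cstdint>

// Hypothetical out-of-line helper; it receives the call site's counter
// instead of a file/line key, so no lookup is needed.
inline bool sketchShouldLogAndInvoke(std::atomic<uint64_t>& counter) {
  const uint64_t value = ++counter;  // Latch the incremented value once.
  return (value & (value - 1)) == 0; // True on hits 1, 2, 4, 8, ...
}

// Counts how often the sketch "reported"; a real implementation would log
// and increment a stat here instead.
inline int& sketchReportCount() {
  static int reports = 0;
  return reports;
}

// Each expansion declares its own function-local static counter and hands it
// to the helper, which keeps the macro-generated code small.
#define SKETCH_ENVOY_BUG(CONDITION) \
  do { \
    static std::atomic<uint64_t> sketch_hits{0}; \
    if (!(CONDITION) && sketchShouldLogAndInvoke(sketch_hits)) { \
      ++sketchReportCount(); \
    } \
  } while (false)

// Example call site with one counter of its own.
inline void checkNonNegative(int value) { SKETCH_ENVOY_BUG(value >= 0); }
```

Five failing calls to `checkNonNegative` report on the 1st, 2nd, and 4th failure only. The trade-off raised in the thread still applies: counters held this way are not reachable from a central reset unless they are also registered somewhere.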

Alright, the current version is my map+mutex implementation; I moved the mutex lock to cover just the map access.

I have a version fully implemented with static atomics and counter resets (this was easier than a flag, when dealing with parametrized tests). It does indeed pass the atomic to the shouldLogAndInvokeEnvoyBugForEnvoyBugMacroUseOnly. Should I push that just for review? Not sure.

ggreenway · 2020-06-09T21:07:46Z

> > Can anything be shared between the two, or should ENVOY_LOG_DEBUG_ASSERT_IN_RELEASE be deprecated?

> Nice question. For re-use, what about when ENVOY_LOG_DEBUG_ASSERT_IN_RELEASE is defined, all ASSERTs are treated as ENVOY_BUGs (log failure, increment counter)? Does that align with the use of ENVOY_LOG_DEBUG_ASSERT_IN_RELEASE?

Yeah, I think that would make sense. I think they could use the same stat; they're both indicating roughly the same thing.

> This would change ASSERT behavior, since ENVOY_BUG has exponential back-off behavior for counter increment and logging. But I think that makes sense for ENVOY_LOG_DEBUG_ASSERT_IN_RELEASE mode as well.

Agreed.

mattklein123 · 2020-06-09T23:25:21Z

> In my opinion, we should move every ASSERT that isn't in a performance critical path to this mechanism. Would it make sense to make ENVOY_BUG into ASSERT, and then have EXPENSIVE_ASSERT or similar for the cases we identify that are performance sensitive?

Theoretically this sounds nice, but practically I think this will be incredibly difficult to audit that we would feel comfortable with a bulk replace. I would recommend we proceed with this new thing and see how many places we actually end up wanting to use it.

asraa · 2020-06-10T19:33:18Z

Changed to per-line back-off.

Regarding doing this for ASSERTs in ENVOY_LOG_DEBUG_ASSERT_IN_RELEASE mode as well: I had wanted to re-use stats counters for implementing the per-line counting, but stats depends on assert_lib. There's a mutex-guarded flat hash map there now, under the assumption that we don't trigger ENVOY_BUG failures often.

Is it clear that we hit ASSERTs rarely enough in prod that using this flat hash map would not be a noticeable performance hit? If so, I will unify the stats as mentioned before. @ggreenway

> Theoretically this sounds nice, but practically I think this will be incredibly difficult to audit that we would feel comfortable with a bulk replace.

Agree. I assume it'll be incremental adoption.

ggreenway

> Theoretically this sounds nice, but practically I think this will be incredibly difficult to audit that we would feel comfortable with a bulk replace. I would recommend we proceed with this new thing and see how many places we actually end up wanting to use it.

Fair point.

> I think use of a map and mutex for counters is risky even though we expect these events to be rare. You don't get the benefit of having the exponential backoff reduce the effective cost of operations inside a mutex to 0.

> Should you consider having a static member defined inside the macro scope to hold the per-line counter?

Given that the mutex is only grabbed in the event that the condition fails, which should be exceedingly rare, I think it's fine from a performance point of view.

The external map+mutex approach also has the advantage that it's easy to reset between tests.

source/common/common/assert.cc

antoniovicente · 2020-06-10T20:21:02Z

> > Theoretically this sounds nice, but practically I think this will be incredibly difficult to audit that we would feel comfortable with a bulk replace. I would recommend we proceed with this new thing and see how many places we actually end up wanting to use it.

> Fair point.

> > I think use of a map and mutex for counters is risky even though we expect these events to be rare. You don't get the benefit of having the exponential backoff reduce the effective cost of operations inside a mutex to 0.
> > Should you consider having a static member defined inside the macro scope to hold the per-line counter?

> Given that the mutex is only grabbed in the event that the condition fails, which should be exceedingly rare, I think it's fine from a performance point of view.

> The external map+mutex approach also has the advantage that it's easy to reset between tests.

I worry that it will be exceedingly rare until it isn't. We are doing exponential backoff for a reason.

Regarding tests, our usual approach for having consistent behavior in tests that exercise this functionality is to have a mechanism to disable exponential backoff in tests. The fallback to abort behavior in debug modes doesn't do exponential backoff and thus works very nicely, but does require use of EXPECT_DEBUG_DEATH.

antoniovicente · 2020-06-12T20:08:57Z

I'm not sure how to proceed here. Greg/Ashley: Do you want to take over this review? I'm fine excusing myself from it.

mattklein123 · 2020-06-12T21:29:15Z

source/common/common/assert.cc

+  }
+
+  static bool shouldLogAndInvoke(const char* filename, int line) {
+    const auto name = absl::StrCat(filename, ",", line);


perf nit: I would probably pre-stringify the file+line in the macro, since I think the incremental code size is probably about 0, then you don't need to do a potential allocation here.
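
For illustration, pre-stringifying can be done with the usual two-level stringification idiom, so the key is a single string literal built at compile time. The `SKETCH_*` macro names, the `:` separator, and the two helper functions are assumptions made for this sketch.

```cpp
#include <cstdlib>
#include <cstring>

#define SKETCH_STRINGIFY_INNER(X) #X
// Two levels are needed so that __LINE__ is expanded to its number before
// the # operator turns it into a string.
#define SKETCH_TOSTRING(X) SKETCH_STRINGIFY_INNER(X)
// Adjacent string literals are concatenated by the compiler, so this is one
// constant string and needs no runtime allocation.
#define SKETCH_FILE_AND_LINE __FILE__ ":" SKETCH_TOSTRING(__LINE__)

// Returns the key for the line this function is defined on.
inline const char* callSiteKey() { return SKETCH_FILE_AND_LINE; }

// Parses the line number back out of a "file:line" key, or -1 if absent.
inline int lineFromKey(const char* key) {
  const char* colon = std::strrchr(key, ':');
  return colon == nullptr ? -1 : std::atoi(colon + 1);
}
```

Passing such a literal to the helper replaces the `absl::StrCat` call in the function body, at the cost of one constant string per call site.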

mattklein123 · 2020-06-12T21:33:47Z

source/common/common/assert.cc

+  static bool shouldLogAndInvoke(const char* filename, int line) {
+    const auto name = absl::StrCat(filename, ",", line);
+
+    // Increment counter, inserting first if counter does not exist.


I don't think I have a strong opinion on the map vs. inline static, but can you add more comments here on the trade-offs and why we chose this way? I assume it's to reduce code size and I guess per the discussion avoid cache misses on the inline atomic?

It's not that we're trying to avoid cache misses on the atomic; if we aren't worried about cache misses in this part of Envoy, and if we don't have benchmarks proving that atomics are significantly more efficient, then we don't have a performance argument for choosing atomics over a higher-level, easier-to-understand construct like a mutex.

Quoting from an internal doc on the dangers of atomics that I've been meaning to open-source: There's a common assumption that mutexes are expensive, and that using atomic operations will be more efficient. But in reality, acquiring and releasing a mutex is cheaper than a cache miss; attention to cache behavior is usually a more fruitful way to improve performance.

Will do.
Quick question: the biggest difference I can see is that this map+mutex impl holds the mutex for the entire map, not just the particular file/line counter.

Is the solution to have some kind of static object per file/line in the macro that holds a mutex per file/line and makes accesses to the map? Or is that starting to get too complicated?

IMO, a per file/line mutex/object/whatever is more complicated than is necessary at the moment. If we start seeing contention on this mutex, we'd have a lot of ENVOY_BUGs firing at once, which is worth investigation on its own. But I don't think we need to prematurely optimize this.

Added a comment about this.

I would just add a comment explaining that this is not performance critical path and contention on this mutex would imply that something horribly went off the rails and caused ENVOY_BUG to fire often on multiple threads.
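
As a point of reference for this discussion, a minimal sketch of the map+mutex approach might look like the following. `BugHitRegistry` is a hypothetical name and `std::unordered_map` is used here in place of the flat hash map mentioned earlier; the single mutex covers only the map access.

```cpp
#include <cstdint>
#include <mutex>
#include <string>
#include <unordered_map>

// Hypothetical registry keyed by a "file:line" string.
class BugHitRegistry {
public:
  // Returns true when this call site's hit count reaches a power of two.
  static bool shouldLogAndInvoke(const char* file_and_line) {
    uint64_t hits = 0;
    {
      // One mutex for the whole map. This is not a performance critical
      // path: contention here would mean checks are failing often on
      // several threads at once, which is worth investigating on its own.
      std::lock_guard<std::mutex> lock(mutex());
      hits = ++counters()[file_and_line]; // Inserts 0 first if missing.
    }
    return (hits & (hits - 1)) == 0;
  }

private:
  static std::mutex& mutex() {
    static std::mutex m;
    return m;
  }
  static std::unordered_map<std::string, uint64_t>& counters() {
    static std::unordered_map<std::string, uint64_t> map;
    return map;
  }
};
```

Each key backs off independently, and the lock is only taken after a condition has already failed.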

ggreenway · 2020-06-16T15:58:49Z

@asraa you have a merge conflict on the version history

yanavlasov · 2020-06-16T16:33:21Z

source/common/common/assert.h

+#define _ENVOY_BUG_IMPL(CONDITION, CONDITION_STR, ACTION, DETAILS)                                 \
+  do {                                                                                             \
+    if (!(CONDITION) && Envoy::Assert::shouldLogAndInvokeEnvoyBugForEnvoyBugMacroUseOnly(          \
+                            __FILE__ ":" TOSTRING(__LINE__))) {                                    \


We have to make sure that the Windows build uses appropriate compiler flags, as by default __FILE__ does not have the full path, just the filename. @sunjayBhatia @wrowe

Is this lack of uniqueness in the contents of __FILE__ an argument to reconsider use of static atomics for these counters?

Wouldn't counters have the same issue, since they'd be initialized according to file/line?

the regular component/misc logging uses __FILE__ and shows the full path on Windows:

[2020-06-16 18:13:24.358][14656][debug][misc] [source/common/network/io_socket_error_impl.cc:30] Unknown error code 123 details...

Can't find specifically where we pass /FC to MSVC, but seems that it is happening

mattklein123

Nice this is great. Just a small comment about end-user documentation. Thank you!

/wait

mattklein123 · 2020-06-17T04:10:44Z

docs/root/configuration/observability/statistics.rst

  hot_restart_generation, Gauge, Current hot restart generation -- like hot_restart_epoch but computed automatically by incrementing from parent.
  initialization_time_ms, Histogram, Total time taken for Envoy initialization in milliseconds. This is the time from server start-up until the worker threads are ready to accept new connections
  debug_assertion_failures, Counter, Number of debug assertion failures detected in a release build if compiled with `--define log_debug_assert_in_release=enabled` or zero otherwise
+  envoy_bug_failures, Counter, Number of envoy bug failures detected in a release build.


I don't think this is going to mean anything to a normal user. Should we link somewhere that describes this in a bit more detail and what to do if this increments? Presumably open an issue as there is a serious issue, etc.?

done -- I also added some more doc string explanation in assert.h about its contrast with ASSERT.
I'd like to link that in, is that an appropriate place? Or link this PR?

mattklein123

LGTM other than a typo (I think).

/wait

mattklein123 · 2020-06-17T21:49:45Z

source/common/common/assert.h


 /**
- * ENVOY_BUG must be called with two arguments for verbose logging.
+ * Indicate a efficient condition that should never be met in normal circumstances. In contrast


Suggested change

- * Indicate a efficient condition that should never be met in normal circumstances. In contrast
+ * Indicate a failure condition that should never be met in normal circumstances. In contrast

?

stale · 2020-06-26T12:50:31Z

This pull request has been automatically marked as stale because it has not had activity in the last 7 days. It will be closed in 7 days if no further activity occurs. Please feel free to give a status update now, ping for review, or re-open when it's ready. Thank you for your contributions!

asraa · 2020-06-26T12:58:22Z

ping? I can merge master again if needed. vaguely afraid of triggering tsan failure though

asraa added 3 commits June 8, 2020 14:45

add unused envoy bug implementation (ff7ca4f)

add more tests (798bb0e)

fix (8f9aba6)

mattklein123 reviewed Jun 8, 2020

source/common/common/assert.h

antoniovicente reviewed Jun 8, 2020

test/common/common/assert_test.cc

source/common/common/assert.h

only log on power of two (1ed929d)

mattklein123 assigned antoniovicente Jun 8, 2020

only verbose ENVOY_BUG (e3b47af)

antoniovicente reviewed Jun 9, 2020

address comments, docs (c5ab825)

antoniovicente previously approved these changes Jun 9, 2020

fix compile time and clang tidy (00e851e)

asraa dismissed antoniovicente’s stale review via 00e851e June 9, 2020 20:18

antoniovicente reviewed Jun 9, 2020

source/common/common/assert.cc

ggreenway requested changes Jun 9, 2020

asraa added 2 commits June 10, 2020 15:19

implement exponential back-off per line (26cbe25)

use member mutex (1adce9f)

ggreenway requested changes Jun 10, 2020

source/common/common/assert.cc

asraa added 2 commits June 11, 2020 13:38

just mutex lock the map access and address greg's comment (ce70b00)

fix ci (aacf835)

mattklein123 reviewed Jun 12, 2020

address comments (9f036bb)

asraa dismissed ahedberg’s stale review via 9f036bb June 16, 2020 15:55

ggreenway previously approved these changes Jun 16, 2020

asraa added 2 commits June 16, 2020 12:06

Merge remote-tracking branch 'upstream/master' into add-envoy-bug (4b2b808)

fix docs (6ba5df6)

asraa dismissed ggreenway’s stale review via 6ba5df6 June 16, 2020 16:12

ggreenway previously approved these changes Jun 16, 2020

yanavlasov reviewed Jun 16, 2020

mattklein123 requested changes Jun 17, 2020

repokitteh-read-only bot added the waiting label Jun 17, 2020

add comments (82afac3)

asraa dismissed ggreenway’s stale review via 82afac3 June 17, 2020 18:14

repokitteh-read-only bot removed the waiting label Jun 17, 2020

fix comma (2b1e871)

mattklein123 requested changes Jun 17, 2020

repokitteh-read-only bot added the waiting label Jun 17, 2020

fixup (3cdbc57)

repokitteh-read-only bot removed the waiting label Jun 18, 2020

stale bot added the stale label Jun 26, 2020

stale bot removed the stale label Jun 26, 2020

mattklein123 self-assigned this Jun 26, 2020

ggreenway approved these changes Jun 26, 2020

mattklein123 approved these changes Jun 26, 2020

Merge remote-tracking branch 'upstream/master' into add-envoy-bug (2d2f5ba)

mattklein123 merged commit 1c1ce18 into envoyproxy:master Jun 28, 2020