Skip to content

[server] add unused ENVOY_BUG implementation#11503

Merged
mattklein123 merged 21 commits intoenvoyproxy:masterfrom
asraa:add-envoy-bug
Jun 28, 2020
Merged

[server] add unused ENVOY_BUG implementation#11503
mattklein123 merged 21 commits intoenvoyproxy:masterfrom
asraa:add-envoy-bug

Conversation

@asraa
Copy link
Contributor

@asraa asraa commented Jun 8, 2020

Commit Message: Adds an implementation for ENVOY_BUG macro that (in contrast with ASSERT) is compiled in release mode. If a failure is met in release mode, it is logged at a critical level and a stat is incremented with exponential back-off. In debug mode, this aborts like ASSERT.

This is meant to be used for performant conditions that shouldn't be met in normal circumstances. This allows users in production to monitor the logs and stats for ENVOY_BUG conditions in production.

Risk Level: Low
Testing: Added server test testing stats
Release notes: Added note about new server stat, update stats doc

Signed-off-by: Asra Ali asraa@google.com

asraa added 3 commits June 8, 2020 14:45
Signed-off-by: Asra Ali <asraa@google.com>
Signed-off-by: Asra Ali <asraa@google.com>
Signed-off-by: Asra Ali <asraa@google.com>
@asraa
Copy link
Contributor Author

asraa commented Jun 8, 2020

Signed-off-by: Asra Ali <asraa@google.com>
Signed-off-by: Asra Ali <asraa@google.com>
Copy link
Contributor

@antoniovicente antoniovicente left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good otherwise. Thanks so much for making these changes.

EXPECT_DEATH({ ENVOY_BUG(false); }, ".*envoy bug failure: 0.*");
EXPECT_DEATH({ ENVOY_BUG(false, ""); }, ".*envoy bug failure: 0.*");
EXPECT_DEATH({ ENVOY_BUG(false, "With some logs"); },
".*envoy bug failure: 0. Details: With some logs.*");
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

informational only, no action needed:

There's something interesting about ENVOY_BUG in debug modes that we may want to comment about here:
Exponential backoff doesn't seem to be triggering for some reason. Fortunately, that's a good thing: we want ENVOY_BUG to act like an ASSERT always in debug modes so we can more easily detect when it happens and ensure that our test cases do not become order dependent. I think this difference in behavior is related to the use of EXPECT_DEATH which does a fork; count is at 0 when each of the ENVOY_BUGs above execute.

Possible suggestion: it would be good to have macros that make it easier to test for code triggering ENVOY_BUG. I guess we can use EXPECT_DEBUG_DEATH until a more specialized macro exists.

Signed-off-by: Asra Ali <asraa@google.com>
antoniovicente
antoniovicente previously approved these changes Jun 9, 2020
@ggreenway
Copy link
Member

This sounds roughly equivalent to ENVOY_LOG_DEBUG_ASSERT_IN_RELEASE, which I added a long time ago, except with a different macro so that it doesn't apply to all ASSERTs. Can anything be shared between the two, or should ENVOY_LOG_DEBUG_ASSERT_IN_RELEASE be deprecated?

@asraa
Copy link
Contributor Author

asraa commented Jun 9, 2020

Can anything be shared between the two, or should ENVOY_LOG_DEBUG_ASSERT_IN_RELEASE be deprecated?

Nice question. For re-use, what about when ENVOY_LOG_DEBUG_ASSERT_IN_RELEASE is defined, all ASSERTs are treated as ENVOY_BUGs (log failure, increment counter)? Does that align with the use of ENVOY_LOG_DEBUG_ASSERT_IN_RELEASE? This would change ASSERT behavior, since ENVOY_BUG has exponential back-off behavior for counter increment and logging. But I think that makes sense for ENVOY_LOG_DEBUG_ASSERT_IN_RELEASE mode as well.

That would also unify the stats.

/cc @curiouserrandy

Signed-off-by: Asra Ali <asraa@google.com>
@antoniovicente
Copy link
Contributor

This sounds roughly equivalent to ENVOY_LOG_DEBUG_ASSERT_IN_RELEASE, which I added a long time ago, except with a different macro so that it doesn't apply to all ASSERTs. Can anything be shared between the two, or should ENVOY_LOG_DEBUG_ASSERT_IN_RELEASE be deprecated?

One issue with ENVOY_LOG_DEBUG_ASSERT_IN_RELEASE is the performance cost of enabling this behavior for all ASSERTs. The two modes satisfy slightly different needs.

Copy link
Member

@ggreenway ggreenway left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I like the idea of checking more conditions in a non-fatal way in release builds.

In my opinion, we should move every ASSERT that isn't in a performance critical path to this mechanism. Would it make sense to make ENVOY_BUG into ASSERT, and then have EXPENSIVE_ASSERT or similar for the cases we identify that are performance sensitive?

static bool shouldLogAndInvoke() {
++count_;
// Check if count_ is power of two by its bitwise representation.
if ((count_ & (count_ - 1)) == 0) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You should latch count_ after incrementing (I think there's a method on std::atomic to increment and return the value). Otherwise this could change between when you set it and where you check it here, and since you use it twice, it could be change between the two accesses.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good point... Something like this should work:

auto counter_value = ++count_;
if ((counter_value & (counter_value - 1)) == 0) {

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you take a look at this portion again? Is guarding counters_ by a mutex ok?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think use of a map and mutex for counters is risky even though we expect these events to be rare. You don't get the benefit of having the exponential backoff reduce the effective cost of operations inside a mutex to 0.

Should you consider having a static member defined inside the macro scope to hold the per-line counter?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Alternative that may provide a centralized list of counters that have been hit, provide a way to reset the counters and avoid mutex acquisition outside first hit:

  • Add a "AtomicCounter* GetCounter(file + line)" static method that gets or creates a counter by name under a mutex.
  • Under the body of "if (CONDITION) {" have "static AtomicCounter* counter = GetCounter(file, line);" to create the counter on first access.
  • Add a function to set all counters know to the map back to 0 between tests.

Copy link
Contributor Author

@asraa asraa Jun 11, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have a working version like above without static init for the counter, because resetting the counters via a map clearing method doesn't have an effect when i have static initialization like "static AtomicCounterPtr ptr_for_file_line".

Is it possible to avoid map lookup (the most expensive mutex locked operation) for each iterations while still allowing a working map reset? Former solved by statics, latter nullifies the statics solution.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

because resetting the counters via a map clearing method doesn't have an effect when i have static initialization like "static AtomicCounterPtr ptr_for_file_line".

Reset would need to iterate through the map and set all counters to zero, not erase anything in the map.

I agree with @ahedberg though that if we get to the point that we're contending on this mutex, we have big problems. I don't think having a static is worth it.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Use of a mutex and map makes me uncomfortable. Resetting counters to 0 between does not seem like a realistic goal.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd be ok with either (static, or map+mutex). I don't feel that strongly about it. I agree that the need for reset can be removed by adding a flag (used only by tests) to not do backoff and take the action on every failure of the condition.

If we do the static std::atomic, I'd like to keep the macro-generated code as small as possible, which can be done by passing the atomic as an argument to shouldLogAndInvokeEnvoyBugForEnvoyBugMacroUseOnly instead of the file/line number.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Alright -- the current is my map+mutex implementation, i moved the mutex lock to just over the map access.

I have a version fully implemented with static atomics and counter resets (this was easier than a flag, when dealing with parametrized tests). It does indeed pass the atomic to the shouldLogAndInvokeEnvoyBugForEnvoyBugMacroUseOnly. Should I push that just for review? Not sure.

@ggreenway
Copy link
Member

Can anything be shared between the two, or should ENVOY_LOG_DEBUG_ASSERT_IN_RELEASE be deprecated?

Nice question. For re-use, what about when ENVOY_LOG_DEBUG_ASSERT_IN_RELEASE is defined, all ASSERTs are treated as ENVOY_BUGs (log failure, increment counter)? Does that align with the use of ENVOY_LOG_DEBUG_ASSERT_IN_RELEASE?

Yeah, I think that would make sense. I think they could use the same stat; they're both indicating roughly the same thing.

This would change ASSERT behavior, since ENVOY_BUG has exponential back-off behavior for counter increment and logging. But I think that makes sense for ENVOY_LOG_DEBUG_ASSERT_IN_RELEASE mode as well.

Agreed.

@mattklein123
Copy link
Member

In my opinion, we should move every ASSERT that isn't in a performance critical path to this mechanism. Would it make sense to make ENVOY_BUG into ASSERT, and then have EXPENSIVE_ASSERT or similar for the cases we identify that are performance sensitive?

Theoretically this sounds nice, but practically I think this will be incredibly difficult to audit that we would feel comfortable with a bulk replace. I would recommend we proceed with this new thing and see how many places we actually end up wanting to use it.

asraa added 2 commits June 10, 2020 15:19
Signed-off-by: Asra Ali <asraa@google.com>
Signed-off-by: Asra Ali <asraa@google.com>
@asraa
Copy link
Contributor Author

asraa commented Jun 10, 2020

Changed to per-line back-off.

Regarding doing this for ASSERTs in ENVOY_LOG_DEBUG_ASSERT_IN_RELEASE mode as well. I had wanted to re-use stats counters for implementing the per-line counting, but stats depends on assert_lib. There's a mutex guarded flat hash map there now under the assumption we don't trigger ENVOY_BUG failures often.

Is it clear that we don't hit ASSERTs often in prod, enough that using this flat hash map is enough of a performance hit? If so, will unify the stats as mentioned before. @ggreenway

Theoretically this sounds nice, but practically I think this will be incredibly difficult to audit that we would feel comfortable with a bulk replace.

Agree. I assume it'll be incremental adoption.

Copy link
Member

@ggreenway ggreenway left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Theoretically this sounds nice, but practically I think this will be incredibly difficult to audit that we would feel comfortable with a bulk replace. I would recommend we proceed with this new thing and see how many places we actually end up wanting to use it.

Fair point.

I think use of a map and mutex for counters is risky even though we expect these events to be rare. You don't get the benefit of having the exponential backoff reduce the effective cost of operations inside a mutex to 0.

Should you consider having a static member defined inside the macro scope to hold the per-line counter?

Given that the mutex is only grabbed in the event that the condition fails, which should be exceedingly rare, I think it's fine from a performance point of view.

The external map+mutex approach also has the advantage that it's easy to reset between tests.

@antoniovicente
Copy link
Contributor

Theoretically this sounds nice, but practically I think this will be incredibly difficult to audit that we would feel comfortable with a bulk replace. I would recommend we proceed with this new thing and see how many places we actually end up wanting to use it.

Fair point.

I think use of a map and mutex for counters is risky even though we expect these events to be rare. You don't get the benefit of having the exponential backoff reduce the effective cost of operations inside a mutex to 0.
Should you consider having a static member defined inside the macro scope to hold the per-line counter?

Given that the mutex is only grabbed in the event that the condition fails, which should be exceedingly rare, I think it's fine from a performance point of view.

The external map+mutex approach also has the advantage that it's easy to reset between tests.

I worry that it will be exceedingly rare until it isn't. We are doing exponential backoff for a reason.

Regarding tests, our usual approach for having consistent behavior in tests that exercise this functionality is to have a mechanism to disable exponential backoff in tests. The fallback to abort behavior in debug modes doesn't do exponential backoff and thus works very nicely, but does require use of EXPECT_DEBUG_DEATH.

asraa added 2 commits June 11, 2020 13:38
Signed-off-by: Asra Ali <asraa@google.com>
Signed-off-by: Asra Ali <asraa@google.com>
@antoniovicente
Copy link
Contributor

I'm not sure how to proceed here. Greg/Ashley: Do you want to take over this review? I'm fine excusing myself from it.

}

static bool shouldLogAndInvoke(const char* filename, int line) {
const auto name = absl::StrCat(filename, ",", line);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

perf nit: I would probably pre-stringify the file+line in the macro, since I think the incremental code size is probably about 0, then you don't need to do a potential allocation here.

static bool shouldLogAndInvoke(const char* filename, int line) {
const auto name = absl::StrCat(filename, ",", line);

// Increment counter, inserting first if counter does not exist.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think I have a strong opinion on the map vs. inline static, but can you add more comments here on the trade-offs and why we chose this way? I assume it's to reduce code size and I guess per the discussion avoid cache misses on the inline atomic?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's not that we're trying to avoid cache misses on the atomic; if we aren't worried about cache misses in this part of Envoy, and if we don't have benchmarks proving that atomics are significantly more efficient, then we don't have a performance argument for choosing atomics over a higher-level, easier-to-understand construct like a mutex.

Quoting from an internal doc on the dangers of atomics that I've been meaning to open-source: There's a common assumption that mutexes are expensive, and that using atomic operations will be more efficient. But in reality, acquiring and releasing a mutex is cheaper than a cache miss; attention to cache behavior is usually a more fruitful way to improve performance.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Will do.
Quick question: The biggest different I can see is that this map+mutex impl holds the mutex for the entire map, not just the particular file/line counter.

Is the solution to have some kind of static object per file/line in the macro that holds a mutex per file/line and makes accesses to the map? Or is that starting to get too complicated.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

IMO, a per file/line mutex/object/whatever is more complicated than is necessary at the moment. If we start seeing contention on this mutex, we'd have a lot of ENVOY_BUGs firing at once, which is worth investigation on its own. But I don't think we need to prematurely optimize this.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added a comment about this.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would just add a comment explaining that this is not performance critical path and contention on this mutex would imply that something horribly went off the rails and caused ENVOY_BUG to fire often on multiple threads.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

Signed-off-by: Asra Ali <asraa@google.com>
ggreenway
ggreenway previously approved these changes Jun 16, 2020
@ggreenway
Copy link
Member

@asraa you have a merge conflict on the version history

asraa added 2 commits June 16, 2020 12:06
Signed-off-by: Asra Ali <asraa@google.com>
Signed-off-by: Asra Ali <asraa@google.com>
ggreenway
ggreenway previously approved these changes Jun 16, 2020
#define _ENVOY_BUG_IMPL(CONDITION, CONDITION_STR, ACTION, DETAILS) \
do { \
if (!(CONDITION) && Envoy::Assert::shouldLogAndInvokeEnvoyBugForEnvoyBugMacroUseOnly( \
__FILE__ ":" TOSTRING(__LINE__))) { \
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We have to make sure that windows build uses appropriate compiler flags, as by default FILE does not have full path, just the filename. @sunjayBhatia @wrowe

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this issue of the lack of uniqueness in the contents of FILE an argument to re-consider use of static atomics for these counters?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Wouldn't counters have the same issue, since they'd be initialized according to file/line

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the regular component/misc logging uses __FILE__ and shows the full path on Windows:

[2020-06-16 18:13:24.358][14656][debug][misc] [source/common/network/io_socket_error_impl.cc:30] Unknown error code 123 details...

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can't find specifically where we pass /FC to MSVC, but seems that it is happening

Copy link
Member

@mattklein123 mattklein123 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice this is great. Just a small comment about end-user documentation. Thank you!

/wait

hot_restart_generation, Gauge, Current hot restart generation -- like hot_restart_epoch but computed automatically by incrementing from parent.
initialization_time_ms, Histogram, Total time taken for Envoy initialization in milliseconds. This is the time from server start-up until the worker threads are ready to accept new connections
debug_assertion_failures, Counter, Number of debug assertion failures detected in a release build if compiled with `--define log_debug_assert_in_release=enabled` or zero otherwise
envoy_bug_failures, Counter, Number of envoy bug failures detected in a release build.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think this is going to mean anything to a normal user. Should we link somewhere that describes this in a bit more detail and what to do if this increments? Presumably open an issue as there is a serious issue, etc.?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done -- I also added some more doc string explanation in assert.h about its contrast with ASSERT.
I'd like to link that in, is that an appropriate place? Or link this PR?

Signed-off-by: Asra Ali <asraa@google.com>
Signed-off-by: Asra Ali <asraa@google.com>
Copy link
Member

@mattklein123 mattklein123 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM other than a typo (I think).

/wait


/**
* ENVOY_BUG must be called with two arguments for verbose logging.
* Indicate a efficient condition that should never be met in normal circumstances. In contrast
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
* Indicate a efficient condition that should never be met in normal circumstances. In contrast
* Indicate a failure condition that should never be met in normal circumstances. In contrast

?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

Signed-off-by: Asra Ali <asraa@google.com>
@stale
Copy link

stale bot commented Jun 26, 2020

This pull request has been automatically marked as stale because it has not had activity in the last 7 days. It will be closed in 7 days if no further activity occurs. Please feel free to give a status update now, ping for review, or re-open when it's ready. Thank you for your contributions!

@stale stale bot added the stale stalebot believes this issue/PR has not been touched recently label Jun 26, 2020
@asraa
Copy link
Contributor Author

asraa commented Jun 26, 2020

ping? I can merge master again if needed. vaguely afraid of triggering tsan failure though

@stale stale bot removed the stale stalebot believes this issue/PR has not been touched recently label Jun 26, 2020
@mattklein123 mattklein123 self-assigned this Jun 26, 2020
Signed-off-by: Asra Ali <asraa@google.com>
@mattklein123 mattklein123 merged commit 1c1ce18 into envoyproxy:master Jun 28, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

7 participants