[server] add unused ENVOY_BUG implementation#11503
[server] add unused ENVOY_BUG implementation#11503mattklein123 merged 21 commits intoenvoyproxy:masterfrom
Conversation
Signed-off-by: Asra Ali <asraa@google.com>
Signed-off-by: Asra Ali <asraa@google.com>
Signed-off-by: Asra Ali <asraa@google.com>
Signed-off-by: Asra Ali <asraa@google.com>
antoniovicente
left a comment
There was a problem hiding this comment.
Looks good otherwise. Thanks so much for making these changes.
test/common/common/assert_test.cc
Outdated
| EXPECT_DEATH({ ENVOY_BUG(false); }, ".*envoy bug failure: 0.*"); | ||
| EXPECT_DEATH({ ENVOY_BUG(false, ""); }, ".*envoy bug failure: 0.*"); | ||
| EXPECT_DEATH({ ENVOY_BUG(false, "With some logs"); }, | ||
| ".*envoy bug failure: 0. Details: With some logs.*"); |
There was a problem hiding this comment.
informational only, no action needed:
There's something interesting about ENVOY_BUG in debug modes that we may want to comment about here:
Exponential backoff doesn't seem to be triggering for some reason. Fortunately, that's a good thing: we want ENVOY_BUG to act like an ASSERT always in debug modes so we can more easily detect when it happens and ensure that our test cases do not become order dependent. I think this difference in behavior is related to the use of EXPECT_DEATH which does a fork; count is at 0 when each of the ENVOY_BUGs above execute.
Possible suggestion: it would be good to have macros that make it easier to test for code triggering ENVOY_BUG. I guess we can use EXPECT_DEBUG_DEATH until a more specialized macro exists.
Signed-off-by: Asra Ali <asraa@google.com>
|
This sounds roughly equivalent to |
Nice question. For re-use, what about when ENVOY_LOG_DEBUG_ASSERT_IN_RELEASE is defined, all ASSERTs are treated as ENVOY_BUGs (log failure, increment counter)? Does that align with the use of ENVOY_LOG_DEBUG_ASSERT_IN_RELEASE? This would change ASSERT behavior, since ENVOY_BUG has exponential back-off behavior for counter increment and logging. But I think that makes sense for ENVOY_LOG_DEBUG_ASSERT_IN_RELEASE mode as well. That would also unify the stats. /cc @curiouserrandy |
Signed-off-by: Asra Ali <asraa@google.com>
One issue with ENVOY_LOG_DEBUG_ASSERT_IN_RELEASE is the performance cost of enabling this behavior for all ASSERTs. The two modes satisfy slightly different needs. |
ggreenway
left a comment
There was a problem hiding this comment.
I like the idea of checking more conditions in a non-fatal way in release builds.
In my opinion, we should move every ASSERT that isn't in a performance critical path to this mechanism. Would it make sense to make ENVOY_BUG into ASSERT, and then have EXPENSIVE_ASSERT or similar for the cases we identify that are performance sensitive?
source/common/common/assert.cc
Outdated
| static bool shouldLogAndInvoke() { | ||
| ++count_; | ||
| // Check if count_ is power of two by its bitwise representation. | ||
| if ((count_ & (count_ - 1)) == 0) { |
There was a problem hiding this comment.
You should latch count_ after incrementing (I think there's a method on std::atomic to increment and return the value). Otherwise this could change between when you set it and where you check it here, and since you use it twice, it could be change between the two accesses.
There was a problem hiding this comment.
Good point... Something like this should work:
auto counter_value = ++count_;
if ((counter_value & (counter_value - 1)) == 0) {
There was a problem hiding this comment.
Could you take a look at this portion again? Is guarding counters_ by a mutex ok?
There was a problem hiding this comment.
I think use of a map and mutex for counters is risky even though we expect these events to be rare. You don't get the benefit of having the exponential backoff reduce the effective cost of operations inside a mutex to 0.
Should you consider having a static member defined inside the macro scope to hold the per-line counter?
There was a problem hiding this comment.
Alternative that may provide a centralized list of counters that have been hit, provide a way to reset the counters and avoid mutex acquisition outside first hit:
- Add a "AtomicCounter* GetCounter(file + line)" static method that gets or creates a counter by name under a mutex.
- Under the body of "if (CONDITION) {" have "static AtomicCounter* counter = GetCounter(file, line);" to create the counter on first access.
- Add a function to set all counters know to the map back to 0 between tests.
There was a problem hiding this comment.
I have a working version like above without static init for the counter, because resetting the counters via a map clearing method doesn't have an effect when i have static initialization like "static AtomicCounterPtr ptr_for_file_line".
Is it possible to avoid map lookup (the most expensive mutex locked operation) for each iterations while still allowing a working map reset? Former solved by statics, latter nullifies the statics solution.
There was a problem hiding this comment.
because resetting the counters via a map clearing method doesn't have an effect when i have static initialization like "static AtomicCounterPtr ptr_for_file_line".
Reset would need to iterate through the map and set all counters to zero, not erase anything in the map.
I agree with @ahedberg though that if we get to the point that we're contending on this mutex, we have big problems. I don't think having a static is worth it.
There was a problem hiding this comment.
Use of a mutex and map makes me uncomfortable. Resetting counters to 0 between does not seem like a realistic goal.
There was a problem hiding this comment.
I'd be ok with either (static, or map+mutex). I don't feel that strongly about it. I agree that the need for reset can be removed by adding a flag (used only by tests) to not do backoff and take the action on every failure of the condition.
If we do the static std::atomic, I'd like to keep the macro-generated code as small as possible, which can be done by passing the atomic as an argument to shouldLogAndInvokeEnvoyBugForEnvoyBugMacroUseOnly instead of the file/line number.
There was a problem hiding this comment.
Alright -- the current is my map+mutex implementation, i moved the mutex lock to just over the map access.
I have a version fully implemented with static atomics and counter resets (this was easier than a flag, when dealing with parametrized tests). It does indeed pass the atomic to the shouldLogAndInvokeEnvoyBugForEnvoyBugMacroUseOnly. Should I push that just for review? Not sure.
Yeah, I think that would make sense. I think they could use the same stat; they're both indicating roughly the same thing.
Agreed. |
Theoretically this sounds nice, but practically I think this will be incredibly difficult to audit that we would feel comfortable with a bulk replace. I would recommend we proceed with this new thing and see how many places we actually end up wanting to use it. |
Signed-off-by: Asra Ali <asraa@google.com>
Signed-off-by: Asra Ali <asraa@google.com>
|
Changed to per-line back-off. Regarding doing this for ASSERTs in ENVOY_LOG_DEBUG_ASSERT_IN_RELEASE mode as well. I had wanted to re-use stats counters for implementing the per-line counting, but stats depends on assert_lib. There's a mutex guarded flat hash map there now under the assumption we don't trigger ENVOY_BUG failures often. Is it clear that we don't hit ASSERTs often in prod, enough that using this flat hash map is enough of a performance hit? If so, will unify the stats as mentioned before. @ggreenway
Agree. I assume it'll be incremental adoption. |
ggreenway
left a comment
There was a problem hiding this comment.
Theoretically this sounds nice, but practically I think this will be incredibly difficult to audit that we would feel comfortable with a bulk replace. I would recommend we proceed with this new thing and see how many places we actually end up wanting to use it.
Fair point.
I think use of a map and mutex for counters is risky even though we expect these events to be rare. You don't get the benefit of having the exponential backoff reduce the effective cost of operations inside a mutex to 0.
Should you consider having a static member defined inside the macro scope to hold the per-line counter?
Given that the mutex is only grabbed in the event that the condition fails, which should be exceedingly rare, I think it's fine from a performance point of view.
The external map+mutex approach also has the advantage that it's easy to reset between tests.
I worry that it will be exceedingly rare until it isn't. We are doing exponential backoff for a reason. Regarding tests, our usual approach for having consistent behavior in tests that exercise this functionality is to have a mechanism to disable exponential backoff in tests. The fallback to abort behavior in debug modes doesn't do exponential backoff and thus works very nicely, but does require use of EXPECT_DEBUG_DEATH. |
Signed-off-by: Asra Ali <asraa@google.com>
|
I'm not sure how to proceed here. Greg/Ashley: Do you want to take over this review? I'm fine excusing myself from it. |
source/common/common/assert.cc
Outdated
| } | ||
|
|
||
| static bool shouldLogAndInvoke(const char* filename, int line) { | ||
| const auto name = absl::StrCat(filename, ",", line); |
There was a problem hiding this comment.
perf nit: I would probably pre-stringify the file+line in the macro, since I think the incremental code size is probably about 0, then you don't need to do a potential allocation here.
| static bool shouldLogAndInvoke(const char* filename, int line) { | ||
| const auto name = absl::StrCat(filename, ",", line); | ||
|
|
||
| // Increment counter, inserting first if counter does not exist. |
There was a problem hiding this comment.
I don't think I have a strong opinion on the map vs. inline static, but can you add more comments here on the trade-offs and why we chose this way? I assume it's to reduce code size and I guess per the discussion avoid cache misses on the inline atomic?
There was a problem hiding this comment.
It's not that we're trying to avoid cache misses on the atomic; if we aren't worried about cache misses in this part of Envoy, and if we don't have benchmarks proving that atomics are significantly more efficient, then we don't have a performance argument for choosing atomics over a higher-level, easier-to-understand construct like a mutex.
Quoting from an internal doc on the dangers of atomics that I've been meaning to open-source: There's a common assumption that mutexes are expensive, and that using atomic operations will be more efficient. But in reality, acquiring and releasing a mutex is cheaper than a cache miss; attention to cache behavior is usually a more fruitful way to improve performance.
There was a problem hiding this comment.
Will do.
Quick question: The biggest different I can see is that this map+mutex impl holds the mutex for the entire map, not just the particular file/line counter.
Is the solution to have some kind of static object per file/line in the macro that holds a mutex per file/line and makes accesses to the map? Or is that starting to get too complicated.
There was a problem hiding this comment.
IMO, a per file/line mutex/object/whatever is more complicated than is necessary at the moment. If we start seeing contention on this mutex, we'd have a lot of ENVOY_BUGs firing at once, which is worth investigation on its own. But I don't think we need to prematurely optimize this.
There was a problem hiding this comment.
Added a comment about this.
There was a problem hiding this comment.
I would just add a comment explaining that this is not performance critical path and contention on this mutex would imply that something horribly went off the rails and caused ENVOY_BUG to fire often on multiple threads.
Signed-off-by: Asra Ali <asraa@google.com>
|
@asraa you have a merge conflict on the version history |
Signed-off-by: Asra Ali <asraa@google.com>
| #define _ENVOY_BUG_IMPL(CONDITION, CONDITION_STR, ACTION, DETAILS) \ | ||
| do { \ | ||
| if (!(CONDITION) && Envoy::Assert::shouldLogAndInvokeEnvoyBugForEnvoyBugMacroUseOnly( \ | ||
| __FILE__ ":" TOSTRING(__LINE__))) { \ |
There was a problem hiding this comment.
We have to make sure that windows build uses appropriate compiler flags, as by default FILE does not have full path, just the filename. @sunjayBhatia @wrowe
There was a problem hiding this comment.
Is this issue of the lack of uniqueness in the contents of FILE an argument to re-consider use of static atomics for these counters?
There was a problem hiding this comment.
Wouldn't counters have the same issue, since they'd be initialized according to file/line
There was a problem hiding this comment.
the regular component/misc logging uses __FILE__ and shows the full path on Windows:
[2020-06-16 18:13:24.358][14656][debug][misc] [source/common/network/io_socket_error_impl.cc:30] Unknown error code 123 details...
There was a problem hiding this comment.
Can't find specifically where we pass /FC to MSVC, but seems that it is happening
mattklein123
left a comment
There was a problem hiding this comment.
Nice this is great. Just a small comment about end-user documentation. Thank you!
/wait
| hot_restart_generation, Gauge, Current hot restart generation -- like hot_restart_epoch but computed automatically by incrementing from parent. | ||
| initialization_time_ms, Histogram, Total time taken for Envoy initialization in milliseconds. This is the time from server start-up until the worker threads are ready to accept new connections | ||
| debug_assertion_failures, Counter, Number of debug assertion failures detected in a release build if compiled with `--define log_debug_assert_in_release=enabled` or zero otherwise | ||
| envoy_bug_failures, Counter, Number of envoy bug failures detected in a release build. |
There was a problem hiding this comment.
I don't think this is going to mean anything to a normal user. Should we link somewhere that describes this in a bit more detail and what to do if this increments? Presumably open an issue as there is a serious issue, etc.?
There was a problem hiding this comment.
done -- I also added some more doc string explanation in assert.h about its contrast with ASSERT.
I'd like to link that in, is that an appropriate place? Or link this PR?
Signed-off-by: Asra Ali <asraa@google.com>
mattklein123
left a comment
There was a problem hiding this comment.
LGTM other than a typo (I think).
/wait
source/common/common/assert.h
Outdated
|
|
||
| /** | ||
| * ENVOY_BUG must be called with two arguments for verbose logging. | ||
| * Indicate a efficient condition that should never be met in normal circumstances. In contrast |
There was a problem hiding this comment.
| * Indicate a efficient condition that should never be met in normal circumstances. In contrast | |
| * Indicate a failure condition that should never be met in normal circumstances. In contrast |
?
|
This pull request has been automatically marked as stale because it has not had activity in the last 7 days. It will be closed in 7 days if no further activity occurs. Please feel free to give a status update now, ping for review, or re-open when it's ready. Thank you for your contributions! |
|
ping? I can merge master again if needed. vaguely afraid of triggering tsan failure though |
Signed-off-by: Asra Ali <asraa@google.com>
Commit Message: Adds an implementation for ENVOY_BUG macro that (in contrast with ASSERT) is compiled in release mode. If a failure is met in release mode, it is logged at a critical level and a stat is incremented with exponential back-off. In debug mode, this aborts like ASSERT.
This is meant to be used for performant conditions that shouldn't be met in normal circumstances. This allows users in production to monitor the logs and stats for ENVOY_BUG conditions in production.
Risk Level: Low
Testing: Added server test testing stats
Release notes: Added note about new server stat, update stats doc
Signed-off-by: Asra Ali asraa@google.com