admin: roll back stats_handler.cc et al back to previous state while we sort out some sporadic crashes #20835
Merged
mattklein123 merged 6 commits into envoyproxy:main on Apr 15, 2022
…me issues that were discovered Signed-off-by: Joshua Marantz <jmarantz@google.com>
Signed-off-by: Joshua Marantz <jmarantz@google.com>
Signed-off-by: Joshua Marantz <jmarantz@google.com>
Signed-off-by: Joshua Marantz <jmarantz@google.com>
vehre-x41 pushed a commit to vehre-x41/envoy that referenced this pull request on Apr 19, 2022
…we sort out some sporadic crashes (envoyproxy#20835) Signed-off-by: Joshua Marantz <jmarantz@google.com> Signed-off-by: Andre Vehreschild <vehre@x41-dsec.de>
jmarantz added a commit that referenced this pull request on Apr 19, 2022
Commit Message: Fix Stats::Scope destruct/iterate race by holding onto a weak_ptr<Scope> in the scopes_ hash-table in ThreadLocalStore. Also adds GUARDED_BY thread annotation to the scopes_ hash table and refactors a bit to ensure thread safety across all accesses.

The thread-safety analysis needs more-than-usual annotation assistance for two reasons:
* the analysis system does not see that `ThreadLocalStoreImpl::lock_` and `ThreadLocalStoreImpl::ScopeImpl::parent_.lock_` are the same.
* in safeMakeStat call-sites, for code-sharing reasons, we need to take a reference to the guarded `central_cache_` entry before we decide whether we need to take the lock that protects it, so we need to disable analysis in that case. This way we can share the code that finds stats in the TLS-cache without taking locks.

A couple of helper methods, `centralCacheLockHeld()` and `centralCacheNoThreadAnalysis()`, were added to allow analysis to run with minimally scoped annotations.

A testcase was added which duplicates the race between looping over the stats for admin, and creating/destroying scopes, using the fast /stats implementation that was disconnected in prod in #20835. This PR leaves the fast implementation disconnected, but fixes it. A separate PR will roll back #20835 after this lands.

Additional Description: The repro can spot the race by reverting the definition of `ThreadLocalStoreImpl::forEachScope` to its prior state, taking into account that `scopes_` is now a `map<ScopeImpl*, weak_ptr<ScopeImpl>>` rather than a `set<ScopeImpl>`.
```
for (auto iter : scopes_) {
  f_scope(*(iter.first));
}
```
Then the test/server/admin:stats_handler_test test will fail with
```
[ RUN      ] ThreadedTest.Threaded
terminate called after throwing an instance of 'std::bad_weak_ptr'
  what():  bad_weak_ptr
```
Risk Level: low -- scope iteration is being fixed here, but that doesn't happen in production yet.
Testing: //test/...
Docs Changes: n/a
Release Notes: n/a
Platform Specific Features: n/a
Signed-off-by: Joshua Marantz <jmarantz@google.com>
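For illustration, the weak_ptr approach described in this commit message might look roughly like the sketch below. This is a minimal, self-contained sketch and not Envoy's implementation: `Scope`, `ScopeRegistry`, `add`, `remove`, and `visitedIds` are hypothetical names, and a plain `std::mutex` stands in for the lock that guards `scopes_`. The point it shows is that iteration promotes each `weak_ptr` to a `shared_ptr` while holding the lock and skips entries whose scope is already gone, instead of dereferencing the raw pointer key.

```cpp
#include <cassert>
#include <functional>
#include <memory>
#include <mutex>
#include <unordered_map>
#include <utility>
#include <vector>

// Hypothetical stand-in for a stats scope; not Envoy's ScopeImpl.
struct Scope {
  explicit Scope(int id) : id(id) {}
  int id;
};

// Hypothetical registry that tracks scopes without owning them.
class ScopeRegistry {
public:
  void add(const std::shared_ptr<Scope>& scope) {
    assert(scope != nullptr);
    std::lock_guard<std::mutex> guard(lock_);
    scopes_[scope.get()] = scope;  // stored as weak_ptr, so no ownership taken
  }

  void remove(Scope* scope) {
    std::lock_guard<std::mutex> guard(lock_);
    scopes_.erase(scope);
  }

  void forEachScope(const std::function<void(const Scope&)>& fn) const {
    std::vector<std::shared_ptr<Scope>> live;
    {
      std::lock_guard<std::mutex> guard(lock_);
      live.reserve(scopes_.size());
      for (const auto& entry : scopes_) {
        // lock() yields an empty shared_ptr if the scope was already destroyed;
        // the raw pointer key (entry.first) is never dereferenced.
        if (std::shared_ptr<Scope> strong = entry.second.lock()) {
          live.push_back(std::move(strong));
        }
      }
    }
    // The strong references keep every visited scope alive during the callback.
    for (const auto& scope : live) {
      fn(*scope);
    }
  }

private:
  mutable std::mutex lock_;
  std::unordered_map<Scope*, std::weak_ptr<Scope>> scopes_;
};

// Collects the ids of all scopes that are still alive.
std::vector<int> visitedIds(const ScopeRegistry& registry) {
  std::vector<int> ids;
  registry.forEachScope([&ids](const Scope& scope) { ids.push_back(scope.id); });
  return ids;
}
```

Running the callbacks after releasing the lock is a choice made in this sketch so that caller code does not execute while the registry lock is held.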
jmarantz added a commit that referenced this pull request on Apr 20, 2022
Commit Message: Rolls back the rollback PR #20835, re-enabling fast admin stats, now that #20855 has landed.

Additional Description: This brings the performance back to this state:
```
------------------------------------------------------------------
Benchmark                        Time             CPU   Iterations
------------------------------------------------------------------
BM_AllCountersText             467 ms          466 ms            2
BM_UsedCountersText           37.0 ms         37.0 ms           19
BM_FilteredCountersText       1793 ms         1792 ms            1
BM_AllCountersJson             504 ms          504 ms            1
BM_UsedCountersJson           37.2 ms         37.2 ms           19
BM_FilteredCountersJson       1839 ms         1839 ms            1
```
So: around half a second of CPU burst for 1M json & text stats, rather than 1.8 seconds for text and 4.7 seconds for json. We still have a std::regex bottleneck when a filter is specified.

Risk Level: medium -- this re-enables calling of code that previously had races, though #20855 reproduced and fixed them.
Testing: //test/...
Docs Changes: n/a
Release Notes: n/a
Platform Specific Features: n/a
Signed-off-by: Joshua Marantz <jmarantz@google.com>
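To make the remark about the filter path concrete, here is a minimal sketch of per-name regex filtering. It is an assumption about the general shape of such a filter, not the code in stats_handler.cc: `filterStatNames` is a hypothetical helper, and the stat names are made up. Each name costs one `std::regex_search` evaluation, so the work scales with the total number of stats even when few of them match.

```cpp
#include <regex>
#include <string>
#include <vector>

// Hypothetical helper: returns the names that match the filter.
// The regex is compiled once by the caller and reused for every name.
std::vector<std::string> filterStatNames(const std::vector<std::string>& names,
                                         const std::regex& filter) {
  std::vector<std::string> matched;
  for (const auto& name : names) {
    if (std::regex_search(name, filter)) {
      matched.push_back(name);
    }
  }
  return matched;
}
```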
ravenblackx pushed a commit to ravenblackx/envoy that referenced this pull request on Jun 8, 2022
…we sort out some sporadic crashes (envoyproxy#20835) Signed-off-by: Joshua Marantz <jmarantz@google.com>
ravenblackx pushed a commit to ravenblackx/envoy that referenced this pull request on Jun 8, 2022
ravenblackx pushed a commit to ravenblackx/envoy that referenced this pull request on Jun 8, 2022
oschaaf pushed a commit to maistra/envoy that referenced this pull request on Oct 26, 2022
oschaaf pushed a commit to maistra/envoy that referenced this pull request on Oct 26, 2022
Commit Message: Rolls back the operational part of #19693, which has seen some sporadic crashes in some environments. This does not fully roll back that PR (which has had some follow-on PRs), but it rolls back the operational code behind the `/stats` endpoint. I suspect some race issue, combined with shared_ptr vs unique_ptr semantics, and probably a system-wide stress test with tsan will help.
Additional Description: This reverts the performance of the `/stats` endpoint for 1M stats to:
The current main has this performance:
Risk Level: medium -- had to manually tweak the rollback as it was not 100% clean
Testing: //test/...
Docs Changes: n/a
Release Notes: n/a
Platform Specific Features: n/a