admin: roll back stats_handler.cc et al back to previous state while we sort out some sporadic crashes by jmarantz · Pull Request #20835 · envoyproxy/envoy

jmarantz · 2022-04-14T20:47:07Z

Commit Message: Rolls back the operational part of #19693 which has seen some sporadic crashes in some environments. This does not fully roll back that PR (which has had some follow-on PRs) but it rolls back the operational code behind the /stats endpoint.

I suspect some race issue, combined with shared_ptr vs unique_ptr semantics, and probaby a system-wide stress test with tsan will help.
Additional Description: This reverts back the performance of the /stats endpoint for 1M stats to:

Benchmark                        Time             CPU   Iterations
------------------------------------------------------------------
BM_AllCountersText            1813 ms         1812 ms            1
BM_UsedCountersText            133 ms          133 ms            5
BM_FilteredCountersText       4510 ms         4509 ms            1
BM_AllCountersJson            4738 ms         4735 ms            1
BM_UsedCountersJson            158 ms          158 ms            4
BM_FilteredCountersJson       4504 ms         4503 ms            1

the current main has this performance:

------------------------------------------------------------------
Benchmark                        Time             CPU   Iterations
------------------------------------------------------------------
BM_AllCountersText             458 ms          458 ms            2
BM_UsedCountersText           36.1 ms         36.1 ms           19
BM_FilteredCountersText       1790 ms         1790 ms            1
BM_AllCountersJson             496 ms          496 ms            2
BM_UsedCountersJson           36.4 ms         36.4 ms           19
BM_FilteredCountersJson       1789 ms         1789 ms            1

Risk Level: medium -- had to manually tweak the rollback as it was not 100% clean
Testing: //test/...
Docs Changes: n/a
Release Notes: n/a
Platform Specific Features: n/a

…me issues that were discovered Signed-off-by: Joshua Marantz <jmarantz@google.com>

repokitteh-read-only · 2022-04-14T20:47:10Z

As a reminder, PRs marked as draft will not be automatically assigned reviewers,
or be handled by maintainer-oncall triage.

Please mark your PR as ready when you want it to be reviewed!

🐱

Caused by: #20835 was opened by jmarantz.

see: more, trace.

Signed-off-by: Joshua Marantz <jmarantz@google.com>

mattklein123

Thanks!

…we sort out some sporadic crashes (envoyproxy#20835) Signed-off-by: Joshua Marantz <jmarantz@google.com> Signed-off-by: Andre Vehreschild <vehre@x41-dsec.de>

Commit Message: Fix Stats::Scope destruct/iterate race by holding onto a weak_ptr<Scope> the scopes_ hash-table in ThreadLocalStore. Also adds GUARDED_BY thread annotation to the scopes_ hash table and refactors a bit to ensure thread safety across all accesses. The thread-safety analysis needs more-than-usual annotation assistance for two reasons: * the analysis system does not see that `ThreadLocalStoreImpl::lock_` and `ThreadLocalStoreImpl::ScopeImpl::parent_.lock_` are the same. * in safeMakeStat call-sites, for code-sharing reasons, we need to take a reference to the guarded `central_cache_` entry before we decide whether we need to take the lock the protects it, so we need to disable analysis in that case. This way we can share the code that finds stats in the TLS-cache without taking locks. A couple of helper methods,`centralCacheLockHeld()` and `centralCacheNoThreadAnalysis()`, were added to allow analysis to run with minimally scoped annotations. A testcase was added which duplicates the race between looping over the stats for admin, and creating/destroying scopes, using the fast /stats implementation that was disconnected in prod in #20835. This PR leaves the fast implementation disconnected, but fixes it. A separate PR will roll back #20835 after this lands. Additional Description: The repro can spot the race by reverting the definition of `ThreadLocalStoreImpl::forEachScope` to to its prior state, taking into account that `scopes_` is now a `map<ScopeImpl*, weak_ptr<ScopeImpl>` rather than a `set<ScopeImpl>`. ``` for (auto iter : scopes_) { f_scope(*(iter.first)); } ``` Then test/server/admin:stats_handler_test, test will fail wtih ``` RUN ] ThreadedTest.Threaded terminate called after throwing an instance of 'std::bad_weak_ptr' what(): bad_weak_ptr ``` Risk Level: low -- scope iteration is being fixed here, but that doesn't happen in production yet. Testing: //test/... Docs Changes: n/a Release Notes: n/a Platform Specific Features: n/a Signed-off-by: Joshua Marantz <jmarantz@google.com>

Commit Message: Rolls back the rollback PR #20835 , re-enabling fast admin stats, now that #20855 has landed. Additional Description: This brings the performance back to this state: ``` ------------------------------------------------------------------ Benchmark Time CPU Iterations ------------------------------------------------------------------ BM_AllCountersText 467 ms 466 ms 2 BM_UsedCountersText 37.0 ms 37.0 ms 19 BM_FilteredCountersText 1793 ms 1792 ms 1 BM_AllCountersJson 504 ms 504 ms 1 BM_UsedCountersJson 37.2 ms 37.2 ms 19 BM_FilteredCountersJson 1839 ms 1839 ms 1 ``` So: around half a second of CPU burst for 1M json & text stats, rather than 1.8 seconds for text and 4.7 seconds for json. We still have a std::regex bottleneck when a filter is specified. Risk Level: medium -- this re-enables calling of code that previously had had races, though #20855 repro'd and fixes them. Testing: //test/... Docs Changes: n/a Release Notes: n/a Platform Specific Features: n/a Signed-off-by: Joshua Marantz <jmarantz@google.com>

…we sort out some sporadic crashes (envoyproxy#20835) Signed-off-by: Joshua Marantz <jmarantz@google.com>

Commit Message: Fix Stats::Scope destruct/iterate race by holding onto a weak_ptr<Scope> the scopes_ hash-table in ThreadLocalStore. Also adds GUARDED_BY thread annotation to the scopes_ hash table and refactors a bit to ensure thread safety across all accesses. The thread-safety analysis needs more-than-usual annotation assistance for two reasons: * the analysis system does not see that `ThreadLocalStoreImpl::lock_` and `ThreadLocalStoreImpl::ScopeImpl::parent_.lock_` are the same. * in safeMakeStat call-sites, for code-sharing reasons, we need to take a reference to the guarded `central_cache_` entry before we decide whether we need to take the lock the protects it, so we need to disable analysis in that case. This way we can share the code that finds stats in the TLS-cache without taking locks. A couple of helper methods,`centralCacheLockHeld()` and `centralCacheNoThreadAnalysis()`, were added to allow analysis to run with minimally scoped annotations. A testcase was added which duplicates the race between looping over the stats for admin, and creating/destroying scopes, using the fast /stats implementation that was disconnected in prod in envoyproxy#20835. This PR leaves the fast implementation disconnected, but fixes it. A separate PR will roll back envoyproxy#20835 after this lands. Additional Description: The repro can spot the race by reverting the definition of `ThreadLocalStoreImpl::forEachScope` to to its prior state, taking into account that `scopes_` is now a `map<ScopeImpl*, weak_ptr<ScopeImpl>` rather than a `set<ScopeImpl>`. ``` for (auto iter : scopes_) { f_scope(*(iter.first)); } ``` Then test/server/admin:stats_handler_test, test will fail wtih ``` RUN ] ThreadedTest.Threaded terminate called after throwing an instance of 'std::bad_weak_ptr' what(): bad_weak_ptr ``` Risk Level: low -- scope iteration is being fixed here, but that doesn't happen in production yet. Testing: //test/... Docs Changes: n/a Release Notes: n/a Platform Specific Features: n/a Signed-off-by: Joshua Marantz <jmarantz@google.com>

Commit Message: Rolls back the rollback PR envoyproxy#20835 , re-enabling fast admin stats, now that envoyproxy#20855 has landed. Additional Description: This brings the performance back to this state: ``` ------------------------------------------------------------------ Benchmark Time CPU Iterations ------------------------------------------------------------------ BM_AllCountersText 467 ms 466 ms 2 BM_UsedCountersText 37.0 ms 37.0 ms 19 BM_FilteredCountersText 1793 ms 1792 ms 1 BM_AllCountersJson 504 ms 504 ms 1 BM_UsedCountersJson 37.2 ms 37.2 ms 19 BM_FilteredCountersJson 1839 ms 1839 ms 1 ``` So: around half a second of CPU burst for 1M json & text stats, rather than 1.8 seconds for text and 4.7 seconds for json. We still have a std::regex bottleneck when a filter is specified. Risk Level: medium -- this re-enables calling of code that previously had had races, though envoyproxy#20855 repro'd and fixes them. Testing: //test/... Docs Changes: n/a Release Notes: n/a Platform Specific Features: n/a Signed-off-by: Joshua Marantz <jmarantz@google.com>

Commit Message: Fix Stats::Scope destruct/iterate race by holding onto a weak_ptr<Scope> the scopes_ hash-table in ThreadLocalStore. Also adds GUARDED_BY thread annotation to the scopes_ hash table and refactors a bit to ensure thread safety across all accesses. The thread-safety analysis needs more-than-usual annotation assistance for two reasons: * the analysis system does not see that `ThreadLocalStoreImpl::lock_` and `ThreadLocalStoreImpl::ScopeImpl::parent_.lock_` are the same. * in safeMakeStat call-sites, for code-sharing reasons, we need to take a reference to the guarded `central_cache_` entry before we decide whether we need to take the lock the protects it, so we need to disable analysis in that case. This way we can share the code that finds stats in the TLS-cache without taking locks. A couple of helper methods,`centralCacheLockHeld()` and `centralCacheNoThreadAnalysis()`, were added to allow analysis to run with minimally scoped annotations. A testcase was added which duplicates the race between looping over the stats for admin, and creating/destroying scopes, using the fast /stats implementation that was disconnected in prod in envoyproxy/envoy#20835. This PR leaves the fast implementation disconnected, but fixes it. A separate PR will roll back #20835 after this lands. Additional Description: The repro can spot the race by reverting the definition of `ThreadLocalStoreImpl::forEachScope` to to its prior state, taking into account that `scopes_` is now a `map<ScopeImpl*, weak_ptr<ScopeImpl>` rather than a `set<ScopeImpl>`. ``` for (auto iter : scopes_) { f_scope(*(iter.first)); } ``` Then test/server/admin:stats_handler_test, test will fail wtih ``` RUN ] ThreadedTest.Threaded terminate called after throwing an instance of 'std::bad_weak_ptr' what(): bad_weak_ptr ``` Risk Level: low -- scope iteration is being fixed here, but that doesn't happen in production yet. Testing: //test/... Docs Changes: n/a Release Notes: n/a Platform Specific Features: n/a Signed-off-by: Joshua Marantz <jmarantz@google.com>

Commit Message: Rolls back the rollback PR envoyproxy/envoy#20835 , re-enabling fast admin stats, now that envoyproxy/envoy#20855 has landed. Additional Description: This brings the performance back to this state: ``` ------------------------------------------------------------------ Benchmark Time CPU Iterations ------------------------------------------------------------------ BM_AllCountersText 467 ms 466 ms 2 BM_UsedCountersText 37.0 ms 37.0 ms 19 BM_FilteredCountersText 1793 ms 1792 ms 1 BM_AllCountersJson 504 ms 504 ms 1 BM_UsedCountersJson 37.2 ms 37.2 ms 19 BM_FilteredCountersJson 1839 ms 1839 ms 1 ``` So: around half a second of CPU burst for 1M json & text stats, rather than 1.8 seconds for text and 4.7 seconds for json. We still have a std::regex bottleneck when a filter is specified. Risk Level: medium -- this re-enables calling of code that previously had had races, though #20855 repro'd and fixes them. Testing: //test/... Docs Changes: n/a Release Notes: n/a Platform Specific Features: n/a Signed-off-by: Joshua Marantz <jmarantz@google.com>

revert stats_handler.cc et al back to previous state while we sort so…

ece2d3a

…me issues that were discovered Signed-off-by: Joshua Marantz <jmarantz@google.com>

jmarantz added 2 commits April 14, 2022 17:08

re-add new test that also works with old code.

833c511

Signed-off-by: Joshua Marantz <jmarantz@google.com>

clean up

319c67b

Signed-off-by: Joshua Marantz <jmarantz@google.com>

jmarantz changed the title ~~WiP revert stats_handler.cc et al back to previous state while we sort some issues that were discovered~~ admin: rol back stats_handler.cc et al back to previous state while we sort out some sporadic crashes Apr 14, 2022

jmarantz changed the title ~~admin: rol back stats_handler.cc et al back to previous state while we sort out some sporadic crashes~~ admin: roll back stats_handler.cc et al back to previous state while we sort out some sporadic crashes Apr 14, 2022

jmarantz marked this pull request as ready for review April 14, 2022 21:55

jmarantz assigned mattklein123 and ggreenway Apr 14, 2022

jmarantz added 3 commits April 14, 2022 17:58

revert bazelversion

71418dd

Signed-off-by: Joshua Marantz <jmarantz@google.com>

coverage

5199d8d

Signed-off-by: Joshua Marantz <jmarantz@google.com>

Merge branch 'main' into revert-admin-stat-speedup

912a38a

Signed-off-by: Joshua Marantz <jmarantz@google.com>

jmarantz mentioned this pull request Apr 15, 2022

admin: Richer HTML home page with forms for params #19546

Merged

mattklein123 approved these changes Apr 15, 2022

View reviewed changes

mattklein123 merged commit d9f614d into envoyproxy:main Apr 15, 2022

jmarantz deleted the revert-admin-stat-speedup branch April 15, 2022 15:46

jmarantz mentioned this pull request Apr 17, 2022

stats: Repro & fix admin stats crash #20855

Merged

jmarantz mentioned this pull request Apr 19, 2022

admin: re-enable fast admin stats #20887

Merged

ravenblackx pushed a commit to ravenblackx/envoy that referenced this pull request Jun 8, 2022

admin: roll back stats_handler.cc et al back to previous state while …

15af494

…we sort out some sporadic crashes (envoyproxy#20835) Signed-off-by: Joshua Marantz <jmarantz@google.com>

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

admin: roll back stats_handler.cc et al back to previous state while we sort out some sporadic crashes#20835

admin: roll back stats_handler.cc et al back to previous state while we sort out some sporadic crashes#20835
mattklein123 merged 6 commits intoenvoyproxy:mainfrom
jmarantz:revert-admin-stat-speedup

jmarantz commented Apr 14, 2022 •

edited

Loading

Uh oh!

repokitteh-read-only bot commented Apr 14, 2022

Uh oh!

mattklein123 left a comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

Conversation

jmarantz commented Apr 14, 2022 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

repokitteh-read-only bot commented Apr 14, 2022

Uh oh!

mattklein123 left a comment

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

jmarantz commented Apr 14, 2022 •

edited

Loading