stats: Repro & fix admin stats crash by jmarantz · Pull Request #20855 · envoyproxy/envoy

jmarantz · 2022-04-16T04:36:22Z

Commit Message: Fix the Stats::Scope destruct/iterate race by holding a weak_ptr<Scope> in the scopes_ hash table in ThreadLocalStore. Also adds a GUARDED_BY thread annotation to the scopes_ hash table and refactors a bit to ensure thread safety across all accesses. The thread-safety analysis needs more-than-usual annotation assistance for two reasons:

* the analysis system does not see that ThreadLocalStoreImpl::lock_ and ThreadLocalStoreImpl::ScopeImpl::parent_.lock_ are the same.
* in safeMakeStat call-sites, for code-sharing reasons, we need to take a reference to the guarded central_cache_ entry before we decide whether we need to take the lock that protects it, so we need to disable analysis in that case. This way we can share the code that finds stats in the TLS-cache without taking locks.

A couple of helper methods, centralCacheLockHeld() and centralCacheNoThreadAnalysis(), were added to allow analysis to run with minimally scoped annotations.
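
A minimal sketch of how two such helpers might look, using Clang's thread-safety attributes directly. This is not the code from this PR: the `TSA` macro, the `Mutex` wrapper, the `Store`/`Scope` structs, and the `int` stand-in for the central cache entry are all assumptions made for illustration. Only the helper names come from the description above.

```cpp
#include <cassert>
#include <mutex>

// Hypothetical macro: expands to a Clang thread-safety attribute, or to
// nothing on compilers that lack the analysis.
#if defined(__clang__)
#define TSA(x) __attribute__((x))
#else
#define TSA(x)
#endif

// Hypothetical annotated mutex wrapper (BasicLockable, so std::lock_guard works).
class TSA(capability("mutex")) Mutex {
public:
  void lock() TSA(acquire_capability()) { mu_.lock(); }
  void unlock() TSA(release_capability()) { mu_.unlock(); }

private:
  std::mutex mu_;
};

// Stand-in for the store that owns the lock.
struct Store {
  Mutex lock_;
};

// Stand-in for a scope whose cache entry is guarded by the parent's lock.
class Scope {
public:
  explicit Scope(Store& parent) : parent_(parent) {}

  // Asserts to the analysis that parent_.lock_ is held; the caller is
  // responsible for actually holding it.
  int& centralCacheLockHeld() TSA(assert_capability(parent_.lock_)) { return central_cache_; }

  // Opts this one accessor out of the analysis, so a caller can take the
  // reference before deciding whether to lock.
  int& centralCacheNoThreadAnalysis() TSA(no_thread_safety_analysis) { return central_cache_; }

  Store& parent_;

private:
  int central_cache_ TSA(guarded_by(parent_.lock_)) = 0;
};

// Locks first, then uses the "lock held" accessor.
int bumpUnderLock(Scope& scope) {
  std::lock_guard<Mutex> guard(scope.parent_.lock_);
  return ++scope.centralCacheLockHeld();
}

// Takes the reference first (no access yet), then locks before touching it.
int bumpViaEarlyReference(Scope& scope) {
  int& entry = scope.centralCacheNoThreadAnalysis();
  std::lock_guard<Mutex> guard(scope.parent_.lock_);
  return ++entry;
}
```

Keeping the opt-out confined to one small accessor leaves the rest of the class under analysis, which is the "minimally scoped annotations" idea described above.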

A testcase was added which duplicates the race between looping over the stats for admin, and creating/destroying scopes, using the fast /stats implementation that was disconnected in prod in #20835. This PR leaves the fast implementation disconnected, but fixes it. A separate PR will roll back #20835 after this lands.

Additional Description: The repro can spot the race by reverting the definition of ThreadLocalStoreImpl::forEachScope to its prior state, taking into account that scopes_ is now a map<ScopeImpl*, weak_ptr<ScopeImpl>> rather than a set<ScopeImpl*>.

  for (auto iter : scopes_) {
    f_scope(*(iter.first));
  }

Then the test/server/admin:stats_handler_test test will fail with

[ RUN      ] ThreadedTest.Threaded
terminate called after throwing an instance of 'std::bad_weak_ptr'
  what():  bad_weak_ptr
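
For contrast with the reverted loop above, here is a minimal sketch of iterating a registry keyed by raw pointer with `weak_ptr` values. It is an illustration under assumptions, not the PR's implementation: `Scope`, `Store`, `createScope`, and `sumLiveScopeIds` are hypothetical stand-ins. The point is that each entry is promoted with `weak_ptr::lock()` while the mutex is held, expired entries are skipped, and the callback runs on strong references after the mutex is released.

```cpp
#include <cassert>
#include <map>
#include <memory>
#include <mutex>
#include <utility>
#include <vector>

// Hypothetical stand-in for a stats scope.
struct Scope {
  int id;
};

class Store {
public:
  // Registers a new scope; the registry only holds a weak reference to it.
  std::shared_ptr<Scope> createScope(int id) {
    auto scope = std::make_shared<Scope>(Scope{id});
    std::lock_guard<std::mutex> lock(lock_);
    scopes_[scope.get()] = scope;
    return scope;
  }

  // Visits every scope that is still alive.
  template <typename Fn> void forEachScope(Fn fn) {
    // Declared outside the locked block so any last reference is dropped
    // only after the mutex has been released.
    std::vector<std::shared_ptr<Scope>> live;
    {
      std::lock_guard<std::mutex> lock(lock_);
      for (auto it = scopes_.begin(); it != scopes_.end();) {
        if (auto locked = it->second.lock()) {
          live.push_back(std::move(locked)); // still alive: pin it
          ++it;
        } else {
          it = scopes_.erase(it); // owner already dropped it: prune
        }
      }
    }
    for (const auto& scope : live) {
      fn(*scope);
    }
  }

private:
  std::mutex lock_;
  std::map<Scope*, std::weak_ptr<Scope>> scopes_;
};

// Small helper used to observe which scopes were visited.
int sumLiveScopeIds(Store& store) {
  int sum = 0;
  store.forEachScope([&sum](const Scope& scope) { sum += scope.id; });
  return sum;
}
```

Dereferencing the raw key, as in the reverted loop, gives no such protection, because nothing keeps the object alive between the lookup and the call.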

Risk Level: low -- scope iteration is being fixed here, but that doesn't happen in production yet.
Testing: //test/...
Docs Changes: n/a
Release Notes: n/a
Platform Specific Features: n/a

Signed-off-by: Joshua Marantz <jmarantz@google.com>

repokitteh-read-only · 2022-04-16T04:36:25Z

As a reminder, PRs marked as draft will not be automatically assigned reviewers,
or be handled by maintainer-oncall triage.

Please mark your PR as ready when you want it to be reviewed!

🐱

Caused by: #20855 was opened by jmarantz.


mattklein123

Thanks for fixing. Just one test question/comment.

/wait

mattklein123 · 2022-04-18T16:02:58Z

test/server/admin/stats_handler_test.cc


+// Sets up a test using real threads to reproduce a race between deleting scopes
+// and iterating over them.
+class ThreadedTest : public testing::Test {


Instead of a stress style test (or in addition) is it possible to do a thread synchronizer style test that is deterministic that also fails?

I'm looking at it. I definitely don't want it to be in lieu of the real-threads stress test. One thing I often find with synchronizers is that once you fix the race, you can't trigger the synchronizer without deadlock, but I'm iterating a bit.
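
As a rough illustration of what a deterministic, synchronizer-style test means here, the sketch below parks an "iterator" thread at a named point, destroys the object it is about to visit, and then lets it continue. The `SyncPoint` class and the function name are hypothetical and written from scratch with standard-library primitives; this is not Envoy's thread synchronizer or the test added in this PR.

```cpp
#include <cassert>
#include <condition_variable>
#include <memory>
#include <mutex>
#include <thread>

// Hypothetical one-shot sync point shared by the test and the code under test.
class SyncPoint {
public:
  // Code under test: announce arrival, then block until the test releases us.
  void reached() {
    std::unique_lock<std::mutex> lock(mu_);
    arrived_ = true;
    cv_.notify_all();
    cv_.wait(lock, [this] { return released_; });
  }

  // Test: block until the other thread is parked inside reached().
  void waitUntilReached() {
    std::unique_lock<std::mutex> lock(mu_);
    cv_.wait(lock, [this] { return arrived_; });
  }

  // Test: let the parked thread continue.
  void release() {
    {
      std::lock_guard<std::mutex> lock(mu_);
      released_ = true;
    }
    cv_.notify_all();
  }

private:
  std::mutex mu_;
  std::condition_variable cv_;
  bool arrived_ = false;
  bool released_ = false;
};

// Forces the interleaving "iterator reaches the entry, owner destroys the
// scope, iterator resumes" and reports whether the iterator still visited it.
bool iteratorSeesScopeDestroyedAtSyncPoint() {
  SyncPoint sync;
  auto scope = std::make_shared<int>(7); // stand-in for a scope
  std::weak_ptr<int> entry = scope;      // what the registry would hold
  bool visited = true;

  std::thread iterator([&sync, entry, &visited] {
    sync.reached();                       // park just before visiting
    visited = (entry.lock() != nullptr);  // weak_ptr check after resuming
  });

  sync.waitUntilReached(); // iterator is now parked
  scope.reset();           // destroy the scope at exactly this point
  sync.release();
  iterator.join();
  return visited;
}
```

Because the ordering is forced rather than left to the scheduler, the outcome is the same on every run, which is the property a stress test with real threads cannot guarantee.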


mattklein123

Nice, check CI?

/wait

jmarantz · 2022-04-18T23:00:03Z

Thanks -- having a hard time sussing out the clang tidy error from the log; maybe an infra flake? I'll retest and if it still fails I'll look more carefully.

/retest

repokitteh-read-only · 2022-04-18T23:00:07Z

Retrying Azure Pipelines:
Check envoy-presubmit isn't fully completed, but will still attempt retrying.
Retried failed jobs in: envoy-presubmit

🐱

Caused by: a #20855 (comment) was created by @jmarantz.


jmarantz · 2022-04-18T23:57:29Z

/retest

repokitteh-read-only · 2022-04-18T23:57:33Z

Retrying Azure Pipelines:
Retried failed jobs in: envoy-presubmit

🐱

Caused by: a #20855 (comment) was created by @jmarantz.


mattklein123

Thanks!

Commit Message: Rolls back the rollback PR #20835, re-enabling fast admin stats, now that #20855 has landed.

Additional Description: This brings the performance back to this state:

```
------------------------------------------------------------------
Benchmark                        Time             CPU   Iterations
------------------------------------------------------------------
BM_AllCountersText             467 ms          466 ms            2
BM_UsedCountersText           37.0 ms         37.0 ms           19
BM_FilteredCountersText       1793 ms         1792 ms            1
BM_AllCountersJson             504 ms          504 ms            1
BM_UsedCountersJson           37.2 ms         37.2 ms           19
BM_FilteredCountersJson       1839 ms         1839 ms            1
```

So: around half a second of CPU burst for 1M json & text stats, rather than 1.8 seconds for text and 4.7 seconds for json. We still have a std::regex bottleneck when a filter is specified.

Risk Level: medium -- this re-enables calling of code that previously had races, though #20855 reproduced and fixed them.
Testing: //test/...
Docs Changes: n/a
Release Notes: n/a
Platform Specific Features: n/a

Signed-off-by: Joshua Marantz <jmarantz@google.com>


jmarantz added 2 commits April 16, 2022 00:28

repro and fix stats-handler race

3c69f0f

Signed-off-by: Joshua Marantz <jmarantz@google.com>

format

4244cb3

Signed-off-by: Joshua Marantz <jmarantz@google.com>

jmarantz added 14 commits April 16, 2022 08:58

fix deadlocks

dd7c1ce

Signed-off-by: Joshua Marantz <jmarantz@google.com>

Merge branch 'main' into repro-admin-stats-crash

63e174c

Signed-off-by: Joshua Marantz <jmarantz@google.com>

minor cleanup

dfcc250

Signed-off-by: Joshua Marantz <jmarantz@google.com>

format

966abe4

Signed-off-by: Joshua Marantz <jmarantz@google.com>

tsan issue in test fixture.

fbeb031

Signed-off-by: Joshua Marantz <jmarantz@google.com>

inline a template function to ensure it's defined.

e6f0124

Signed-off-by: Joshua Marantz <jmarantz@google.com>

cleanup thread safety assertions in header

a1c3ed1

Signed-off-by: Joshua Marantz <jmarantz@google.com>

more thread annotation cleanups

c8c99b6

Signed-off-by: Joshua Marantz <jmarantz@google.com>

final thread annotation cleanup

580e57a

Signed-off-by: Joshua Marantz <jmarantz@google.com>

privatize the central cache object.

b7ae70b

Signed-off-by: Joshua Marantz <jmarantz@google.com>

remove superfluous comment

fbc6c0f

Signed-off-by: Joshua Marantz <jmarantz@google.com>

fix const propagation compile issue on windows

d0464da

Signed-off-by: Joshua Marantz <jmarantz@google.com>

add comments & move some function definitions from h to cc

52cb554

Signed-off-by: Joshua Marantz <jmarantz@google.com>

use ranged for loop

a98fff4

Signed-off-by: Joshua Marantz <jmarantz@google.com>

jmarantz marked this pull request as ready for review April 17, 2022 19:00

jmarantz assigned mattklein123 Apr 17, 2022

jmarantz added a commit to jmarantz/envoy that referenced this pull request Apr 18, 2022

back out envoyproxy#20855

9240e9a

Signed-off-by: Joshua Marantz <jmarantz@google.com>

mattklein123 requested changes Apr 18, 2022


repokitteh-read-only bot added the waiting label Apr 18, 2022

jmarantz added 2 commits April 18, 2022 12:10

Merge branch 'main' into repro-admin-stats-crash

c814223

Signed-off-by: Joshua Marantz <jmarantz@google.com>

add thread synchronizer test

cb092df

Signed-off-by: Joshua Marantz <jmarantz@google.com>

repokitteh-read-only bot removed the waiting label Apr 18, 2022

mattklein123 reviewed Apr 18, 2022


repokitteh-read-only bot added the waiting label Apr 18, 2022

cleanup

ec8d14e

Signed-off-by: Joshua Marantz <jmarantz@google.com>

repokitteh-read-only bot removed the waiting label Apr 19, 2022

mattklein123 approved these changes Apr 19, 2022


jmarantz merged commit aa0ab4c into envoyproxy:main Apr 19, 2022

jmarantz deleted the repro-admin-stats-crash branch April 19, 2022 16:51

jmarantz mentioned this pull request Apr 19, 2022

admin: re-enable fast admin stats #20887

Merged

jmarantz merged 19 commits into envoyproxy:main from jmarantz:repro-admin-stats-crash
