Skip to content

stats: Repro & fix admin stats crash#20855

Merged
jmarantz merged 19 commits intoenvoyproxy:mainfrom
jmarantz:repro-admin-stats-crash
Apr 19, 2022
Merged

stats: Repro & fix admin stats crash#20855
jmarantz merged 19 commits intoenvoyproxy:mainfrom
jmarantz:repro-admin-stats-crash

Conversation

@jmarantz
Copy link
Contributor

@jmarantz jmarantz commented Apr 16, 2022

Commit Message: Fix Stats::Scope destruct/iterate race by holding onto a weak_ptr the scopes_ hash-table in ThreadLocalStore. Also adds GUARDED_BY thread annotation to the scopes_ hash table and refactors a bit to ensure thread safety across all accesses. The thread-safety analysis needs more-than-usual annotation assistance for two reasons:

  • the analysis system does not see that ThreadLocalStoreImpl::lock_ and ThreadLocalStoreImpl::ScopeImpl::parent_.lock_ are the same.
  • in safeMakeStat call-sites, for code-sharing reasons, we need to take a reference to the guarded central_cache_ entry before we decide whether we need to take the lock the protects it, so we need to disable analysis in that case. This way we can share the code that finds stats in the TLS-cache without taking locks.

A couple of helper methods,centralCacheLockHeld() and centralCacheNoThreadAnalysis(), were added to allow analysis to run with minimally scoped annotations.

A testcase was added which duplicates the race between looping over the stats for admin, and creating/destroying scopes, using the fast /stats implementation that was disconnected in prod in #20835. This PR leaves the fast implementation disconnected, but fixes it. A separate PR will roll back #20835 after this lands.

Additional Description: The repro can spot the race by reverting the definition of ThreadLocalStoreImpl::forEachScope to to its prior state, taking into account that scopes_ is now a map<ScopeImpl*, weak_ptr<ScopeImpl> rather than a set<ScopeImpl>.

  for (auto iter : scopes_) {
    f_scope(*(iter.first));
  }

Then test/server/admin:stats_handler_test, test will fail wtih

 RUN      ] ThreadedTest.Threaded
terminate called after throwing an instance of 'std::bad_weak_ptr'
  what():  bad_weak_ptr

Risk Level: low -- scope iteration is being fixed here, but that doesn't happen in production yet.
Testing: //test/...
Docs Changes: n/a
Release Notes: n/a
Platform Specific Features: n/a

Signed-off-by: Joshua Marantz <jmarantz@google.com>
Signed-off-by: Joshua Marantz <jmarantz@google.com>
@repokitteh-read-only
Copy link

As a reminder, PRs marked as draft will not be automatically assigned reviewers,
or be handled by maintainer-oncall triage.

Please mark your PR as ready when you want it to be reviewed!

🐱

Caused by: #20855 was opened by jmarantz.

see: more, trace.

jmarantz added 14 commits April 16, 2022 08:58
Signed-off-by: Joshua Marantz <jmarantz@google.com>
Signed-off-by: Joshua Marantz <jmarantz@google.com>
Signed-off-by: Joshua Marantz <jmarantz@google.com>
Signed-off-by: Joshua Marantz <jmarantz@google.com>
Signed-off-by: Joshua Marantz <jmarantz@google.com>
Signed-off-by: Joshua Marantz <jmarantz@google.com>
Signed-off-by: Joshua Marantz <jmarantz@google.com>
Signed-off-by: Joshua Marantz <jmarantz@google.com>
Signed-off-by: Joshua Marantz <jmarantz@google.com>
Signed-off-by: Joshua Marantz <jmarantz@google.com>
Signed-off-by: Joshua Marantz <jmarantz@google.com>
Signed-off-by: Joshua Marantz <jmarantz@google.com>
Signed-off-by: Joshua Marantz <jmarantz@google.com>
Signed-off-by: Joshua Marantz <jmarantz@google.com>
@jmarantz jmarantz marked this pull request as ready for review April 17, 2022 19:00
jmarantz added a commit to jmarantz/envoy that referenced this pull request Apr 18, 2022
Signed-off-by: Joshua Marantz <jmarantz@google.com>
Copy link
Member

@mattklein123 mattklein123 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for fixing. Just one test question/comment.

/wait


// Sets up a test using real threads to reproduce a race between deleting scopes
// and iterating over them.
class ThreadedTest : public testing::Test {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Instead of a stress style test (or in addition) is it possible to do a thread synchronizer style test that is deterministic that also fails?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm looking at it. I definitely don't want it to be lieu of the real-threads stress test. One thing I often find with synchronizers is that once you fix the race, you can't trigger the synchronizer without deadlock, but I'm iterating a bit.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

Signed-off-by: Joshua Marantz <jmarantz@google.com>
Signed-off-by: Joshua Marantz <jmarantz@google.com>
Copy link
Member

@mattklein123 mattklein123 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice, check CI?

/wait

@jmarantz
Copy link
Contributor Author

Thanks -- having a hard time sussing out the clang tidy error from the log; maybe an infra flake? I'll retest and if it still fails I'll look more carefully.

/retest

@repokitteh-read-only
Copy link

Retrying Azure Pipelines:
Check envoy-presubmit isn't fully completed, but will still attempt retrying.
Retried failed jobs in: envoy-presubmit

🐱

Caused by: a #20855 (comment) was created by @jmarantz.

see: more, trace.

@jmarantz
Copy link
Contributor Author

/retest

@repokitteh-read-only
Copy link

Retrying Azure Pipelines:
Retried failed jobs in: envoy-presubmit

🐱

Caused by: a #20855 (comment) was created by @jmarantz.

see: more, trace.

Signed-off-by: Joshua Marantz <jmarantz@google.com>
Copy link
Member

@mattklein123 mattklein123 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks!

@jmarantz jmarantz merged commit aa0ab4c into envoyproxy:main Apr 19, 2022
@jmarantz jmarantz deleted the repro-admin-stats-crash branch April 19, 2022 16:51
jmarantz added a commit that referenced this pull request Apr 20, 2022
Commit Message: Rolls back the rollback PR #20835 , re-enabling fast admin stats, now that #20855 has landed.
Additional Description:
This brings the performance back to this state:
```
------------------------------------------------------------------
Benchmark                        Time             CPU   Iterations
------------------------------------------------------------------
BM_AllCountersText             467 ms          466 ms            2
BM_UsedCountersText           37.0 ms         37.0 ms           19
BM_FilteredCountersText       1793 ms         1792 ms            1
BM_AllCountersJson             504 ms          504 ms            1
BM_UsedCountersJson           37.2 ms         37.2 ms           19
BM_FilteredCountersJson       1839 ms         1839 ms            1
```
So: around half a second of CPU burst for 1M json & text stats, rather than 1.8 seconds for text and 4.7 seconds for json. We still have a std::regex bottleneck when a filter is specified.
Risk Level: medium -- this re-enables calling of code that previously had had races, though #20855 repro'd and fixes them.
Testing: //test/...
Docs Changes: n/a
Release Notes: n/a
Platform Specific Features: n/a

Signed-off-by: Joshua Marantz <jmarantz@google.com>
ravenblackx pushed a commit to ravenblackx/envoy that referenced this pull request Jun 8, 2022
Commit Message: Fix Stats::Scope destruct/iterate race by holding onto a weak_ptr<Scope> the scopes_ hash-table in ThreadLocalStore. Also adds GUARDED_BY thread annotation to the scopes_ hash table and refactors a bit to ensure thread safety across all accesses. The thread-safety analysis needs more-than-usual annotation assistance for two reasons:
 * the analysis system does not see that `ThreadLocalStoreImpl::lock_` and `ThreadLocalStoreImpl::ScopeImpl::parent_.lock_` are the same.
 * in safeMakeStat call-sites, for code-sharing reasons, we need to take a reference to the guarded `central_cache_` entry before we decide whether we need to take the lock the protects it, so we need to disable analysis in that case. This way we can share the code that finds stats in the TLS-cache without taking locks.

A couple of helper methods,`centralCacheLockHeld()` and `centralCacheNoThreadAnalysis()`, were added to allow analysis to run with minimally scoped annotations.

A testcase was added which duplicates the race between looping over the stats for admin, and creating/destroying scopes, using the fast /stats implementation that was disconnected in prod in envoyproxy#20835. This PR leaves the fast implementation disconnected, but fixes it. A separate PR will roll back envoyproxy#20835 after this lands.

Additional Description: The repro can spot the race by reverting the definition of `ThreadLocalStoreImpl::forEachScope` to to its prior state, taking into account that `scopes_` is now a `map<ScopeImpl*, weak_ptr<ScopeImpl>` rather than a `set<ScopeImpl>`.
```
  for (auto iter : scopes_) {
    f_scope(*(iter.first));
  }
```
 Then test/server/admin:stats_handler_test, test will fail wtih
```
 RUN      ] ThreadedTest.Threaded
terminate called after throwing an instance of 'std::bad_weak_ptr'
  what():  bad_weak_ptr
```
Risk Level: low -- scope iteration is being fixed here, but that doesn't happen in production yet.
Testing: //test/...
Docs Changes: n/a
Release Notes: n/a
Platform Specific Features: n/a

Signed-off-by: Joshua Marantz <jmarantz@google.com>
ravenblackx pushed a commit to ravenblackx/envoy that referenced this pull request Jun 8, 2022
Commit Message: Rolls back the rollback PR envoyproxy#20835 , re-enabling fast admin stats, now that envoyproxy#20855 has landed.
Additional Description:
This brings the performance back to this state:
```
------------------------------------------------------------------
Benchmark                        Time             CPU   Iterations
------------------------------------------------------------------
BM_AllCountersText             467 ms          466 ms            2
BM_UsedCountersText           37.0 ms         37.0 ms           19
BM_FilteredCountersText       1793 ms         1792 ms            1
BM_AllCountersJson             504 ms          504 ms            1
BM_UsedCountersJson           37.2 ms         37.2 ms           19
BM_FilteredCountersJson       1839 ms         1839 ms            1
```
So: around half a second of CPU burst for 1M json & text stats, rather than 1.8 seconds for text and 4.7 seconds for json. We still have a std::regex bottleneck when a filter is specified.
Risk Level: medium -- this re-enables calling of code that previously had had races, though envoyproxy#20855 repro'd and fixes them.
Testing: //test/...
Docs Changes: n/a
Release Notes: n/a
Platform Specific Features: n/a

Signed-off-by: Joshua Marantz <jmarantz@google.com>
oschaaf pushed a commit to maistra/envoy that referenced this pull request Oct 26, 2022
Commit Message: Rolls back the rollback PR envoyproxy/envoy#20835 , re-enabling fast admin stats, now that envoyproxy/envoy#20855 has landed.
Additional Description:
This brings the performance back to this state:
```
------------------------------------------------------------------
Benchmark                        Time             CPU   Iterations
------------------------------------------------------------------
BM_AllCountersText             467 ms          466 ms            2
BM_UsedCountersText           37.0 ms         37.0 ms           19
BM_FilteredCountersText       1793 ms         1792 ms            1
BM_AllCountersJson             504 ms          504 ms            1
BM_UsedCountersJson           37.2 ms         37.2 ms           19
BM_FilteredCountersJson       1839 ms         1839 ms            1
```
So: around half a second of CPU burst for 1M json & text stats, rather than 1.8 seconds for text and 4.7 seconds for json. We still have a std::regex bottleneck when a filter is specified.
Risk Level: medium -- this re-enables calling of code that previously had had races, though #20855 repro'd and fixes them.
Testing: //test/...
Docs Changes: n/a
Release Notes: n/a
Platform Specific Features: n/a

Signed-off-by: Joshua Marantz <jmarantz@google.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants