Skip to content

admin: roll back stats_handler.cc et al back to previous state while we sort out some sporadic crashes#20835

Merged
mattklein123 merged 6 commits intoenvoyproxy:mainfrom
jmarantz:revert-admin-stat-speedup
Apr 15, 2022
Merged

admin: roll back stats_handler.cc et al back to previous state while we sort out some sporadic crashes#20835
mattklein123 merged 6 commits intoenvoyproxy:mainfrom
jmarantz:revert-admin-stat-speedup

Conversation

@jmarantz
Copy link
Contributor

@jmarantz jmarantz commented Apr 14, 2022

Commit Message: Rolls back the operational part of #19693 which has seen some sporadic crashes in some environments. This does not fully roll back that PR (which has had some follow-on PRs) but it rolls back the operational code behind the /stats endpoint.

I suspect some race issue, combined with shared_ptr vs unique_ptr semantics, and probaby a system-wide stress test with tsan will help.
Additional Description: This reverts back the performance of the /stats endpoint for 1M stats to:

Benchmark                        Time             CPU   Iterations
------------------------------------------------------------------
BM_AllCountersText            1813 ms         1812 ms            1
BM_UsedCountersText            133 ms          133 ms            5
BM_FilteredCountersText       4510 ms         4509 ms            1
BM_AllCountersJson            4738 ms         4735 ms            1
BM_UsedCountersJson            158 ms          158 ms            4
BM_FilteredCountersJson       4504 ms         4503 ms            1

the current main has this performance:

------------------------------------------------------------------
Benchmark                        Time             CPU   Iterations
------------------------------------------------------------------
BM_AllCountersText             458 ms          458 ms            2
BM_UsedCountersText           36.1 ms         36.1 ms           19
BM_FilteredCountersText       1790 ms         1790 ms            1
BM_AllCountersJson             496 ms          496 ms            2
BM_UsedCountersJson           36.4 ms         36.4 ms           19
BM_FilteredCountersJson       1789 ms         1789 ms            1

Risk Level: medium -- had to manually tweak the rollback as it was not 100% clean
Testing: //test/...
Docs Changes: n/a
Release Notes: n/a
Platform Specific Features: n/a

…me issues that were discovered

Signed-off-by: Joshua Marantz <jmarantz@google.com>
@repokitteh-read-only
Copy link

As a reminder, PRs marked as draft will not be automatically assigned reviewers,
or be handled by maintainer-oncall triage.

Please mark your PR as ready when you want it to be reviewed!

🐱

Caused by: #20835 was opened by jmarantz.

see: more, trace.

Signed-off-by: Joshua Marantz <jmarantz@google.com>
Signed-off-by: Joshua Marantz <jmarantz@google.com>
@jmarantz jmarantz changed the title WiP revert stats_handler.cc et al back to previous state while we sort some issues that were discovered admin: rol back stats_handler.cc et al back to previous state while we sort out some sporadic crashes Apr 14, 2022
@jmarantz jmarantz changed the title admin: rol back stats_handler.cc et al back to previous state while we sort out some sporadic crashes admin: roll back stats_handler.cc et al back to previous state while we sort out some sporadic crashes Apr 14, 2022
@jmarantz jmarantz marked this pull request as ready for review April 14, 2022 21:55
Signed-off-by: Joshua Marantz <jmarantz@google.com>
Signed-off-by: Joshua Marantz <jmarantz@google.com>
Signed-off-by: Joshua Marantz <jmarantz@google.com>
Copy link
Member

@mattklein123 mattklein123 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks!

@mattklein123 mattklein123 merged commit d9f614d into envoyproxy:main Apr 15, 2022
@jmarantz jmarantz deleted the revert-admin-stat-speedup branch April 15, 2022 15:46
vehre-x41 pushed a commit to vehre-x41/envoy that referenced this pull request Apr 19, 2022
…we sort out some sporadic crashes (envoyproxy#20835)

Signed-off-by: Joshua Marantz <jmarantz@google.com>
Signed-off-by: Andre Vehreschild <vehre@x41-dsec.de>
jmarantz added a commit that referenced this pull request Apr 19, 2022
Commit Message: Fix Stats::Scope destruct/iterate race by holding onto a weak_ptr<Scope> the scopes_ hash-table in ThreadLocalStore. Also adds GUARDED_BY thread annotation to the scopes_ hash table and refactors a bit to ensure thread safety across all accesses. The thread-safety analysis needs more-than-usual annotation assistance for two reasons:
 * the analysis system does not see that `ThreadLocalStoreImpl::lock_` and `ThreadLocalStoreImpl::ScopeImpl::parent_.lock_` are the same.
 * in safeMakeStat call-sites, for code-sharing reasons, we need to take a reference to the guarded `central_cache_` entry before we decide whether we need to take the lock the protects it, so we need to disable analysis in that case. This way we can share the code that finds stats in the TLS-cache without taking locks.

A couple of helper methods,`centralCacheLockHeld()` and `centralCacheNoThreadAnalysis()`, were added to allow analysis to run with minimally scoped annotations.

A testcase was added which duplicates the race between looping over the stats for admin, and creating/destroying scopes, using the fast /stats implementation that was disconnected in prod in #20835. This PR leaves the fast implementation disconnected, but fixes it. A separate PR will roll back #20835 after this lands.

Additional Description: The repro can spot the race by reverting the definition of `ThreadLocalStoreImpl::forEachScope` to to its prior state, taking into account that `scopes_` is now a `map<ScopeImpl*, weak_ptr<ScopeImpl>` rather than a `set<ScopeImpl>`.
```
  for (auto iter : scopes_) {
    f_scope(*(iter.first));
  }
```
 Then test/server/admin:stats_handler_test, test will fail wtih
```
 RUN      ] ThreadedTest.Threaded
terminate called after throwing an instance of 'std::bad_weak_ptr'
  what():  bad_weak_ptr
```
Risk Level: low -- scope iteration is being fixed here, but that doesn't happen in production yet.
Testing: //test/...
Docs Changes: n/a
Release Notes: n/a
Platform Specific Features: n/a

Signed-off-by: Joshua Marantz <jmarantz@google.com>
jmarantz added a commit that referenced this pull request Apr 20, 2022
Commit Message: Rolls back the rollback PR #20835 , re-enabling fast admin stats, now that #20855 has landed.
Additional Description:
This brings the performance back to this state:
```
------------------------------------------------------------------
Benchmark                        Time             CPU   Iterations
------------------------------------------------------------------
BM_AllCountersText             467 ms          466 ms            2
BM_UsedCountersText           37.0 ms         37.0 ms           19
BM_FilteredCountersText       1793 ms         1792 ms            1
BM_AllCountersJson             504 ms          504 ms            1
BM_UsedCountersJson           37.2 ms         37.2 ms           19
BM_FilteredCountersJson       1839 ms         1839 ms            1
```
So: around half a second of CPU burst for 1M json & text stats, rather than 1.8 seconds for text and 4.7 seconds for json. We still have a std::regex bottleneck when a filter is specified.
Risk Level: medium -- this re-enables calling of code that previously had had races, though #20855 repro'd and fixes them.
Testing: //test/...
Docs Changes: n/a
Release Notes: n/a
Platform Specific Features: n/a

Signed-off-by: Joshua Marantz <jmarantz@google.com>
ravenblackx pushed a commit to ravenblackx/envoy that referenced this pull request Jun 8, 2022
…we sort out some sporadic crashes (envoyproxy#20835)

Signed-off-by: Joshua Marantz <jmarantz@google.com>
ravenblackx pushed a commit to ravenblackx/envoy that referenced this pull request Jun 8, 2022
Commit Message: Fix Stats::Scope destruct/iterate race by holding onto a weak_ptr<Scope> the scopes_ hash-table in ThreadLocalStore. Also adds GUARDED_BY thread annotation to the scopes_ hash table and refactors a bit to ensure thread safety across all accesses. The thread-safety analysis needs more-than-usual annotation assistance for two reasons:
 * the analysis system does not see that `ThreadLocalStoreImpl::lock_` and `ThreadLocalStoreImpl::ScopeImpl::parent_.lock_` are the same.
 * in safeMakeStat call-sites, for code-sharing reasons, we need to take a reference to the guarded `central_cache_` entry before we decide whether we need to take the lock the protects it, so we need to disable analysis in that case. This way we can share the code that finds stats in the TLS-cache without taking locks.

A couple of helper methods,`centralCacheLockHeld()` and `centralCacheNoThreadAnalysis()`, were added to allow analysis to run with minimally scoped annotations.

A testcase was added which duplicates the race between looping over the stats for admin, and creating/destroying scopes, using the fast /stats implementation that was disconnected in prod in envoyproxy#20835. This PR leaves the fast implementation disconnected, but fixes it. A separate PR will roll back envoyproxy#20835 after this lands.

Additional Description: The repro can spot the race by reverting the definition of `ThreadLocalStoreImpl::forEachScope` to to its prior state, taking into account that `scopes_` is now a `map<ScopeImpl*, weak_ptr<ScopeImpl>` rather than a `set<ScopeImpl>`.
```
  for (auto iter : scopes_) {
    f_scope(*(iter.first));
  }
```
 Then test/server/admin:stats_handler_test, test will fail wtih
```
 RUN      ] ThreadedTest.Threaded
terminate called after throwing an instance of 'std::bad_weak_ptr'
  what():  bad_weak_ptr
```
Risk Level: low -- scope iteration is being fixed here, but that doesn't happen in production yet.
Testing: //test/...
Docs Changes: n/a
Release Notes: n/a
Platform Specific Features: n/a

Signed-off-by: Joshua Marantz <jmarantz@google.com>
ravenblackx pushed a commit to ravenblackx/envoy that referenced this pull request Jun 8, 2022
Commit Message: Rolls back the rollback PR envoyproxy#20835 , re-enabling fast admin stats, now that envoyproxy#20855 has landed.
Additional Description:
This brings the performance back to this state:
```
------------------------------------------------------------------
Benchmark                        Time             CPU   Iterations
------------------------------------------------------------------
BM_AllCountersText             467 ms          466 ms            2
BM_UsedCountersText           37.0 ms         37.0 ms           19
BM_FilteredCountersText       1793 ms         1792 ms            1
BM_AllCountersJson             504 ms          504 ms            1
BM_UsedCountersJson           37.2 ms         37.2 ms           19
BM_FilteredCountersJson       1839 ms         1839 ms            1
```
So: around half a second of CPU burst for 1M json & text stats, rather than 1.8 seconds for text and 4.7 seconds for json. We still have a std::regex bottleneck when a filter is specified.
Risk Level: medium -- this re-enables calling of code that previously had had races, though envoyproxy#20855 repro'd and fixes them.
Testing: //test/...
Docs Changes: n/a
Release Notes: n/a
Platform Specific Features: n/a

Signed-off-by: Joshua Marantz <jmarantz@google.com>
oschaaf pushed a commit to maistra/envoy that referenced this pull request Oct 26, 2022
Commit Message: Fix Stats::Scope destruct/iterate race by holding onto a weak_ptr<Scope> the scopes_ hash-table in ThreadLocalStore. Also adds GUARDED_BY thread annotation to the scopes_ hash table and refactors a bit to ensure thread safety across all accesses. The thread-safety analysis needs more-than-usual annotation assistance for two reasons:
 * the analysis system does not see that `ThreadLocalStoreImpl::lock_` and `ThreadLocalStoreImpl::ScopeImpl::parent_.lock_` are the same.
 * in safeMakeStat call-sites, for code-sharing reasons, we need to take a reference to the guarded `central_cache_` entry before we decide whether we need to take the lock the protects it, so we need to disable analysis in that case. This way we can share the code that finds stats in the TLS-cache without taking locks.

A couple of helper methods,`centralCacheLockHeld()` and `centralCacheNoThreadAnalysis()`, were added to allow analysis to run with minimally scoped annotations.

A testcase was added which duplicates the race between looping over the stats for admin, and creating/destroying scopes, using the fast /stats implementation that was disconnected in prod in envoyproxy/envoy#20835. This PR leaves the fast implementation disconnected, but fixes it. A separate PR will roll back #20835 after this lands.

Additional Description: The repro can spot the race by reverting the definition of `ThreadLocalStoreImpl::forEachScope` to to its prior state, taking into account that `scopes_` is now a `map<ScopeImpl*, weak_ptr<ScopeImpl>` rather than a `set<ScopeImpl>`.
```
  for (auto iter : scopes_) {
    f_scope(*(iter.first));
  }
```
 Then test/server/admin:stats_handler_test, test will fail wtih
```
 RUN      ] ThreadedTest.Threaded
terminate called after throwing an instance of 'std::bad_weak_ptr'
  what():  bad_weak_ptr
```
Risk Level: low -- scope iteration is being fixed here, but that doesn't happen in production yet.
Testing: //test/...
Docs Changes: n/a
Release Notes: n/a
Platform Specific Features: n/a

Signed-off-by: Joshua Marantz <jmarantz@google.com>
oschaaf pushed a commit to maistra/envoy that referenced this pull request Oct 26, 2022
Commit Message: Rolls back the rollback PR envoyproxy/envoy#20835 , re-enabling fast admin stats, now that envoyproxy/envoy#20855 has landed.
Additional Description:
This brings the performance back to this state:
```
------------------------------------------------------------------
Benchmark                        Time             CPU   Iterations
------------------------------------------------------------------
BM_AllCountersText             467 ms          466 ms            2
BM_UsedCountersText           37.0 ms         37.0 ms           19
BM_FilteredCountersText       1793 ms         1792 ms            1
BM_AllCountersJson             504 ms          504 ms            1
BM_UsedCountersJson           37.2 ms         37.2 ms           19
BM_FilteredCountersJson       1839 ms         1839 ms            1
```
So: around half a second of CPU burst for 1M json & text stats, rather than 1.8 seconds for text and 4.7 seconds for json. We still have a std::regex bottleneck when a filter is specified.
Risk Level: medium -- this re-enables calling of code that previously had had races, though #20855 repro'd and fixes them.
Testing: //test/...
Docs Changes: n/a
Release Notes: n/a
Platform Specific Features: n/a

Signed-off-by: Joshua Marantz <jmarantz@google.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants