stats: Improve performance of clearing scopes and histograms by batching them #15876
jmarantz merged 14 commits into envoyproxy:main
Conversation
…ing them. Previously a post was required per histogram or scope, per thread. This greatly reduces the overhead of large config updates and when tens of thousands of histograms and scopes are queued for release in short order. Co-authored-by: Joshua Marantz <jmarantz@google.com> Signed-off-by: Josh Tway <josh.tway@stackpath.com>
Hi @jtway, welcome and thank you for your contribution. We will try to review your Pull Request as quickly as possible. In the meantime, please take a look at the contribution guidelines if you have not done so already.
Looks like you have an issue with the test for the main file you are modifying. You might want to iterate locally before pushing to CI.
@jmarantz it didn't fail locally, but I did see this. A colleague was able to recreate it, so I will be looking in the morning.
jmarantz left a comment
It makes sense that this speeds things up significantly (though not by an order of magnitude) by saving a lot of indirect function calls. I don't think it reduces the number of times we need to take a lock, but it does reduce the indirect function calls through post().
The fact that control-plane operations still take tens of seconds seems bad to me, though, and makes me wonder if we need more systems-level thinking about this problem of scale.
if (!shutting_down_ && main_thread_dispatcher_) {
  const uint64_t scope_id = scope->scope_id_;
  // Switch to batching of clearing scopes as mentioned in:
  // https://gist.github.com/jmarantz/838cb6de7e74c0970ea6b63eded0139a
I think we don't need to call out the gist anymore since we have incorporated the code.
Will remove as I find and fix why the integration test failed in the pipeline.
auto central_caches = std::make_shared<std::vector<CentralCacheEntrySharedPtr>>();
{
  Thread::LockGuard lock(lock_);
  scope_ids->swap(scopes_to_cleanup_);
I always liked this swap() pattern, but @antoniovicente has convinced me it's faster to do:
*scope_ids = std::move(scopes_to_cleanup_);
scopes_to_cleanup_.clear();
as there is no chance in this context that *scope_ids had anything in it. You can use that in place of all the swap()s added in this PR.
Note you do need that clear() because std::move does not strictly guarantee what state it leaves the source in, but it should be a fast no-op.
Interesting, I never thought about using something like clear() to return a container to a "known" state after std::move.
Yeah, I believe that is the only legal operation you can do on an object that has just had its contents moved, other than destroying it.
The pattern used elsewhere is:
*scope_ids = std::move(scopes_to_cleanup_);
ASSERT(scopes_to_cleanup_.empty());
See envoy/source/common/event/dispatcher_impl.cc, line 370 in c6c2a87.
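For reference, here is a minimal, self-contained sketch of both patterns; the variable names follow the PR, but the standalone scaffolding (the main() and the literal values) is purely illustrative:

#include <cassert>
#include <cstdint>
#include <memory>
#include <utility>
#include <vector>

int main() {
  std::vector<uint64_t> scopes_to_cleanup = {1, 2, 3};
  auto scope_ids = std::make_shared<std::vector<uint64_t>>();

  // Pattern 1: swap(). Always correct, but exchanges both buffers even though
  // *scope_ids is known to be empty at this point.
  scope_ids->swap(scopes_to_cleanup);
  assert(scope_ids->size() == 3 && scopes_to_cleanup.empty());

  // Reset so pattern 2 starts from the same state.
  scopes_to_cleanup = std::move(*scope_ids);

  // Pattern 2: move-assign, then clear(). The clear() is needed because
  // std::move leaves the source in a valid but unspecified state; restoring
  // it to a known (empty) state is effectively a no-op.
  *scope_ids = std::move(scopes_to_cleanup);
  scopes_to_cleanup.clear();
  assert(scope_ids->size() == 3 && scopes_to_cleanup.empty());
  return 0;
}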
@jmarantz I agree the time control-plane operations take still leaves room for much improvement. I am viewing this as two problems: 1. Address how long large config updates take. 2. Address why updating a single VirtualHost, when there are > 100K VirtualHosts present, ends up being a large config update. We are still looking into the latter, but perhaps I should open an issue for that as well. In our particular case, the problem appears to come from every VirtualHost being recreated even when only a single VirtualHost is present in a VHDS update. The existing config is then destroyed after the new config is propagated, which causes the VirtualHostImpl destructor, and in turn the ScopeImpl destructor, to run for every VirtualHost in the previous config. I could be wrong there, but that appears to be the case.
@jmarantz I'm getting a few compounding problems, even when trying the fix. How I couldn't recreate this yesterday, I am unsure. I am in the process of moving, so it may take me a little time to make sure I get the right fix in here. I know the goal is to try to make sure these get wrapped up in under 7 days. Hopefully I will be able to wrap it up by then, but if not, I hope that is okay here.
/wait
Back from my PTO now, but a couple of unrelated issues have needed my attention. I should be looking to wrap this up in the next couple of days. I'm currently having a couple of issues with the integration test due to the addition of runOnAllThreads in the shutdownThreading logic.
… scopes and histograms by batching them Signed-off-by: Josh Tway <josh.tway@stackpath.com>
Looks like this broke other tests; I guess that's what I get for focusing on this one. The problem is that shutdownGlobalThreading has usually already been called by then, which means we can't use runOnAllThreads. I'm thinking this through a little.
Ping again when ready. /wait
…en shutting threading down Signed-off-by: Josh Tway <josh.tway@stackpath.com>
Signed-off-by: Josh Tway <josh.tway@stackpath.com>
Signed-off-by: Josh Tway <josh.tway@stackpath.com>
…ading calls. This way we cover both when it has been called, and when it has not Signed-off-by: Josh Tway <josh.tway@stackpath.com>
const uint64_t scope_id = scope->scope_id_;
// Switch to batching of clearing scopes. This greatly reduces the overhead when there are
// tens of thousands of scopes to clear in a short period. i.e.: VHDS updates with tens of
// thousands of VirtualHosts
nit: end sentences with period.
// cache flush operation.
if (!shutting_down_ && main_thread_dispatcher_) {
  const uint64_t scope_id = scope->scope_id_;
  // Switch to batching of clearing scopes. This greatly reduces the overhead when there are
nit: s/Switch to batching of clearing scopes/Clear scopes in a batch/ as the reader of this won't be anchoring on the previous implementation.
Signed-off-by: Josh Tway <josh.tway@stackpath.com>
…ived from jmarantz Signed-off-by: Josh Tway <josh.tway@stackpath.com>
jmarantz left a comment
I think this looks good now, but I'm biased since I helped write some of this. The big question here is whether the improved performance justifies the increased complexity, but the benefits are outlined in the description, so that's a good start. We have also seen similar issues internally, so I am hoping this will be submittable.
/assign-from @envoyproxy/senior-maintainers
@envoyproxy/senior-maintainers assignee is @snowp
Thanks @jmarantz
snowp left a comment
Thanks, I think this batching logic makes sense. Can you explain why we have all these changes to shutdownThreading et al.? It's not clear from my first pass.
if (tls_.has_value()) {
  ASSERT(tls_->isShutdown());
}
maybe do ASSERT(!tls_.has_value() || tls_->isShutdown())? Right now this turns into an empty if block in NDEBUG builds (which would get optimized out, so it doesn't matter that much)
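As a small illustration of the difference, here is a hypothetical sketch using the standard assert() in place of Envoy's ASSERT macro, with std::optional<bool> standing in for tls_. Under NDEBUG, assert() expands to nothing, so the first form leaves an empty if block behind while the second form compiles away entirely:

#include <cassert>
#include <optional>

// Form 1: in NDEBUG builds this becomes `if (...) { }` — harmless, but the
// compiler has to optimize the empty block away.
void checkShutdownVerbose(const std::optional<bool>& tls_is_shutdown) {
  if (tls_is_shutdown.has_value()) {
    assert(*tls_is_shutdown);
  }
}

// Form 2: the whole check disappears in NDEBUG builds.
void checkShutdownCompact(const std::optional<bool>& tls_is_shutdown) {
  assert(!tls_is_shutdown.has_value() || *tls_is_shutdown);
}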
}

// We can't call runOnAllThreads here as global threading has already been shutdown.
// It is okay to simply clear the scopes and central cache entries to cleanup.
Can you include an explanation for why this is okay to do?
Will update. How is:
// We can't call runOnAllThreads here as global threading has already been shutdown. It is okay
// to simply clear the scopes and central cache entries here as they will be cleaned up during
// thread local data cleanup in InstanceImpl::shutdownThread().
// Capture all the pending histograms in a local, clearing the list held in
// this. Once this occurs, if a new histogram is deleted, a new post will be
This comment is a bit hard to read because talking about "this" makes the sentence seem incomplete. Maybe rephrase? Something like: "Move the histograms pending cleanup into a local variable. Future histogram deletions will be batched until the next time this function is called."?
for (auto& stat_name_storage : stat_names_) {
  stat_name_storage->free(symbol_table_);
}
store_.shutdownThreading();
What's the purpose of this change?
In production code we were shutting down threading this way, where ThreadLocal::shutdownGlobalThreading() was called first. That means in shutdownThreading() we can't then go through the runOnAllThreads sequence to clean up; we just have to remove the elements directly.
This change, and others in some tests, are there to make tests shut down threading in the same order as production code. The ASSERT(tls_->isShutdown()) validates the claim made below it:
// We can't call runOnAllThreads here as global threading has already been shutdown.
and a lot of the trivial test changes are to avoid having that assert crash in unit tests.
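To illustrate the ordering being described, here is a minimal sketch with hypothetical stand-in types (ThreadLocalInstance and StatsStore are simplifications for this example, not Envoy's actual classes): once global threading is shut down, cross-thread posts are no longer legal, so the store must clear its pending work inline instead of via runOnAllThreads().

#include <cassert>
#include <cstdint>
#include <vector>

struct ThreadLocalInstance {
  void shutdownGlobalThreading() { shutdown_ = true; }
  bool isShutdown() const { return shutdown_; }
  bool shutdown_{false};
};

struct StatsStore {
  explicit StatsStore(ThreadLocalInstance& tls) : tls_(tls) {}
  void shutdownThreading() {
    // Mirrors the ASSERT discussed above: global threading must already be
    // shut down by the time this runs, in production and in tests alike.
    assert(tls_.isShutdown());
    scopes_to_cleanup_.clear();  // Drop pending work directly on this thread.
  }
  ThreadLocalInstance& tls_;
  std::vector<uint64_t> scopes_to_cleanup_;
};

int main() {
  ThreadLocalInstance tls;
  StatsStore store(tls);
  tls.shutdownGlobalThreading();  // Production order: global threading first...
  store.shutdownThreading();      // ...then the store, which cleans up inline.
  return 0;
}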
EXPECT_THAT(expected_json, JsonStringEq(actual_json));
store_->shutdownThreading();
shutdownThreading();
ENVOY_LOG_MISC(error, "end of StatsAsJson");
I don't think this log is necessary
// Clear scopes in a batch. This greatly reduces the overhead when there are
// tens of thousands of scopes to clear in a short period. i.e.: VHDS updates with tens of
// thousands of VirtualHosts.
bool need_post = scopes_to_cleanup_.empty();
Mind including an explanation of why we post when this is empty? I assume this is the batching mechanism, so that we don't post while there is a post in progress, but some detail would be nice to have here.
Updating releaseScopesCrossThread to have some of the same info as releaseHistogramCrossThread. This should cover this and the additional comment below. :)
// clearHistogramsFromCaches. If a new histogram is deleted before that
// post runs, we add it to our list of histograms to clear, and there's no
// need to issue another post.
I see this is explained here; it would be good to have this explanation for the above scope release :)
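Putting the two comment threads together, here is a minimal sketch of the batching mechanism as described; the class name and the trivial post() stand-in are hypothetical, where Envoy would instead post through the main thread's Dispatcher and fan out to workers:

#include <cstdint>
#include <functional>
#include <mutex>
#include <utility>
#include <vector>

class BatchedCleanup {
public:
  void releaseScopeCrossThread(uint64_t scope_id) {
    bool need_post;
    {
      std::lock_guard<std::mutex> lock(mutex_);
      // Only the first id added to an empty batch schedules a post; ids that
      // arrive before that post runs simply join the batch.
      need_post = scopes_to_cleanup_.empty();
      scopes_to_cleanup_.push_back(scope_id);
    }
    if (need_post) {
      post([this] { clearBatch(); });
    }
  }

private:
  void clearBatch() {
    std::vector<uint64_t> batch;
    {
      std::lock_guard<std::mutex> lock(mutex_);
      batch = std::move(scopes_to_cleanup_);
      scopes_to_cleanup_.clear();  // Restore the moved-from vector to a known state.
    }
    // ... clear per-thread cache entries for every id in `batch` ...
  }

  // Stand-in for Dispatcher::post(); runs the callback inline in this sketch.
  void post(std::function<void()> cb) { cb(); }

  std::mutex mutex_;
  std::vector<uint64_t> scopes_to_cleanup_;
};

The net effect is that N releases between flushes cost a single cross-thread post instead of N, which is where the speedup described in the PR comes from.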
I'm clearly missing something with this latest failure. I'm not even seeing a multiplexed_integration_test, only a multiplexed_upstream_integration_test.
Can you try to merge main?
Will do; I had been waiting for the bulk of review to finish before doing so. Pretty sure that's in the contribution guidelines. If not, I will keep things up to date in the future.
The reason I suggested that is that I think some of the tests merge main automatically first, so they can report an error in a test that you cannot even see in your client.
Signed-off-by: Josh Tway <josh.tway@stackpath.com>
Got an error because some commits don't have DCO sign-off, but they are part of what was merged from upstream. In this case it is okay to use --no-verify, right?
Looks like that was all successful.
snowp left a comment
Just one small comment, otherwise this LGTM!
// This will block both future cache fills as well as cache flushes.
shutting_down_ = true;

if (tls_.has_value()) {
Don't need this outer if check anymore
Signed-off-by: Josh Tway <josh.tway@stackpath.com>
Looks like it failed in a flaky way. I'm going to kick it.
Signed-off-by: Josh Tway <josh.tway@stackpath.com>
I did a kick-ci, but then a colleague informed me that there were some recent merges in main that might be needed, so I merged and pushed that. I hope the next go will be clean (after the current one).
You might need another main merge; I think main was broken earlier.
Signed-off-by: Josh Tway <josh.tway@stackpath.com>
Looks like I don't have the ability to rerun jobs, and I'm not sure why the Windows CI job is suddenly failing. Can someone help me out here?
Windows failed 3/7 times in a protocol integration test. Re-running. I think you can also do this via: /retest
Retrying Azure Pipelines:
Good to know! For some reason I read that part of the documentation as a comment in the code.
Looks like the retest got kicked off before the previous run finished, or there are more results that I cannot explain (as I have not seen them previously). /retest
Retrying Azure Pipelines:
All checks passed :)
@snowp it is still showing that a change is being requested, but I'm pretty sure I addressed everything you wanted.
…ing them (envoyproxy#15876) Co-authored-by: Joshua Marantz <jmarantz@google.com> Signed-off-by: Josh Tway <josh.tway@stackpath.com>
Previously a post was required per histogram or scope, per thread. This greatly reduces the overhead of large config updates and of cases where tens of thousands of histograms and scopes are queued for release in short order. This work is based off of https://gist.github.com/jmarantz/838cb6de7e74c0970ea6b63eded0139a
Co-authored-by: Joshua Marantz <jmarantz@google.com>
Signed-off-by: Josh Tway <josh.tway@stackpath.com>
Additional Description: Came across the original patch by @jmarantz while investigating why updating a single VirtualHost via VHDS would frequently take 50 seconds or more to propagate. It turns out this was caused by the large number of Scopes being cleared on each update. Batching the scope and histogram ids queued for removal improves this to under 20 seconds, often below 10 seconds.
Risk Level: Low
Testing: Before-and-after measurement of how long a VHDS update took when > 100K VirtualHosts were present.
Docs Changes: N/A
Release Notes: N/A
Platform Specific Features: N/A