stats: Improve performance of clearing scopes and histograms by batching them #15876
jmarantz merged 14 commits into envoyproxy:main
Conversation
…ing them. Previously a post was required per histogram or scope, per thread. This greatly reduces the overhead of large config updates and when tens of thousands of histograms and scopes are queued for release in short order. Co-authored-by: Joshua Marantz <jmarantz@google.com> Signed-off-by: Josh Tway <josh.tway@stackpath.com>
Hi @jtway, welcome and thank you for your contribution. We will try to review your Pull Request as quickly as possible. In the meantime, please take a look at the contribution guidelines if you have not done so already.
Looks like you have an issue with the test for the main file you are modifying. You might want to iterate locally before pushing to CI.
@jmarantz it didn't fail locally, but I did see this. A colleague was able to recreate it, so I will be looking in the morning.
jmarantz left a comment
It makes sense that this speeds things up significantly (though not by an order of magnitude) by saving a lot of indirect function calls. I don't think it reduces the number of times we need to take a lock, but it does reduce the indirect function calls through post().
The fact that control-plane operations still take tens of seconds seems bad to me, though, and makes me wonder if we need more systems-level thinking about this problem of scale.
if (!shutting_down_ && main_thread_dispatcher_) {
  const uint64_t scope_id = scope->scope_id_;
  // Switch to batching of clearing scopes as mentioned in:
  // https://gist.github.com/jmarantz/838cb6de7e74c0970ea6b63eded0139a
I think we don't need to call out the gist anymore since we have incorporated the code.
Will remove as I find and fix why the integration test failed in the pipeline.
auto central_caches = std::make_shared<std::vector<CentralCacheEntrySharedPtr>>();
{
  Thread::LockGuard lock(lock_);
  scope_ids->swap(scopes_to_cleanup_);
I always liked this swap() pattern, but @antoniovicente has convinced me it's faster to do:
*scope_ids = std::move(scopes_to_cleanup_);
scopes_to_cleanup_.clear();
as there is no chance in this context that *scope_ids had anything in it. You can use that in place of all the swap()s added in this PR.
Note you do need that clear() because std::move does not strictly guarantee what state it leaves the source in, but it should be a fast no-op.
Interesting, I never thought about using something like clear() to return a container to a "known" state after std::move.
Yeah, I believe that is the only legal operation you can do on an object that has just had its contents moved, other than destroying it.
The pattern used elsewhere is:
*scope_ids = std::move(scopes_to_cleanup_);
ASSERT(scopes_to_cleanup_.empty());
See envoy/source/common/event/dispatcher_impl.cc, line 370 in c6c2a87.
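For reference, here is a minimal, self-contained sketch of both patterns; the variable names follow the PR, but the standalone scaffolding (the main() and the literal values) is purely illustrative:

#include <cassert>
#include <cstdint>
#include <memory>
#include <utility>
#include <vector>

int main() {
  std::vector<uint64_t> scopes_to_cleanup = {1, 2, 3};
  auto scope_ids = std::make_shared<std::vector<uint64_t>>();

  // Pattern 1: swap(). Always correct, but exchanges both buffers even though
  // *scope_ids is known to be empty at this point.
  scope_ids->swap(scopes_to_cleanup);
  assert(scope_ids->size() == 3 && scopes_to_cleanup.empty());

  // Reset so pattern 2 starts from the same state.
  scopes_to_cleanup = std::move(*scope_ids);

  // Pattern 2: move-assign, then clear(). The clear() is needed because
  // std::move leaves the source in a valid but unspecified state; restoring
  // it to a known (empty) state is effectively a no-op.
  *scope_ids = std::move(scopes_to_cleanup);
  scopes_to_cleanup.clear();
  assert(scope_ids->size() == 3 && scopes_to_cleanup.empty());
  return 0;
}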
@jmarantz I agree the time control-plane operations take still leaves room for much improvement. I am viewing this as two problems: 1. Address how long large config updates take. 2. Address why updating a single VirtualHost, when there are > 100K VirtualHosts present, ends up being a large config update. We are still looking into the latter, but perhaps I should open an issue for that as well. In our particular case, the problem appears to come from every VirtualHost being recreated even when only a single VirtualHost is present in a VHDS update. The existing config is then destroyed after the new config is propagated, which causes the VirtualHostImpl destructor, and in turn the ScopeImpl destructor, to run for every VirtualHost in the previous config. I could be wrong there, but that appears to be the case.
@jmarantz I'm getting a few compounding problems, even when trying the fix. How I couldn't recreate this yesterday, I am unsure. I am in the process of moving, so it may take me a little time to make sure I get the right fix in here. I know the goal is to try to make sure these get wrapped up in under 7 days. Hopefully I will be able to wrap it up by then, but if not, I hope that is okay here.
/wait
Back from my PTO now, but a couple of unrelated issues have needed my attention. I should be looking to wrap this up in the next couple of days. I'm currently having a couple of issues with the integration test due to the addition of runOnAllThreads in the shutdownThreading logic.
… scopes and histograms by batching them Signed-off-by: Josh Tway <josh.tway@stackpath.com>
Looks like this broke other tests; I guess that's what I get for focusing on this one. The problem is that shutdownGlobalThreading has usually already been called by then, which means we can't use runOnAllThreads. I'm thinking this through a little.
Ping again when ready. /wait
…en shutting threading down Signed-off-by: Josh Tway <josh.tway@stackpath.com>
Signed-off-by: Josh Tway <josh.tway@stackpath.com>
Signed-off-by: Josh Tway <josh.tway@stackpath.com>
…ading calls. This way we cover both when it has been called, and when it has not Signed-off-by: Josh Tway <josh.tway@stackpath.com>
const uint64_t scope_id = scope->scope_id_;
// Switch to batching of clearing scopes. This greatly reduces the overhead when there are
// tens of thousands of scopes to clear in a short period. i.e.: VHDS updates with tens of
// thousands of VirtualHosts
nit: end sentences with period.
// cache flush operation.
if (!shutting_down_ && main_thread_dispatcher_) {
  const uint64_t scope_id = scope->scope_id_;
  // Switch to batching of clearing scopes. This greatly reduces the overhead when there are
nit: s/Switch to batching of clearing scopes/Clear scopes in a batch/ as the reader of this won't be anchoring on the previous implementation.
Signed-off-by: Josh Tway <josh.tway@stackpath.com>
…ived from jmarantz Signed-off-by: Josh Tway <josh.tway@stackpath.com>
jmarantz left a comment
I think this looks good now, but I'm biased since I helped write some of this. The big question here is whether the improved performance justifies the increased complexity, but the benefits are outlined in the description, so that's a good start. We have also seen similar issues internally, so I am hoping this will be submittable.
/assign-from @envoyproxy/senior-maintainers
@envoyproxy/senior-maintainers assignee is @snowp
Thanks @jmarantz
snowp left a comment
Thanks, I think this batching logic makes sense. Can you explain why we have all these changes to shutdownThreading et al.? It's not clear from my first pass.
if (tls_.has_value()) {
  ASSERT(tls_->isShutdown());
}
maybe do ASSERT(!tls_.has_value() || tls_->isShutdown())? Right now this turns into an empty if block in NDEBUG builds (which would get optimized out, so it doesn't matter that much)
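As a small illustration of the difference, here is a hypothetical sketch using the standard assert() in place of Envoy's ASSERT macro, with std::optional<bool> standing in for tls_. Under NDEBUG, assert() expands to nothing, so the first form leaves an empty if block behind while the second form compiles away entirely:

#include <cassert>
#include <optional>

// Form 1: in NDEBUG builds this becomes `if (...) { }` — harmless, but the
// compiler has to optimize the empty block away.
void checkShutdownVerbose(const std::optional<bool>& tls_is_shutdown) {
  if (tls_is_shutdown.has_value()) {
    assert(*tls_is_shutdown);
  }
}

// Form 2: the whole check disappears in NDEBUG builds.
void checkShutdownCompact(const std::optional<bool>& tls_is_shutdown) {
  assert(!tls_is_shutdown.has_value() || *tls_is_shutdown);
}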
}

// We can't call runOnAllThreads here as global threading has already been shutdown.
// It is okay to simply clear the scopes and central cache entries to cleanup.
Can you include an explanation for why this is okay to do?
Will update. How is:
// We can't call runOnAllThreads here as global threading has already been shutdown. It is okay
// to simply clear the scopes and central cache entries here as they will be cleaned up during
// thread local data cleanup in InstanceImpl::shutdownThread().
// Capture all the pending histograms in a local, clearing the list held in
// this. Once this occurs, if a new histogram is deleted, a new post will be
This comment is a bit hard to read because talking about "this" makes the sentence seem incomplete. Maybe rephrase? Something like: "Move the histograms pending cleanup into a local variable. Future histogram deletions will be batched until the next time this function is called."?
for (auto& stat_name_storage : stat_names_) {
  stat_name_storage->free(symbol_table_);
}
store_.shutdownThreading();
What's the purpose of this change?
In production code we were shutting down threading this way, where ThreadLocal::shutdownGlobalThreading() was called first. That means in shutdownThreading() we can't then go through the runOnAllThreads sequence to clean up; we just have to remove the elements directly.
This change, and others in some tests, are there to make tests shut down threading in the same order as production code. The ASSERT(tls_->isShutdown()) validates the claim made below it:
// We can't call runOnAllThreads here as global threading has already been shutdown.
and a lot of the trivial test changes are to avoid having that assert crash in unit tests.
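To illustrate the ordering being described, here is a minimal sketch with hypothetical stand-in types (ThreadLocalInstance and StatsStore are simplifications for this example, not Envoy's actual classes): once global threading is shut down, cross-thread posts are no longer legal, so the store must clear its pending work inline instead of via runOnAllThreads().

#include <cassert>
#include <cstdint>
#include <vector>

struct ThreadLocalInstance {
  void shutdownGlobalThreading() { shutdown_ = true; }
  bool isShutdown() const { return shutdown_; }
  bool shutdown_{false};
};

struct StatsStore {
  explicit StatsStore(ThreadLocalInstance& tls) : tls_(tls) {}
  void shutdownThreading() {
    // Mirrors the ASSERT discussed above: global threading must already be
    // shut down by the time this runs, in production and in tests alike.
    assert(tls_.isShutdown());
    scopes_to_cleanup_.clear();  // Drop pending work directly on this thread.
  }
  ThreadLocalInstance& tls_;
  std::vector<uint64_t> scopes_to_cleanup_;
};

int main() {
  ThreadLocalInstance tls;
  StatsStore store(tls);
  tls.shutdownGlobalThreading();  // Production order: global threading first...
  store.shutdownThreading();      // ...then the store, which cleans up inline.
  return 0;
}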
EXPECT_THAT(expected_json, JsonStringEq(actual_json));
store_->shutdownThreading();
shutdownThreading();
ENVOY_LOG_MISC(error, "end of StatsAsJson");
I don't think this log is necessary
// Clear scopes in a batch. This greatly reduces the overhead when there are
// tens of thousands of scopes to clear in a short period. i.e.: VHDS updates with tens of
// thousands of VirtualHosts.
bool need_post = scopes_to_cleanup_.empty();
Mind including an explanation of why we post when this is empty? I assume this is the batching mechanism, so that we don't post while there is a post in progress, but some detail would be nice to have here.
Updating releaseScopesCrossThread to have some of the same info as releaseHistogramCrossThread. This should cover this and the additional comment below. :)
// clearHistogramsFromCaches. If a new histogram is deleted before that
// post runs, we add it to our list of histograms to clear, and there's no
// need to issue another post.
I see this is explained here; it would be good to have this explanation for the above scope release :)
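Putting the two comment threads together, here is a minimal sketch of the batching mechanism as described; the class name and the trivial post() stand-in are hypothetical, where Envoy would instead post through the main thread's Dispatcher and fan out to workers:

#include <cstdint>
#include <functional>
#include <mutex>
#include <utility>
#include <vector>

class BatchedCleanup {
public:
  void releaseScopeCrossThread(uint64_t scope_id) {
    bool need_post;
    {
      std::lock_guard<std::mutex> lock(mutex_);
      // Only the first id added to an empty batch schedules a post; ids that
      // arrive before that post runs simply join the batch.
      need_post = scopes_to_cleanup_.empty();
      scopes_to_cleanup_.push_back(scope_id);
    }
    if (need_post) {
      post([this] { clearBatch(); });
    }
  }

private:
  void clearBatch() {
    std::vector<uint64_t> batch;
    {
      std::lock_guard<std::mutex> lock(mutex_);
      batch = std::move(scopes_to_cleanup_);
      scopes_to_cleanup_.clear();  // Restore the moved-from vector to a known state.
    }
    // ... clear per-thread cache entries for every id in `batch` ...
  }

  // Stand-in for Dispatcher::post(); runs the callback inline in this sketch.
  void post(std::function<void()> cb) { cb(); }

  std::mutex mutex_;
  std::vector<uint64_t> scopes_to_cleanup_;
};

The net effect is that N releases between flushes cost a single cross-thread post instead of N, which is where the speedup described in the PR comes from.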
I'm clearly missing something with this latest failure. I'm not even seeing a multiplexed_integration_test, only a multiplexed_upstream_integration_test.
Can you try to merge main?
Will do; I had been waiting for the bulk of review to finish before doing so. Pretty sure that's in the contribution guidelines. If not, I will keep things up to date in the future.
The reason I suggested that is that I think some of the tests merge main automatically first, so they can report an error in a test that you cannot even see in your client.
Signed-off-by: Josh Tway <josh.tway@stackpath.com>
Got an error because some commits don't have DCO sign-off, but they are part of what was merged from upstream. In this case it is okay to use --no-verify, right?
Looks like that was all successful.
snowp left a comment
Just one small comment, otherwise this LGTM!
// This will block both future cache fills as well as cache flushes.
shutting_down_ = true;

if (tls_.has_value()) {
Don't need this outer if check anymore
Signed-off-by: Josh Tway <josh.tway@stackpath.com>
Looks like it failed in a flaky way. I'm going to kick it.
Signed-off-by: Josh Tway <josh.tway@stackpath.com>
I did a kick-ci, but then a colleague informed me that there were some recent merges in main that might be needed, so I merged and pushed that. I hope the next go will be clean (after the current one).
You might need another main merge; I think main was broken earlier.
Signed-off-by: Josh Tway <josh.tway@stackpath.com>
Looks like I don't have the ability to rerun jobs, and I'm not sure why the Windows CI job is suddenly failing. Can someone help me out here?
Windows failed 3/7 times in a protocol integration test. Re-running. I think you can also do this via: /retest
Retrying Azure Pipelines:
Good to know! For some reason I read that part of the documentation as a comment in the code.
Looks like the retest got kicked off before the previous run finished, or there are more results that I cannot explain (as I have not seen them previously). /retest
Retrying Azure Pipelines:
All checks passed :)
@snowp it is still showing that a change is being requested, but I'm pretty sure I addressed everything you wanted.
…ing them (envoyproxy#15876) Co-authored-by: Joshua Marantz <jmarantz@google.com> Signed-off-by: Josh Tway <josh.tway@stackpath.com>
Previously a post was required per histogram or scope, per thread. This greatly reduces the overhead of large config updates and of cases where tens of thousands of histograms and scopes are queued for release in short order. This work is based off of https://gist.github.com/jmarantz/838cb6de7e74c0970ea6b63eded0139a
Co-authored-by: Joshua Marantz <jmarantz@google.com>
Signed-off-by: Josh Tway <josh.tway@stackpath.com>
Additional Description: Came across the original patch by @jmarantz while investigating why updating a single VirtualHost via VHDS would frequently take 50 seconds or more to propagate. It turns out this was caused by the large number of Scopes being cleared on each update. Batching the scope and histogram ids queued for removal improves this to under 20 seconds, often below 10 seconds.
Risk Level: Low
Testing: Before-and-after measurement of how long a VHDS update took when > 100K VirtualHosts were present.
Docs Changes: N/A
Release Notes: N/A
Platform Specific Features: N/A