Stats: Filter stats to be flushed to sinks#18805
Conversation
Signed-off-by: Pradeep Rao <pcrao@google.com>
Signed-off-by: Pradeep Rao <pcrao@google.com>
Signed-off-by: Pradeep Rao <pcrao@google.com>
|
Is there an associated ticket/discussion around this change? Wondering when you'd use this over the existing stats exclusion list which would also reduce the amount of stats being passed to the sink |
|
Pradeep has a doc, and I think there's an issue this PR can be associated with. In my mind there are three reasons to have this mechanism on top of disabling stats:
Pradeep can you link in the related stuff to the PR description? |
source/server/server.h
Outdated
|
|
||
| void setSinkPredicates(std::unique_ptr<Envoy::Stats::SinkPredicates> sink_predicates) override { | ||
| sink_predicates_ = std::move(sink_predicates); | ||
| stats_store_.setSinkPredicates(*sink_predicates_); |
There was a problem hiding this comment.
per above I think we should store this in the allocator.
Or is the issue that we don't have a stats allocator at the time this is called?
There was a problem hiding this comment.
I've stored it in the ThreadLocalStore, as we will need it there for histograms later.
There was a problem hiding this comment.
could we leave it in the allocator and reference it from thread_local_store? You could in theory make more than one thread_local_store on the same allocator and then you'd have a bit of a pickle :)
Signed-off-by: Pradeep Rao <pcrao@google.com>
Signed-off-by: Pradeep Rao <pcrao@google.com>
jmarantz
left a comment
There was a problem hiding this comment.
This is looking generally great. I was thinking we should peel off the histograms to a separate PR as there are some deeper things to discuss there, and it's kind of orthogonal to most of this PR.
In particular, I think we have concerns that if we gather histograms and don't sink them, they'll grow in memory without bound. We might need to look deeper into the implementation or do some memory tests to see if that's a concern.
There were still a few unaddressed comments from the previous review.
Finally, you will want to run the format fixer next time you push so you can get CI to run.
envoy/stats/allocator.h
Outdated
| virtual void forEachTextReadout(std::function<void(std::size_t)> f_size, | ||
| std::function<void(Stats::TextReadout&)> f_stat) const PURE; | ||
|
|
||
| virtual void forEachSinkedCounter(std::function<void(std::size_t)> f_size, |
There was a problem hiding this comment.
optional: wdyt of having using nicknames for all these function types? They occur a lot and are quite verbose. We may, in the future, want to change the signature and it would be slightly less annoying if we have nicknames. E.g. we might decide to pass the SharedPtr variants or we might want to have the function return a bool to indicate whether to continue. I'm not suggesting we do that now; it just would be easier to change in the future if we have the nickname.
envoy/stats/store.h
Outdated
| virtual void forEachHistogram(std::function<void(std::size_t)> f_size, | ||
| std::function<void(Stats::ParentHistogram&)> f_stat) const PURE; | ||
| /** | ||
| * Iterate over all stats that need to be flushed for sink. |
There was a problem hiding this comment.
My older comments were actually not correct :-( Fixed.
envoy/stats/store.h
Outdated
| /** | ||
| * Set the predicates to filter stats for sink. | ||
| */ | ||
| virtual void setSinkPredicates(std::unique_ptr<SinkPredicates> sink_predicates) PURE; |
There was a problem hiding this comment.
I think I understand why this is legal, but in these scenarios elsewhere in Envoy we specify && here. I did not see any instances of interfaces where we pass a unique_ptr by value in an interface.
| } | ||
|
|
||
| TEST_F(AllocatorImplTest, ForEachSinkedCounter) { | ||
|
|
| } | ||
|
|
||
| TEST_F(AllocatorImplTest, ForEachSinkedGauge) { | ||
|
|
Signed-off-by: Pradeep Rao <pcrao@google.com>
Signed-off-by: Pradeep Rao <pcrao@google.com>
|
compile failure from linux release build: /opt/llvm/bin/../include/c++/v1/memory:1419:19: error: invalid application of 'sizeof' to an incomplete type 'Envoy::Stats::SinkPredicates' You can just make 'bazel test -c opt //test/...' part of your process to avoid those :) |
Signed-off-by: Pradeep Rao <pcrao@google.com>
Signed-off-by: Pradeep Rao <pcrao@google.com>
jmarantz
left a comment
There was a problem hiding this comment.
Looks like this is heading in the right direction. Feel free to issue the PR to add the histogram support in parallel.
Can you ping me again when you've addressed all the outstanding comments and got through CI? Thanks!
/wait
Signed-off-by: Pradeep Rao <pcrao@google.com>
Signed-off-by: Pradeep Rao <pcrao@google.com>
Signed-off-by: Pradeep Rao <pcrao@google.com>
Signed-off-by: Pradeep Rao <pcrao@google.com>
Signed-off-by: Pradeep Rao <pcrao@google.com>
jmarantz
left a comment
There was a problem hiding this comment.
ping when all other comments are also resolved. thanks!
/wait
| server.registerCallback(ServerLifecycleNotifier::Stage::ShutdownExit, [] { FAIL(); }); | ||
| server.registerCallback(ServerLifecycleNotifier::Stage::ShutdownExit, | ||
| [](Event::PostCb) { FAIL(); }); | ||
| server.setSinkPredicates(std::unique_ptr<Stats::SinkPredicates>()); |
There was a problem hiding this comment.
s/unique_ptr/make_unique/ in this scenario, usually.
Signed-off-by: Pradeep Rao <pcrao@google.com>
Signed-off-by: Pradeep Rao <pcrao@google.com>
There was a problem hiding this comment.
Per @mattklein123, as this API is only available to custom C++ builds, I think we don't have a strong need to adjust user-level doc for the admin console in this PR.
But it is worth remembering that when we do add an API to make this functionality availability to stock envoy-static, we'll also need to doc that the hot-restart and admin behavior is affected by sinked stats.
Other than other minor nits this LGTM!
| alloc_.forEachCounter([&num_counters](std::size_t size) { num_counters = size; }, | ||
| [&num_iterations, &stat_names](Stats::Counter& counter) { | ||
| EXPECT_EQ(stat_names.count(counter.statName()), 1); | ||
| EXPECT_NE(stat_names.find(counter.statName()), stat_names.end()); |
There was a problem hiding this comment.
What you have is fine, but I actually think previous state of this code, with the calls to count with EXPECT_EQ to 1 would produce a more useful error message if they fail. stat_names.end() will not have an especially readable printout.
Up to you.
Signed-off-by: Pradeep Rao <pcrao@google.com>
Signed-off-by: Pradeep Rao <pcrao@google.com>
|
/retest |
|
Retrying Azure Pipelines: |
|
@ggreenway does this look good now, assuming Pradeep follows up with the API version of this and doc explaining the limitations for hot-restart? |
ggreenway
left a comment
There was a problem hiding this comment.
Looks good overall
/wait-any
envoy/server/instance.h
Outdated
| virtual bool enableReusePortDefault() PURE; | ||
|
|
||
| /** | ||
| * Set predicates for filtering stats to be flushed to sinks. |
There was a problem hiding this comment.
Does this apply to histograms? My guess is no (because they must be sinked to work properly, I believe). If that's the case, please document that.
There was a problem hiding this comment.
Modified comment to indicate this works for counters, gauges and text readouts.
I will add the ability to set predicates for histograms in a follow up PR. That will only affect which histograms are available in the MetricSnapshot. The histogram merge behavior will be unchanged.
Signed-off-by: Pradeep Rao <pcrao@google.com>
Signed-off-by: Pradeep Rao <pcrao@google.com>
* main: Stats: Filter stats to be flushed to sinks (envoyproxy#18805) build(deps): bump sphinx from 4.3.0 to 4.3.1 in /tools/base (envoyproxy#19122) ext-authz: fix missing UAEX flag on Denied CheckResponse (envoyproxy#18965) lua: support body setBytes with header content length set automatically (envoyproxy#18989) Signed-off-by: Michael Puncel <mpuncel@squareup.com>
Commit Message: Additional Description: This is a followup to #18805 that adds the ability to filter histograms to be flushed to sink. Note that this only affects which histograms are flushed and does not change which histograms are merged during every flush operation. This is part 1 after splitting #19166 into 2. Risk Level: Low Testing: Added test Docs Changes: N/A Release Notes: N/A Platform Specific Features: N/A Signed-off-by: Pradeep Rao <pcrao@google.com>
Commit Message: Additional Description: This is a followup to envoyproxy#18805 that adds the ability to filter histograms to be flushed to sink. Note that this only affects which histograms are flushed and does not change which histograms are merged during every flush operation. This is part 1 after splitting envoyproxy#19166 into 2. Risk Level: Low Testing: Added test Docs Changes: N/A Release Notes: N/A Platform Specific Features: N/A Signed-off-by: Pradeep Rao <pcrao@google.com> Signed-off-by: Josh Perry <josh.perry@mx.com>
Commit Message: Additional Description: This is a followup to #18805 that adds the ability to filter histograms to be flushed to sink. Note that this only affects which histograms are flushed and does not change which histograms are merged during every flush operation. Risk Level: Low Testing: Added Tests Docs Changes: N/A Release Notes: included. Platform Specific Features: N/A Signed-off-by: Pradeep Rao <pcrao@google.com>
…9166) Commit Message: Additional Description: This is a followup to envoyproxy#18805 that adds the ability to filter histograms to be flushed to sink. Note that this only affects which histograms are flushed and does not change which histograms are merged during every flush operation. Risk Level: Low Testing: Added Tests Docs Changes: N/A Release Notes: included. Platform Specific Features: N/A Signed-off-by: Pradeep Rao <pcrao@google.com> Signed-off-by: VishalDamgude <vishal.damgude@freshworks.com>
Provide the ability to filter stats to be flushed to sinks to reduce CPU usage for the periodic stats aggregation process.
Commit Message:
Additional Description:
Envoy stats are periodically flushed to stat sinks (default cadence of 5s) on the main thread. The number of stats scales linearly with the number of clusters, as approximately 100 stats are replicated for each cluster. For high counts of clusters (order of 10k), the flushing of stats dominates CPU usage on the main thread. Being tied up in stats flushing can prevent the main thread from processing xDS updates in a timely manner, or even starve worker threads of CPU if the CPU is overcommitted.
Usually, the number of stats of interest can be an order of magnitude lower than the number of stats. There is a mechanism to reject unwanted stats, but doing so will also make them unavailable for viewing in the admin console, which could hinder debuggability. Further, Envoy actually needs some of its stats to run (see for eg. #14610) which is currently an open bug.
See the design doc below for more details:
https://docs.google.com/document/d/1lzMvRlU5xY0yezpqA75N6kU747GY7I_WeGpBXiPaP5M/edit#heading=h.xgjl2srtytjt
Risk Level: Low
Testing: Added tests
Docs Changes: NA
Release Notes: NA
Platform Specific Features: NA
See below benchmark results from //test/server:server_stats_flush_benchmark