Skip to content

Stats: Filter stats to be flushed to sinks#18805

Merged
jmarantz merged 25 commits intoenvoyproxy:mainfrom
pradeepcrao:stats_filter2
Nov 30, 2021
Merged

Stats: Filter stats to be flushed to sinks#18805
jmarantz merged 25 commits intoenvoyproxy:mainfrom
pradeepcrao:stats_filter2

Conversation

@pradeepcrao
Copy link
Copy Markdown
Contributor

@pradeepcrao pradeepcrao commented Oct 27, 2021

Provide the ability to filter stats to be flushed to sinks to reduce CPU usage for the periodic stats aggregation process.

Commit Message:

Additional Description:

Envoy stats are periodically flushed to stat sinks (default cadence of 5s) on the main thread. The number of stats scales linearly with the number of clusters, as approximately 100 stats are replicated for each cluster. For high counts of clusters (order of 10k), the flushing of stats dominates CPU usage on the main thread. Being tied up in stats flushing can prevent the main thread from processing xDS updates in a timely manner, or even starve worker threads of CPU if the CPU is overcommitted.

Usually, the number of stats of interest can be an order of magnitude lower than the number of stats. There is a mechanism to reject unwanted stats, but doing so will also make them unavailable for viewing in the admin console, which could hinder debuggability. Further, Envoy actually needs some of its stats to run (see for eg. #14610) which is currently an open bug.

See the design doc below for more details:
https://docs.google.com/document/d/1lzMvRlU5xY0yezpqA75N6kU747GY7I_WeGpBXiPaP5M/edit#heading=h.xgjl2srtytjt

Risk Level: Low
Testing: Added tests
Docs Changes: NA
Release Notes: NA
Platform Specific Features: NA

See below benchmark results from //test/server:server_stats_flush_benchmark

----------------------------------------------------------------------------------
Benchmark                                        Time             CPU   Iterations
----------------------------------------------------------------------------------
bmFlushToSinks/10                            0.003 ms        0.003 ms       247626
bmFlushToSinks/100                           0.019 ms        0.019 ms        36474
bmFlushToSinks/1000                          0.193 ms        0.193 ms         3622
bmFlushToSinks/10000                          2.25 ms         2.25 ms          299
bmFlushToSinks/100000                         61.8 ms         61.8 ms           10
bmFlushToSinks/1000000                        1212 ms         1212 ms            1
bmFlushToSinksWithPredicatesSet/10           0.001 ms        0.001 ms       496056
bmFlushToSinksWithPredicatesSet/100          0.007 ms        0.007 ms       104775
bmFlushToSinksWithPredicatesSet/1000         0.067 ms        0.067 ms        10411
bmFlushToSinksWithPredicatesSet/10000        0.704 ms        0.704 ms          938
bmFlushToSinksWithPredicatesSet/100000        28.0 ms         28.0 ms           25
bmFlushToSinksWithPredicatesSet/1000000        484 ms          484 ms            2

Signed-off-by: Pradeep Rao <pcrao@google.com>
Signed-off-by: Pradeep Rao <pcrao@google.com>
Signed-off-by: Pradeep Rao <pcrao@google.com>
@snowp
Copy link
Copy Markdown
Contributor

snowp commented Oct 28, 2021

Is there an associated ticket/discussion around this change? Wondering when you'd use this over the existing stats exclusion list which would also reduce the amount of stats being passed to the sink

@jmarantz
Copy link
Copy Markdown
Contributor

jmarantz commented Oct 29, 2021

Pradeep has a doc, and I think there's an issue this PR can be associated with.

In my mind there are three reasons to have this mechanism on top of disabling stats:

  • disabling stats can break functionality (there's a separate bug tracking this work)
  • disabling stats can, depending on the matcher-spec, still be somewhat expensive in memory/time
  • we may want to see some of the stats on individual envoys via admin even though we don't want to sink all 100*NumClusters stats into our monitoring infrastructure. There's separate bugs/draft PRs to make that possible even if there are 1M stats.

Pradeep can you link in the related stuff to the PR description?

Copy link
Copy Markdown
Contributor

@jmarantz jmarantz left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

/wait


void setSinkPredicates(std::unique_ptr<Envoy::Stats::SinkPredicates> sink_predicates) override {
sink_predicates_ = std::move(sink_predicates);
stats_store_.setSinkPredicates(*sink_predicates_);
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

per above I think we should store this in the allocator.

Or is the issue that we don't have a stats allocator at the time this is called?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've stored it in the ThreadLocalStore, as we will need it there for histograms later.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

could we leave it in the allocator and reference it from thread_local_store? You could in theory make more than one thread_local_store on the same allocator and then you'd have a bit of a pickle :)

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed.

Signed-off-by: Pradeep Rao <pcrao@google.com>
Signed-off-by: Pradeep Rao <pcrao@google.com>
Copy link
Copy Markdown
Contributor

@jmarantz jmarantz left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is looking generally great. I was thinking we should peel off the histograms to a separate PR as there are some deeper things to discuss there, and it's kind of orthogonal to most of this PR.

In particular, I think we have concerns that if we gather histograms and don't sink them, they'll grow in memory without bound. We might need to look deeper into the implementation or do some memory tests to see if that's a concern.

There were still a few unaddressed comments from the previous review.

Finally, you will want to run the format fixer next time you push so you can get CI to run.

virtual void forEachTextReadout(std::function<void(std::size_t)> f_size,
std::function<void(Stats::TextReadout&)> f_stat) const PURE;

virtual void forEachSinkedCounter(std::function<void(std::size_t)> f_size,
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

optional: wdyt of having using nicknames for all these function types? They occur a lot and are quite verbose. We may, in the future, want to change the signature and it would be slightly less annoying if we have nicknames. E.g. we might decide to pass the SharedPtr variants or we might want to have the function return a bool to indicate whether to continue. I'm not suggesting we do that now; it just would be easier to change in the future if we have the nickname.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

virtual void forEachHistogram(std::function<void(std::size_t)> f_size,
std::function<void(Stats::ParentHistogram&)> f_stat) const PURE;
/**
* Iterate over all stats that need to be flushed for sink.
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

doxygen params for these

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My older comments were actually not correct :-( Fixed.

/**
* Set the predicates to filter stats for sink.
*/
virtual void setSinkPredicates(std::unique_ptr<SinkPredicates> sink_predicates) PURE;
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think I understand why this is legal, but in these scenarios elsewhere in Envoy we specify && here. I did not see any instances of interfaces where we pass a unique_ptr by value in an interface.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed.

}

TEST_F(AllocatorImplTest, ForEachSinkedCounter) {

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

remove this blank line

}

TEST_F(AllocatorImplTest, ForEachSinkedGauge) {

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

rm blank line

Copy link
Copy Markdown
Contributor

@jmarantz jmarantz left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

/wait

Signed-off-by: Pradeep Rao <pcrao@google.com>
Signed-off-by: Pradeep Rao <pcrao@google.com>
@jmarantz
Copy link
Copy Markdown
Contributor

jmarantz commented Nov 9, 2021

compile failure from linux release build:

/opt/llvm/bin/../include/c++/v1/memory:1419:19: error: invalid application of 'sizeof' to an incomplete type 'Envoy::Stats::SinkPredicates'

You can just make 'bazel test -c opt //test/...' part of your process to avoid those :)

Signed-off-by: Pradeep Rao <pcrao@google.com>
Signed-off-by: Pradeep Rao <pcrao@google.com>
Copy link
Copy Markdown
Contributor

@jmarantz jmarantz left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks like this is heading in the right direction. Feel free to issue the PR to add the histogram support in parallel.

Can you ping me again when you've addressed all the outstanding comments and got through CI? Thanks!

/wait

Signed-off-by: Pradeep Rao <pcrao@google.com>
Signed-off-by: Pradeep Rao <pcrao@google.com>
Signed-off-by: Pradeep Rao <pcrao@google.com>
Signed-off-by: Pradeep Rao <pcrao@google.com>
Signed-off-by: Pradeep Rao <pcrao@google.com>
Copy link
Copy Markdown
Contributor

@jmarantz jmarantz left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ping when all other comments are also resolved. thanks!

/wait

server.registerCallback(ServerLifecycleNotifier::Stage::ShutdownExit, [] { FAIL(); });
server.registerCallback(ServerLifecycleNotifier::Stage::ShutdownExit,
[](Event::PostCb) { FAIL(); });
server.setSinkPredicates(std::unique_ptr<Stats::SinkPredicates>());
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

s/unique_ptr/make_unique/ in this scenario, usually.

Signed-off-by: Pradeep Rao <pcrao@google.com>
Signed-off-by: Pradeep Rao <pcrao@google.com>
jmarantz
jmarantz previously approved these changes Nov 21, 2021
Copy link
Copy Markdown
Contributor

@jmarantz jmarantz left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Per @mattklein123, as this API is only available to custom C++ builds, I think we don't have a strong need to adjust user-level doc for the admin console in this PR.

But it is worth remembering that when we do add an API to make this functionality availability to stock envoy-static, we'll also need to doc that the hot-restart and admin behavior is affected by sinked stats.

Other than other minor nits this LGTM!

alloc_.forEachCounter([&num_counters](std::size_t size) { num_counters = size; },
[&num_iterations, &stat_names](Stats::Counter& counter) {
EXPECT_EQ(stat_names.count(counter.statName()), 1);
EXPECT_NE(stat_names.find(counter.statName()), stat_names.end());
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What you have is fine, but I actually think previous state of this code, with the calls to count with EXPECT_EQ to 1 would produce a more useful error message if they fail. stat_names.end() will not have an especially readable printout.

Up to you.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed.

Signed-off-by: Pradeep Rao <pcrao@google.com>
Signed-off-by: Pradeep Rao <pcrao@google.com>
@pradeepcrao
Copy link
Copy Markdown
Contributor Author

/retest

@repokitteh-read-only
Copy link
Copy Markdown

Retrying Azure Pipelines:
Retried failed jobs in: envoy-presubmit

🐱

Caused by: a #18805 (comment) was created by @pradeepcrao.

see: more, trace.

jmarantz
jmarantz previously approved these changes Nov 22, 2021
@jmarantz
Copy link
Copy Markdown
Contributor

@ggreenway does this look good now, assuming Pradeep follows up with the API version of this and doc explaining the limitations for hot-restart?

Copy link
Copy Markdown
Member

@ggreenway ggreenway left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good overall

/wait-any

virtual bool enableReusePortDefault() PURE;

/**
* Set predicates for filtering stats to be flushed to sinks.
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does this apply to histograms? My guess is no (because they must be sinked to work properly, I believe). If that's the case, please document that.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Modified comment to indicate this works for counters, gauges and text readouts.
I will add the ability to set predicates for histograms in a follow up PR. That will only affect which histograms are available in the MetricSnapshot. The histogram merge behavior will be unchanged.

Signed-off-by: Pradeep Rao <pcrao@google.com>
Signed-off-by: Pradeep Rao <pcrao@google.com>
@jmarantz jmarantz merged commit 55a97dd into envoyproxy:main Nov 30, 2021
mpuncel added a commit to mpuncel/envoy that referenced this pull request Nov 30, 2021
* main:
  Stats: Filter stats to be flushed to sinks (envoyproxy#18805)
  build(deps): bump sphinx from 4.3.0 to 4.3.1 in /tools/base (envoyproxy#19122)
  ext-authz: fix missing UAEX flag on Denied CheckResponse (envoyproxy#18965)
  lua: support body setBytes with header content length set automatically (envoyproxy#18989)

Signed-off-by: Michael Puncel <mpuncel@squareup.com>
jmarantz pushed a commit that referenced this pull request Feb 2, 2022
Commit Message:
Additional Description:
This is a followup to #18805 that adds the ability to filter histograms to be flushed to sink. Note that this only affects which histograms are flushed and does not change which histograms are merged during every flush operation.

This is part 1 after splitting #19166 into 2.
Risk Level: Low
Testing: Added test
Docs Changes: N/A
Release Notes: N/A
Platform Specific Features: N/A

Signed-off-by: Pradeep Rao <pcrao@google.com>
joshperry pushed a commit to joshperry/envoy that referenced this pull request Feb 13, 2022
Commit Message:
Additional Description:
This is a followup to envoyproxy#18805 that adds the ability to filter histograms to be flushed to sink. Note that this only affects which histograms are flushed and does not change which histograms are merged during every flush operation.

This is part 1 after splitting envoyproxy#19166 into 2.
Risk Level: Low
Testing: Added test
Docs Changes: N/A
Release Notes: N/A
Platform Specific Features: N/A

Signed-off-by: Pradeep Rao <pcrao@google.com>
Signed-off-by: Josh Perry <josh.perry@mx.com>
jmarantz pushed a commit that referenced this pull request Jan 10, 2023
Commit Message:
Additional Description:
This is a followup to #18805 that adds the ability to filter histograms to be flushed to sink. Note that this only affects which histograms are flushed and does not change which histograms are merged during every flush operation.

Risk Level: Low
Testing: Added Tests
Docs Changes: N/A
Release Notes: included.
Platform Specific Features: N/A

Signed-off-by: Pradeep Rao <pcrao@google.com>
VishalDamgude pushed a commit to freshworks-oss/envoy that referenced this pull request Feb 1, 2023
…9166)

Commit Message:
Additional Description:
This is a followup to envoyproxy#18805 that adds the ability to filter histograms to be flushed to sink. Note that this only affects which histograms are flushed and does not change which histograms are merged during every flush operation.

Risk Level: Low
Testing: Added Tests
Docs Changes: N/A
Release Notes: included.
Platform Specific Features: N/A

Signed-off-by: Pradeep Rao <pcrao@google.com>
Signed-off-by: VishalDamgude <vishal.damgude@freshworks.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants