Skip to content

stats: Factor out creation of cluster-stats StatNames from creation of the stats, to save CPU during xDS updates#14028

Merged
jmarantz merged 25 commits intoenvoyproxy:masterfrom
jmarantz:cluster-stats
Dec 8, 2020
Merged

stats: Factor out creation of cluster-stats StatNames from creation of the stats, to save CPU during xDS updates#14028
jmarantz merged 25 commits intoenvoyproxy:masterfrom
jmarantz:cluster-stats

Conversation

@jmarantz
Copy link
Contributor

@jmarantz jmarantz commented Nov 15, 2020

Commit Message: Adds a new set of macros in stat_macros.h for separately declaring the names of stats, to hold in a factory, and then for creating the stats based on the symbolized names. This removes a symbol-table contention bottleneck for stats that are created by name during operation. Two such instances are resolved in this PR: Cluster stats and Route filter stats.
Additional Description: There were two infrastructural issues expanding the file-count in this PR, but hopefully not the conceptual complexity:

  1. the addition of Router::Context and plumbing that through various factories and their mocks
  2. the saving of StatNames in mock structures that are subsequently used elsewhere requires that a global symbol table be used. This was not difficult as there was already a mechanism for a test-only global symbol table, but it turned out to be easiest to move the declaration of the global symbol table structure from test/mocks/stats/mocks.h to test/common/stats/stat_test_utility.h. Doing this avoided switching from IsolatedStoreImpl to the mock version, which has some subtle functional changes.

Risk Level: low
Testing: //test/...
Docs Changes: will update stats.md as part of this PR
Release Notes: n/a
Platform Specific Features: n/a

Signed-off-by: Joshua Marantz <jmarantz@google.com>
Signed-off-by: Joshua Marantz <jmarantz@google.com>
Signed-off-by: Joshua Marantz <jmarantz@google.com>
Signed-off-by: Joshua Marantz <jmarantz@google.com>
Signed-off-by: Joshua Marantz <jmarantz@google.com>
Signed-off-by: Joshua Marantz <jmarantz@google.com>
@jmarantz
Copy link
Contributor Author

jmarantz commented Nov 15, 2020

Hey Matt -- PTAL briefly at this new pattern of using macros to declare StatNames once, then the actual counters, gauges, etc referencing the StatNames. If you think this direction is reasonable I can flesh out with a more complete implementation and specific tests for the new macros. I just wanted to show a proof-of-concept, handling counters and gauges, and not histograms or text-readouts or unit tests for all that yet.

I believe this will save some CPU in xDS handling in cases with large cluster sizes. There may be other places where this is needed.

This really needs a benchmark to prove it will help, which is tracked in #14005 .

See also #8831 which addresses the same issue. The regex overhead that I think that one saves (or at least provides the pattern to save) is quite obvious on the flame graphs we've seen, but this is showing up as well.

Signed-off-by: Joshua Marantz <jmarantz@google.com>
#define MAKE_STATS_STRUCT_STATNAME_HELPER_(name)
#define GENERATE_STATNAME_STRUCT(name)

/**
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Note that MAKE_STAT_NAMES_STRUCT and MAKE_STATS_STRUCT do not follow the pattern of the existing macros as they require distinct initializers based on data constructed from the ctor params. There's also a bit of hackery where the HELPER_ functions above depend on formal-param names established in other macros. I couldn't figure out how to make something like your FINISH_DECL macros work for this.

Copy link
Member

@mattklein123 mattklein123 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

At a high level this makes sense to me. LMK when ready to review!

/wait

: scope_(scope) \
ALL_STATS(MAKE_STATS_STRUCT_COUNTER_HELPER_, MAKE_STATS_STRUCT_GAUGE_HELPER_, \
MAKE_STATS_STRUCT_STATNAME_HELPER_) {} \
Envoy::Stats::Scope& scope_; \
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does holding the scope reference here actually do anything?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No. this was done for syntactic purposes only as I needed the macro to have a leading comma. I should add comment about that unfortunate hack.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wound up replacing the scope here (which we do not need) with a ref to the StatNames struct, which turns out to be useful (though not essential).

Signed-off-by: Joshua Marantz <jmarantz@google.com>
@jmarantz
Copy link
Contributor Author

Thanks @mattklein123 -- I may back-burner this for a little while as I went back to the flame-graph and am not convinced yet whether the overhead from the original macros is material. Also I was able to get this to work ok for one example, but I don't have a good answer yet for how to structure the class and macros to elegantly handle the 15 possible combinations of a subsystem needing counters, gauges, histograms, or text-readouts.

I will circle back after we get some flame-graphs with a hot regex converted to RE2 (#8831 merged now) and maybe also a CDS speed test (#14005)

Signed-off-by: Joshua Marantz <jmarantz@google.com>
…d locks in worker threads on updates.

Signed-off-by: Joshua Marantz <jmarantz@google.com>
Signed-off-by: Joshua Marantz <jmarantz@google.com>
… doesn't have to be resymbolized each time Router::FilterConfig is instantiated.

Signed-off-by: Joshua Marantz <jmarantz@google.com>
Signed-off-by: Joshua Marantz <jmarantz@google.com>
@jmarantz jmarantz marked this pull request as ready for review December 2, 2020 17:04
@jmarantz
Copy link
Contributor Author

jmarantz commented Dec 2, 2020

@mattklein123 this is worth a pass. I should add more info about the macros to stats.md but I can do this in parallel to taking a look at the code.

I was motivated to pick this up again by a flame graph we found in a high-thread-count system that exposed a bottleneck on counterFromString on the main thread in Router::FilterConfig's stats.

Signed-off-by: Joshua Marantz <jmarantz@google.com>
*
* @return the stat names.
*/
virtual const Router::RouterStatNames& routerStatNames() const PURE;
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we hang the routerStatNames off the clusterManagerFactory because we happen to have that available in the place where we construct the router's FilterConfig. It's arguably not too clean. I could also just add a RouterContext and plumb that through the chain of constructors.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

From a cleanliness/tech debt perspective, we are trying to remove protocol specific stuff from cluster manager (@alyssawilk just started on some of this in terms of moving hard coded HTTP logic out to extensions).

It seems to me the best way to do this would be create a singleton somewhere where the router config is created? Then it would just be created once and lookup?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think I have a better place to put it: the Http::Context. I thought of that because of your comment about 'hard coded HTTP logic". Does that seem reasonable?

And I have an Http::Context available to me where I needed the router stat names without excess plumbing.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is very router filter specific though. Can't it just be a singleton provided by the router filter factory?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Per slack, this is not the router filter, it was in source/common/router.

I earlier had this PR in a state where I had put the names needed for Router::FilterStats into Http::Context, and that was pretty straightforward, but wasn't great for separation. Now I have a Router::Context. Plumbing that through was trivial but involved a lot of file edits due to parallel universes for mocks, validation servers, integration test servers, and whatnot.

LMK what you think; I'm OK trying this out as a prod singleton too, but I think it would involve a few implicit dependencies that are now explicit (if verbose). For example in every test method we'd need to re-init the singleton with the test's symbol table.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah sorry I had forgotten the AsyncClient angle to this. Given that this is basically part of the core I'm fine putting it wherever you think is easiest.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it's a bit cleaner to have the router stats in their won context, though it expanded the number of files in the PR a bit.

Copy link
Member

@mattklein123 mattklein123 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks, flushing out a few comments.

/wait

Comment on lines +364 to +366
* ClusterManager. This helps factor out some relatively heavy name
* construction which occur when there is a large CDS update during operation,
* relative to recreating all stats from strings on-the-fly.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Wouldn't these be Cluster/upstream stats and not cluster manager stats which I think would be global and created once?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes, will rename.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I reverted the ClusterManager stats to their previous state.

*
* @return the stat names.
*/
virtual const Router::RouterStatNames& routerStatNames() const PURE;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

From a cleanliness/tech debt perspective, we are trying to remove protocol specific stuff from cluster manager (@alyssawilk just started on some of this in terms of moving hard coded HTTP logic out to extensions).

It seems to me the best way to do this would be create a singleton somewhere where the router config is created? Then it would just be created once and lookup?

Signed-off-by: Joshua Marantz <jmarantz@google.com>
…uter stat names to pass to async clients.

Signed-off-by: Joshua Marantz <jmarantz@google.com>
Signed-off-by: Joshua Marantz <jmarantz@google.com>
Signed-off-by: Joshua Marantz <jmarantz@google.com>
Signed-off-by: Joshua Marantz <jmarantz@google.com>
Signed-off-by: Joshua Marantz <jmarantz@google.com>
Signed-off-by: Joshua Marantz <jmarantz@google.com>
…ocess.

Signed-off-by: Joshua Marantz <jmarantz@google.com>
…able.

Signed-off-by: Joshua Marantz <jmarantz@google.com>
@jmarantz
Copy link
Contributor Author

jmarantz commented Dec 4, 2020

I think this is now ready for another round.

Signed-off-by: Joshua Marantz <jmarantz@google.com>
Copy link
Member

@mattklein123 mattklein123 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks LGTM with small questions and comments.

/wait

…ent.* naming works.

Signed-off-by: Joshua Marantz <jmarantz@google.com>
Copy link
Member

@mattklein123 mattklein123 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks!

@jmarantz
Copy link
Contributor Author

jmarantz commented Dec 8, 2020

Thanks for the review of this PR that evolved into having so many files!

@jmarantz jmarantz merged commit 8188e23 into envoyproxy:master Dec 8, 2020
@jmarantz jmarantz deleted the cluster-stats branch December 8, 2020 01:39
mpuncel added a commit to mpuncel/envoy that referenced this pull request Dec 8, 2020
* master: (41 commits)
  event: Remove a source of non-determinism by always running deferred deletion before post callbacks (envoyproxy#14293)
  Fix TSAN bug in integration test (envoyproxy#14327)
  tracing: Add hostname to Zipkin config.  (envoyproxy#14186) (envoyproxy#14187)
  [conn_pool] fix use after free in H/1 connection pool (envoyproxy#14220)
  lua: update deprecated lua_open to luaL_newstate (envoyproxy#14297)
  extension: use bool_flag to control extension link (envoyproxy#14240)
  stats: Factor out creation of cluster-stats StatNames from creation of the stats, to save CPU during xDS updates (envoyproxy#14028)
  test: add scaled timer integration test (envoyproxy#14290)
  [Win32 Signals] Add term and ctrl-c signal handlers (envoyproxy#13954)
  config: v2 transport API fatal-by-default. (envoyproxy#14223)
  matcher: fix UB bug caused by dereferencing a bad optional (envoyproxy#14271)
  test: putting fake upstream config in a struct (envoyproxy#14266)
  wasm: use Bazel rules from Proxy-Wasm Rust SDK. (envoyproxy#14292)
  docs: fix typo (envoyproxy#14237)
  dependencies: allowlist CVE-2018-21270 to prevent false positives. (envoyproxy#14294)
  typo in redis doc (envoyproxy#14248)
  access_loggers: removed redundant dep (envoyproxy#14274)
  fix http2 flaky test (envoyproxy#14261)
  test: disable flaky xds_integration_test. (envoyproxy#14287)
  http: add functionality to configure kill header in KillRequest proto (envoyproxy#14288)
  ...

Signed-off-by: Michael Puncel <mpuncel@squareup.com>
jmarantz added a commit that referenced this pull request Dec 10, 2020
)

Signed-off-by: Joshua Marantz <jmarantz@google.com>

Commit Message: Capture StatNames in the factory for extra categories of cluster stats. This is a follow-up to #14028 which established the stat-macro infrastructure needed to make this easy most of the time.
Additional Description: This still leaves behind the stats for circuit breakers, which are difficult to convert to the new stat-name macro-creation structure due to some pretty creative use of the existing macros. Resolving that can be a follow-up, which I think might require abandoning the macros for circuit-breaker stats and just using direct API calls.
Risk Level: low
Testing: //test/...
Docs Changes: n/a
Release Notes: n/a
Platform Specific Features: n/a
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants