stats: Add a new Counter Group metric type by RyanTheOptimist · Pull Request #16535 · envoyproxy/envoy

RyanTheOptimist · 2021-05-17T23:56:44Z

stats: Add a new type of metric, a CounterGroup, to represent a group of related counters.

Risk Level: Low
Testing: Various units tests
Docs Changes: N/A
Release Notes: N/A
Platform Specific Features: N/A

…ated counters.. Signed-off-by: Ryan Hamilton <rch@google.com>

Signed-off-by: Ryan Hamilton <rch@google.com>

RyanTheOptimist · 2021-05-17T23:57:24Z

/assign @jmarantz

RyanTheOptimist · 2021-05-17T23:58:04Z

@jmarantz this was a bit of a slog but I think I've got all the right pieces plumbed in. More than likely, though, I missed something or did part of this wrong :)

jmarantz

Epic! Did a quick pass and this generally looks great.

WDYT about keeping the name descriptors per earlier discussion? We'd have to manage those in the factory, so I realize it's a bit of complexity. But in the use-case you shared earlier it seemed named fields would be more useful than indexes.

include/envoy/stats/allocator.h

include/envoy/stats/stats.h

jmarantz · 2021-05-18T01:50:04Z

source/common/stats/allocator_impl.cc

+  size_t maxEntries() const override { return max_entries_; }
+
+private:
+  const size_t max_entries_;


per earlier emails, I was thinking rather than spending this 8 bytes on max_entries_, we could instead have a reference to a CounterGroupDescriptor, which would basically be a shared vector<std::string> or vector<StatName> of some sort.

This would not take any more memory, but would allow human-readable printing of stats in the admin page, and enable existing counters that are instantiated together to someday switch to this.

Good question! For the usecase I had in mind I basically would like a map from enum to counter and since an enum is an int, this would be fine. I'd expect to do something like group.inc(ENUM_FOO) or group.inc(ENUM_BAR).

That being said, I'm happy to consider carrying around the names if that's useful. I'm not quite sure that I understand the model you're thinking of... but considering I'm new to stats, this is not shocking :) When you say a "shared vector", what would the names be shared between? (Obviously entry A and entry B in the group have different names, but that's not what you're suggesting). With the "reference to a shared descriptor" approach, is the idea that instead of passing in "size_t max_entries" to all the various find(), make() from() methods we would instead pass in this shared descriptor? Would you envision the shared descriptor would be stored in some global variable that the callers of these methods could find?

I think group.inc(ENUM_BAR) is a fine model. But I was thinking about how this gets rendered in the Envoy admin console.

I'm also picturing that -- especially if you attach these quic counter-groups to hosts or clusters, there may be 10s of thousands or 100s of thousands of instances of them. But for the Quic counter-group there would only be one vector of strings holding the human-readable names.

I think the cleanest approach is that to allocate a CounterGroup you'd first need to allocate a CounterGroupDescriptor which would be constructed with a list of names, and would. hold an internal ref-count (like other stat objects). The caller would be responsible for holding onto the CounterGroupDescriptor to pass in every time a CounterGroup was allocated. If there was a QuicContext or similar available this descriptor could be put there.

I could also imagine an approach where you pass in the vector to the CounterGroup constructor, and it would do a hash lookup in the Stat::Allocator to see if that descriptor already exists so it can share it. This would be easier to use, but would involve more lookups and holding global locks longer, so I think it'd be better to use the explicit descriptor approach.

Oh, I see! I keep thinking of metrics in the context of Chrome where there is one instance of a particular histogram, not the Envoy model where there are (potentially) N copies of a metric. Ok, that makes sense. I'll create, as you say, a CounterGroupDescriptor which will contain a vector of names Should that be the full counter name, or just the suffix? I'm guessing just the suffix, but if it's the full counter name would we use that instead of the name in the symbol table as MetricImplBase uses?

I think the CounterGroupDescriptor should contain the leaf names only, IMO, so the descriptor can be shared across multiple instances associated with different clusters/hosts etc.

Then they can be concatenated during admin /stats rendering and propagation to stats sinks (e.g. Prometheus).

I haven't figured out whether StatName should be used for the leaf names, or we should just hold them in strings. We'll have to see what call-sites look like. For consistency I'd probably use StatName.

Ok, I think I have this basically working. Please take a look.

I also notice that I get test failures in //test/integration:stats_integration_test:

[ RUN ] IpVersions/ClusterMemoryTestRunner.MemoryLargeClusterSize/IPv6 test/integration/stats_integration_test.cc:291: Failure Expected: (m_per_cluster) <= (40000), actual: 40164 vs 40000 Stack trace: 0x2a1a9c2: Envoy::(anonymous namespace)::ClusterMemoryTestRunner_MemoryLargeClusterSize_Test::TestBody() 0x5c971e4: testing::internal::HandleSehExceptionsInMethodIfSupported<>() 0x5c8716e: testing::internal::HandleExceptionsInMethodIfSupported<>() 0x5c6f7b3: testing::Test::Run() 0x5c703ad: testing::TestInfo::Run() ... Google Test internal frames ...

I don't think I understand why this is happening though. I would have expected this PR to have added a new metric type, but not added any instances of it?

I haven't had a chancel to look deeply yet but I'd guess this is due to adding new maps for each ThreadLocalStore::ScopeImpl.

Ah! Yes, that makes sense. Thanks! Ok, I've update the test.

jmarantz · 2021-05-18T01:51:14Z

source/common/stats/allocator_impl.cc

+    // used(). From a system perspective this should be eventually consistent.
+    values_[index] += amount;
+    pending_values_[index] += amount;
+    flags_ |= Flags::Used;


it's worth considering whether we might want to replicate the 'used' bit per entry.

I wondered about that... What are the semantics of the used bit? Like, how is it consumed typically?

I think it's used for transferring stats across hot-restart generations. Maybe it's fine to have one used-bit for all the stats in the group. But we should socialize this with maintainers.

Sounds good. What's the best way to go about socializing this?

I'd make a github issue describing the reasons for doing this and we can ask to comment on it in Slack and also bring it up in the community meeting, which happens every 2 weeks. The next meeting is tuesday 5/25 @ 12pm eastern.

I guess the other thing this reminds me of: do we want these stats to be transferred across hot-restart generations? I'd guess the answer is 'yes', that's what would happen if they were counters.

So we'll need to add explicit code to do the RPC of the groups across the boundary, and consider corner cases like the set of counters changing between generations.

Signed-off-by: Ryan Hamilton <rch@google.com>

RyanTheOptimist

Thanks for the quick feedback!

include/envoy/stats/allocator.h

include/envoy/stats/stats.h

RyanTheOptimist · 2021-05-18T02:26:27Z

source/common/stats/allocator_impl.cc

+    // used(). From a system perspective this should be eventually consistent.
+    values_[index] += amount;
+    pending_values_[index] += amount;
+    flags_ |= Flags::Used;


I wondered about that... What are the semantics of the used bit? Like, how is it consumed typically?

RyanTheOptimist · 2021-05-18T02:43:56Z

source/common/stats/allocator_impl.cc

+  size_t maxEntries() const override { return max_entries_; }
+
+private:
+  const size_t max_entries_;


Good question! For the usecase I had in mind I basically would like a map from enum to counter and since an enum is an int, this would be fine. I'd expect to do something like group.inc(ENUM_FOO) or group.inc(ENUM_BAR).

That being said, I'm happy to consider carrying around the names if that's useful. I'm not quite sure that I understand the model you're thinking of... but considering I'm new to stats, this is not shocking :) When you say a "shared vector", what would the names be shared between? (Obviously entry A and entry B in the group have different names, but that's not what you're suggesting). With the "reference to a shared descriptor" approach, is the idea that instead of passing in "size_t max_entries" to all the various find(), make() from() methods we would instead pass in this shared descriptor? Would you envision the shared descriptor would be stored in some global variable that the callers of these methods could find?

jmarantz · 2021-05-18T12:29:45Z

I think it would also be good to proceed in parallel simply using a vector<Counter&> for your use-case. That will use more memory than a CounterGroup, but as I indicated (perhaps too subtly) in email, having even a CounterGroup for every Host even when Quic is not in use (or these stats are not needed operationally) is probably too expensive to have unconditional, and we'll need a runtime flag.

If we can land CounterGroup then that will reduce the memory overhead, but probably not by enough to avoid needing the runtime flag.

WDYT?

RyanTheOptimist

Thanks for the thoughtful discussion! This is super interesting!

I'm happy to proceed with a vector<Counter&> ("Counter&" or "Counter"? If it's a ref, where would the underlying counter be owned?) while we sort out the details of the Group api/impl. I'll move ahead with that. IIRC, you had an idea for a pattern to use for runtime guarding the initialization of metrics? If you could point me at it, that would be awesome!

RyanTheOptimist · 2021-05-18T15:47:29Z

source/common/stats/allocator_impl.cc

+  size_t maxEntries() const override { return max_entries_; }
+
+private:
+  const size_t max_entries_;


Oh, I see! I keep thinking of metrics in the context of Chrome where there is one instance of a particular histogram, not the Envoy model where there are (potentially) N copies of a metric. Ok, that makes sense. I'll create, as you say, a CounterGroupDescriptor which will contain a vector of names Should that be the full counter name, or just the suffix? I'm guessing just the suffix, but if it's the full counter name would we use that instead of the name in the symbol table as MetricImplBase uses?

RyanTheOptimist · 2021-05-18T15:48:21Z

source/common/stats/allocator_impl.cc

+    // used(). From a system perspective this should be eventually consistent.
+    values_[index] += amount;
+    pending_values_[index] += amount;
+    flags_ |= Flags::Used;


Sounds good. What's the best way to go about socializing this?

jmarantz

Stats are generally created from a Stats::Scope and are held by reference in filters or configs, and the stats last as long as the scope. This is despite the fact that the underlying allocator ref-counts them.

One place where optional stats are instantiated is here:

envoy/source/common/upstream/upstream_impl.cc

Line 756 in ea4cadc

    
           optional_cluster_stats_((config.has_track_cluster_stats() || config.track_timeout_budgets())

As you follow this trail of popcorn, you'l see that it has to use somewhat more complicated set of macros than the simple COUNTER(foo) style.

jmarantz · 2021-05-19T00:04:39Z

source/common/stats/allocator_impl.cc

+  size_t maxEntries() const override { return max_entries_; }
+
+private:
+  const size_t max_entries_;


I think the CounterGroupDescriptor should contain the leaf names only, IMO, so the descriptor can be shared across multiple instances associated with different clusters/hosts etc.

Then they can be concatenated during admin /stats rendering and propagation to stats sinks (e.g. Prometheus).

I haven't figured out whether StatName should be used for the leaf names, or we should just hold them in strings. We'll have to see what call-sites look like. For consistency I'd probably use StatName.

jmarantz · 2021-06-02T00:05:08Z

/wait

Signed-off-by: Ryan Hamilton <rch@google.com>

jmarantz

It's probably worth having @mattklein123 and @snowp look at this in terms of general direction for the stats system.

jmarantz · 2021-06-07T18:04:17Z

source/server/admin/stats_handler.cc

+  for (const Stats::CounterGroupSharedPtr& counter : server_.stats().counterGroups()) {
+    if (shouldShowMetric(*counter, used_only, regex)) {
+      for (size_t i = 0; i < counter->maxEntries(); ++i) {
+        all_stats.emplace(absl::StrCat(counter->name(), counter->nameSuffix(i)), counter->value(i));


should we add a "." in between the suffix name and the counter-group name?

this makes me wonder also whether there's a test that hits this, where you might have noticed the lack of "."

Oh, interesting. I was thinking that'd go into the suffix so we could use "_" or "." but maybe that's not useful. Ok, I changed this code to add "." and augmented the StatsHandlerTest to actually test the handlerStats() method directly. (This also led me to refactor statsHandler a bit()). What do you think?

jmarantz · 2021-06-07T18:04:36Z

source/server/admin/stats_handler.cc

    }
  }

+  for (const Stats::CounterGroupSharedPtr& counter : server_.stats().counterGroups()) {


suggest 'cgroup' rather than 'counter' for locals of this type?

Oh, good point! (Though I chose counter_group to comply with the style guide's advice to minimize abbreviation)

jmarantz · 2021-06-07T18:06:26Z

source/server/admin/stats_handler.cc

+    if (shouldShowMetric(*counter, used_only, regex)) {
+      for (size_t i = 0; i < counter->maxEntries(); ++i) {
+        all_stats.emplace(absl::StrCat(counter->name(), counter->nameSuffix(i)), counter->value(i));
+      }


just a quick note that this will explode memory during operation. There's an ongoing Issue (#16139) and some experimental PRs to address, which would affect this new loop:

Ah, good point. Shall I leave this code as-is until/unless something is merged before my PR?

Signed-off-by: Ryan Hamilton <rch@google.com>

RyanTheOptimist

Thanks!

RyanTheOptimist · 2021-06-07T18:55:21Z

source/server/admin/stats_handler.cc

    }
  }

+  for (const Stats::CounterGroupSharedPtr& counter : server_.stats().counterGroups()) {


Oh, good point! (Though I chose counter_group to comply with the style guide's advice to minimize abbreviation)

RyanTheOptimist · 2021-06-07T22:41:52Z

source/server/admin/stats_handler.cc

+    if (shouldShowMetric(*counter, used_only, regex)) {
+      for (size_t i = 0; i < counter->maxEntries(); ++i) {
+        all_stats.emplace(absl::StrCat(counter->name(), counter->nameSuffix(i)), counter->value(i));
+      }


Ah, good point. Shall I leave this code as-is until/unless something is merged before my PR?

RyanTheOptimist · 2021-06-07T22:45:05Z

source/server/admin/stats_handler.cc

+  for (const Stats::CounterGroupSharedPtr& counter : server_.stats().counterGroups()) {
+    if (shouldShowMetric(*counter, used_only, regex)) {
+      for (size_t i = 0; i < counter->maxEntries(); ++i) {
+        all_stats.emplace(absl::StrCat(counter->name(), counter->nameSuffix(i)), counter->value(i));


Oh, interesting. I was thinking that'd go into the suffix so we could use "_" or "." but maybe that's not useful. Ok, I changed this code to add "." and augmented the StatsHandlerTest to actually test the handlerStats() method directly. (This also led me to refactor statsHandler a bit()). What do you think?

RyanTheOptimist · 2021-06-07T23:18:14Z

It's probably worth having @mattklein123 and @snowp look at this in terms of general direction for the stats system.

Good point. Shall I do that once this looks good to you?

jmarantz · 2021-06-08T00:30:14Z

I think it would be good to get directional feedback from @snowp and @mattklein123 before ironing out the rest of these details. Actually we probably should've done that before you dived in; my bad for not having guided you that way. But basically at a high level this makes sense to me now.

Matt & Snow: the TLDR here I think is that this new aggregate stats structure has two benefits that I can see:

it helps the existing Quic code which increment stats by index to work naturally (though this PR is strictly not needed)
it is a more compact mechanism to represent a group of counters each of which have the same leaf name. This would save memory in other potential use-cases.

An alternative to this PR (which Ryan and I also discussed) is just to use a vector of Counter refs to work with Quic.

mattklein123

Is there a summary of a CounterGroup at a very high level? From reading just the headers I'm a little confused. It it a group of counters that all have the same name and then some numeric index attached to it? When would I want to use one? Like for HTTP status codes, etc.? Would it be possible to add a bit more info to the headers and/or update stats.md?

Assuming ^ is true it sounds reasonable to me at a high level, though I don't have a strong opinion yet on whether the added complexity to the stats system (which is already super complicated) is worth it vs. the alternative that was proposed (a vector of counters).

mattklein123 · 2021-06-08T03:43:03Z

envoy/stats/scope.h

+                                   CounterGroupDescriptorSharedPtr descriptor) PURE;
+
+  /**
+   * TODO(#6667): this variant is deprecated: use counterGroupFromStatName.


If it's deprecated can we avoid adding this for new things?

I wondered about this myself. I'm not quite sure how deprecated this variant is... it seems aspirational not actual. In particular, stats_macros.h uses it in the various POOL_*_PREFIX macros. Maybe someone has a plan to get rid of uses this method eventually but hasn't done so yet? I'm just cargo-culting from existing metric code so perhaps someone with more background in the metrics knows more?

mattklein123 · 2021-06-08T03:44:20Z

envoy/upstream/upstream.h

 * All cluster stats. @see stats_macros.h
 */
-#define ALL_CLUSTER_STATS(COUNTER, GAUGE, HISTOGRAM, TEXT_READOUT, STATNAME)                       \
+#define ALL_CLUSTER_STATS(COUNTER, GAUGE, HISTOGRAM, TEXT_READOUT, COUNTER_GROUPS, STATNAME)       \


nit: IIRC you don't need to add this if there are no COUNTER_GROUPS below, but perhaps this has changed.

Hm, that doesn't seem to work for me when I removed "COUNTER_GROUP, " from this line:

./envoy/upstream/upstream.h:648:1: error: too many arguments provided to function-like macro invocation
MAKE_STAT_NAMES_STRUCT(ClusterStatNames, ALL_CLUSTER_STATS);

In particular, MAKE_STAT_NAMES_STRUCT does:

ALL_STATS(GENERATE_STAT_NAME_STRUCT, GENERATE_STAT_NAME_STRUCT, GENERATE_STAT_NAME_STRUCT, \ GENERATE_STAT_NAME_STRUCT, GENERATE_STAT_NAME_STRUCT, GENERATE_STAT_NAME_STRUCT) \

which seems to have 1 argument for each of the different metric types. Is there something different I should be doing?

jmarantz · 2021-06-08T12:08:48Z

Matt: the counter names would render with strings in the leaf, identical to normal counters. The indexes just how Quic's platform-independent layer references the counters. In this PR (per my suggestion) the CounterGroups include a Descriptor reference which is just an array of strings.

Relative to vector<Counter&> this PR provides several tradeoffs:

Pros:

Less memory used per counter

Cons:

The used bit is common to all members of the group (this could be changed)
disabling stats through stat-matcher could only work for the entire group

If this does look good we can potentially change existing blocks of counters to use this mechanism to save memory, if desired.

RyanTheOptimist · 2021-06-08T14:38:45Z

Is there a summary of a CounterGroup at a very high level? From reading just the headers I'm a little confused. It it a group of counters that all have the same name and then some numeric index attached to it? When would I want to use one? Like for HTTP status codes, etc.? Would it be possible to add a bit more info to the headers and/or update stats.md?

Assuming ^ is true it sounds reasonable to me at a high level, though I don't have a strong opinion yet on whether the added complexity to the stats system (which is already super complicated) is worth it vs. the alternative that was proposed (a vector of counters).

Just to provide a bit of background, I'll explain the motivating use case. In Chromium, we have a metric called a "Enum Histogram". This differs from the Envoy histogram in that it is simply a map from an enum value to a count. This is commonly used when the cardinality of the enum is low. For HTTP/3, there is an important metric that we use this mechanism for. For every request which could possible have been sent over HTTP/3, we record an entry in such a map about why it did or didn't go over HTTP/3. The enum values are (from memory):

NO_RACE (a HTTP/3 connection was already available)
WON_RACE (no HTTP/3 connection was available so we raced QUIC and TCP, and QUIC won)
LOST_RACE (no HTTP/3 connection was available so we raced QUIC and TCP, and TCP won)
BROKEN (HTTP/3 was marked as broken and was not attempted)
MAP_MISSING (No alt-svc information was available for this server when the request was made, but the response included an alt-svc header advertising HTTP/3. If we had this info when we started the request we would have used HTTP/3).

It would be nice to have this information in Envoy. When I discussed this use case with @jmarantz he proposed this "counter group" approach as a more efficient representation than a collection (vector) of Counter metrics.

Does this help? I'd be happy to expand the comment for CounterGroup if that would help?

mattklein123 · 2021-06-11T02:41:57Z

Does this help? I'd be happy to expand the comment for CounterGroup if that would help?

Yes very helpful, thanks. If we decide to stick with this approach I think adding some more comments would be helpful.

My question though is what is the main benefit of this over a vector of stats? Is it really just the memory saving? How much memory do we save? I'm mostly just trying to understand if the extra complexity is worthwhile.

jmarantz · 2021-06-17T13:39:25Z

Yeah I think the next step is to quantify the memory benefit of this over just using a vector<counter&>. I think you can look at some of the tests in thread_local_store_test.cc like MemoryWithoutTlsRealSymbolTable, and make versions that use vector<counter&> vs this mechanism to quantify the difference.

I think in those context of those bounded tests, memory usage has proved byte-deterministic enough on some platforms to make the EQ versions work without flaking.

/wait

Signed-off-by: Ryan Hamilton <rch@google.com>

RyanTheOptimist · 2021-06-18T18:07:05Z

Yeah I think the next step is to quantify the memory benefit of this over just using a vector<counter&>. I think you can look at some of the tests in thread_local_store_test.cc like MemoryWithoutTlsRealSymbolTable, and make versions that use vector<counter&> vs this mechanism to quantify the difference.

I think in those context of those bounded tests, memory usage has proved byte-deterministic enough on some platforms to make the EQ versions work without flaking.

/wait

Ok, I've taken a stab at writing such a test. But I can't get it to fail, at least not locally. I think the issue is that I'm not doing a "canonical" memory build but I used bazel.release via docker CI so I'm not sure what I did wrong. Any suggestions? In any case, I tried a test with 10 counters vs a counter group of size 10.

[ RUN ] StatsThreadLocalStoreTestNoFixture.MemoryWithoutTlsRealSymbolTableFor10Counters
mem: 2080
[ OK ] StatsThreadLocalStoreTestNoFixture.MemoryWithoutTlsRealSymbolTableFor10Counters (6 ms)
[ RUN ] StatsThreadLocalStoreTestNoFixture.MemoryWithoutTlsRealSymbolTableFor10CounterGroup
mem: 1256
[ OK ] StatsThreadLocalStoreTestNoFixture.MemoryWithoutTlsRealSymbolTableFor10CounterGroup (4 ms)

I'm not sure if I'm reading this right or if the test setup is right, but it looks like the CounterGroup uses about 60% of the memory?

Signed-off-by: Ryan Hamilton <rch@google.com>

jmarantz · 2021-06-21T17:19:31Z

envoy/stats/allocator.h

+   */
+  virtual CounterGroupSharedPtr makeCounterGroup(StatName name, StatName tag_extracted_name,
+                                                 const StatNameTagVector& stat_name_tags,
+                                                 CounterGroupDescriptorSharedPtr descriptor) PURE;


Thinking about the usage model, I'm wondering where. the descriptors will come from.

I think it's probably OK to be unopinionated at this level. You could have the caller create a descriptor and then make a CounterGroup from that.

However what we want to avoid is having multiple copies of the same group, or you lose the benefit of memory savings.

If you leave this in the hands of the caller, I'm wondering whether (eg) Quic will have an appropriate context that is instantiated once per process, in which these descriptors can be explicitly instantiated and referenced. That would in some sense be ideal.

However if Quic (or other possible use-cases for this) doesn't have such a context, we might want to delegate the management of the descriptors to the Stats::Allocator. In that case, we could provide Stats::Allocator::makeCounterGroupDescriptor(...) which would take a string-list and return a canonical descriptor for that. That would need to take a lock on a the allocator's descriptor-cache, and hopefully that would not be overly contended.

Maybe this is worth a TODO here in allocator.h? WDYT?

My idea was that we'd have a global function like:

CounterGroupDescriptorSharedPtr getQuicCounterGroupDescriptor() { static CounterGroupDescriptorSharedPtr descriptor = std::make_shared<CounterGroupDescriptorImpl>({"foo", "bar", "baz"}); return descriptor; }

This way we allocate a single descriptor which is accessible to any place that want to create a counter group with the same set of suffixes. Does that sound plausible?

My personal preference would be for that to come out of a context. I think also we might want to use StatName rather than string as the representation for the enum names within the descriptor, so that they could be sent to StatName::join() without taking a symbol table lock.

Doing that, though, means you need to pass a SymbolTable& to the descriptor constructor.

And then wouldn't be able to use a static because in tests, symbol-tables would be re-built with every test method, and you wouldn't be able to destruct the symbol-table for one test method while a static held ownership of live StatNames built it.

Interesting. I don't see StatName::join() in the code, but I did see SymbolTable::join(). Is that the method you're referring to? Can you say more about where this method is called from? In particular, with this PR the CounterGroup::nameSuffix() currently returns an absl::string_view. Would it need to return a StatName?

jmarantz · 2021-06-21T19:04:00Z

I think to get a better idea of what the impact overall in memory is, we could consider hypothetically the conversion of the counters in Cluster Stats to a CounterGroup.

In envoy/upstream/upstream.h there is a macro ALL_CLUSTER_STATS with 71 counters, 12 gauges, and 2 histograms. If we converted the 71 counters to a single CounterGroup, what would be the per-cluster savings? What would be the savings overall with 10k clusters?

Note that stats_integration_test.cc for a while tracked the exact per-cluster memory usage, and it's in the neighborhood of 45-50k bytes per cluster. If we can drop a significant amount of that, then. I think this change could be material (though it might be a bunch of work).

@RyanTheOptimist do you want to do the math for that scenario?

@pradeepcrao @stevenzzzz

jmarantz · 2021-06-22T13:29:06Z

/wait

RyanTheOptimist

In envoy/upstream/upstream.h there is a macro ALL_CLUSTER_STATS with 71 counters, 12 gauges, and 2 histograms. If we converted the 71 counters to a single CounterGroup, what would be the per-cluster savings?

Ok, so I ran the two tests I mentioned earlier but with 71 counters instead of 10. The memory usage is:
Counters: 14.1K
Group: 7.8K

So that's a savings of 6.3K per cluster.

What would be the savings overall with 10k clusters?

If I've done the math right, 10K clusters saving 6.3K per cluster would save a total of 63M.

Note that stats_integration_test.cc for a while tracked the exact per-cluster memory usage, and it's in the neighborhood of 45-50k bytes per cluster. If we can drop a significant amount of that, then. I think this change could be material (though it might be a bunch of work).

Looks like it's in the vicinity of 10-15%. Would that be significant, do you think?

RyanTheOptimist · 2021-06-25T17:24:11Z

envoy/stats/allocator.h

+   */
+  virtual CounterGroupSharedPtr makeCounterGroup(StatName name, StatName tag_extracted_name,
+                                                 const StatNameTagVector& stat_name_tags,
+                                                 CounterGroupDescriptorSharedPtr descriptor) PURE;


My idea was that we'd have a global function like:

CounterGroupDescriptorSharedPtr getQuicCounterGroupDescriptor() { static CounterGroupDescriptorSharedPtr descriptor = std::make_shared<CounterGroupDescriptorImpl>({"foo", "bar", "baz"}); return descriptor; }

This way we allocate a single descriptor which is accessible to any place that want to create a counter group with the same set of suffixes. Does that sound plausible?

jmarantz

I think that is significant but would like to get the opinion of @snowp and @mattklein123 also.

Note this wouldn't come for free; it would involve a fair amount of munging of the code that references those stats, so there's a question of whether that's lower hanging fruit for memory reduction, vs other possible ideas like trimming down the classes for clusters, or loading them lazily, etc.

jmarantz · 2021-06-25T17:40:23Z

envoy/stats/allocator.h

+   */
+  virtual CounterGroupSharedPtr makeCounterGroup(StatName name, StatName tag_extracted_name,
+                                                 const StatNameTagVector& stat_name_tags,
+                                                 CounterGroupDescriptorSharedPtr descriptor) PURE;


My personal preference would be for that to come out of a context. I think also we might want to use StatName rather than string as the representation for the enum names within the descriptor, so that they could be sent to StatName::join() without taking a symbol table lock.

Doing that, though, means you need to pass a SymbolTable& to the descriptor constructor.

And then wouldn't be able to use a static because in tests, symbol-tables would be re-built with every test method, and you wouldn't be able to destruct the symbol-table for one test method while a static held ownership of live StatNames built it.

mattklein123 · 2021-06-28T03:16:44Z

I think that is significant but would like to get the opinion of @snowp and @mattklein123 also.

Seems like a nice improvement but I am worried about the code churn to make use of it. I guess my quick take is that unless we are willing to use this new type in a lot of places and do the work it's probably not worth the added complexity but I don't feel very strongly about it.

github-actions · 2021-07-28T04:01:10Z

This pull request has been automatically marked as stale because it has not had activity in the last 30 days. It will be closed in 7 days if no further activity occurs. Please feel free to give a status update now, ping for review, or re-open when it's ready. Thank you for your contributions!

github-actions · 2021-08-04T08:01:25Z

This pull request has been automatically closed because it has not had activity in the last 37 days. Please feel free to give a status update now, ping for review, or re-open when it's ready. Thank you for your contributions!

RyanTheOptimist added 4 commits May 17, 2021 23:52

Add a new type of metric, a CounterGroup, to represent a group of rel…

142f5ff

…ated counters.. Signed-off-by: Ryan Hamilton <rch@google.com>

Rename

a458c83

Signed-off-by: Ryan Hamilton <rch@google.com>

Done?

9ceac2f

Signed-off-by: Ryan Hamilton <rch@google.com>

format

53c26d6

Signed-off-by: Ryan Hamilton <rch@google.com>

repokitteh-read-only bot assigned jmarantz May 17, 2021

jmarantz reviewed May 18, 2021

View reviewed changes

Fix comments from jmarantz

2f3c7e9

Signed-off-by: Ryan Hamilton <rch@google.com>

RyanTheOptimist commented May 18, 2021

View reviewed changes

jmarantz reviewed May 19, 2021

View reviewed changes

repokitteh-read-only bot added the waiting label Jun 2, 2021

RyanTheOptimist added 4 commits June 4, 2021 20:57

Merge branch 'main' into CounterGroup

8a1a56c

Signed-off-by: Ryan Hamilton <rch@google.com>

Fix build issues post-merge

3699471

Signed-off-by: Ryan Hamilton <rch@google.com>

Descriptor

ae938ac

Signed-off-by: Ryan Hamilton <rch@google.com>

Descriptor

ab4bb65

Signed-off-by: Ryan Hamilton <rch@google.com>

repokitteh-read-only bot removed the waiting label Jun 4, 2021

RyanTheOptimist added 3 commits June 4, 2021 23:48

Fix

e5f3799

Signed-off-by: Ryan Hamilton <rch@google.com>

format

96b7bb5

Signed-off-by: Ryan Hamilton <rch@google.com>

Stats integration test

857a05f

Signed-off-by: Ryan Hamilton <rch@google.com>

jmarantz reviewed Jun 7, 2021

View reviewed changes

RyanTheOptimist added 2 commits June 7, 2021 23:06

Address comments from jmarantz

5f66fc4

Signed-off-by: Ryan Hamilton <rch@google.com>

Format

f07a458

Signed-off-by: Ryan Hamilton <rch@google.com>

RyanTheOptimist commented Jun 7, 2021

View reviewed changes

mattklein123 reviewed Jun 8, 2021

View reviewed changes

RyanTheOptimist mentioned this pull request Jun 9, 2021

stats: Add tests for StatsHandler::handlerStats() #16915

Merged

repokitteh-read-only bot added the waiting label Jun 17, 2021

RyanTheOptimist added 2 commits June 17, 2021 17:29

Merge branch 'main' into CounterGroup

6fb2b39

Signed-off-by: Ryan Hamilton <rch@google.com>

Memory usage tests

a1638b2

Signed-off-by: Ryan Hamilton <rch@google.com>

repokitteh-read-only bot removed the waiting label Jun 18, 2021

Use _LE in tests

46a6771

Signed-off-by: Ryan Hamilton <rch@google.com>

jmarantz reviewed Jun 21, 2021

View reviewed changes

repokitteh-read-only bot added the waiting label Jun 22, 2021

RyanTheOptimist commented Jun 25, 2021

View reviewed changes

jmarantz reviewed Jun 25, 2021

View reviewed changes

github-actions bot added the stale stalebot believes this issue/PR has not been touched recently label Jul 28, 2021

github-actions bot closed this Aug 4, 2021

Conversation

RyanTheOptimist commented May 17, 2021

Uh oh!

RyanTheOptimist commented May 17, 2021

Uh oh!

RyanTheOptimist commented May 17, 2021

Uh oh!

jmarantz left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

RyanTheOptimist left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

jmarantz commented May 18, 2021 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

RyanTheOptimist left a comment

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

jmarantz left a comment

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

jmarantz commented Jun 2, 2021

Uh oh!

jmarantz left a comment

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

jmarantz commented May 18, 2021 •

edited

Loading

jmarantz commented Jun 8, 2021 •

edited

Loading