stats: metric expiry #40395

Merged
wbpcode merged 12 commits into envoyproxy:main from kyessenov:stats_expiry
Aug 15, 2025

Conversation

@kyessenov (Contributor) commented Jul 23, 2025

Commit Message: Introduce mark and sweep eviction of stale metrics in a stats scope.

Additional Description: The intended use case is the high cardinality metrics generated from the request data (e.g. Istio standard metrics). This in combination with the cardinality bounds (future PR) would ensure bounded metric resource usage. The algorithm works as follows:

  1. An "evictable" scope is allocated by a filter.
  2. A delta stats sink is configured, e.g. OTLP.
  3. At every flush interval, each scope metric that was used (e.g. observed a data point) is re-marked as unused; a metric that has not been used since the last flush is deleted from the central caches.
  4. A notification is sent to all workers to purge scope stale metrics from their thread-local caches.
  5. Once all workers complete, the unused metrics are purged from the allocator.
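The mark-and-sweep cycle above can be sketched single-threaded (all names and structure here are illustrative, not Envoy's actual API; the real implementation additionally coordinates steps 4-5 across worker threads):

```cpp
#include <cassert>
#include <cstdint>
#include <memory>
#include <string>
#include <unordered_map>

// Hypothetical sketch: each metric carries a "used" flag that is set on any
// write and consumed at flush time.
struct Counter {
  uint64_t value = 0;
  bool used = false;
  void inc(uint64_t n) {
    value += n;
    used = true;  // mark: any write flags the metric as used
  }
};

struct EvictableScope {
  std::unordered_map<std::string, std::shared_ptr<Counter>> central_cache;

  std::shared_ptr<Counter> counter(const std::string& name) {
    auto& c = central_cache[name];
    if (!c) c = std::make_shared<Counter>();
    return c;
  }

  // Sweep, run at every stats flush: metrics used since the last flush are
  // re-marked as unused; metrics never touched are evicted.
  size_t sweep() {
    size_t evicted = 0;
    for (auto it = central_cache.begin(); it != central_cache.end();) {
      if (it->second->used) {
        it->second->used = false;  // clear the mark for the next interval
        ++it;
      } else {
        it = central_cache.erase(it);  // stale: delete from the central cache
        ++evicted;
      }
    }
    return evicted;
  }
};
```

A metric survives exactly as long as it keeps seeing writes between flushes; anything idle for one full interval is reclaimed.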

There are several edge conditions that need to be explained to validate correctness of this algorithm:

  1. A worker attempting to use a stale metric after (3) but before (4) might have its data lost. It will not be lost if 1) the same metric is recreated in the central cache by another worker, since all metrics are uniquely indexed in the allocators; or 2) we implement deferred allocator deletions to await the flush operation.

  2. A worker should not use a stored stale metric after (4). This requires that workers not store metrics by reference (hence, this solution will not work for most xDS metrics). Thread-local cache references are always deleted before the storage is deleted.

  3. Histograms are handled slightly differently because the parent histogram needs to be "merged" to observe usage, and clearing the usage requires updating all "children" histograms. Because we do this during flush, merging is always done first.

  4. A metric that is re-created after eviction would continue having its start time set as the original metric. This is a limitation of Envoy since it does not store the metric start times, but it is not an issue with delta aggregation in OTLP. Delta is the recommended protocol for handling high cardinality or sparse metric data. We could add start_time in a follow-up.
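Edge condition (3) can be illustrated with a minimal, single-threaded sketch (hypothetical names, not Envoy's API): the parent merges its per-worker children first, so its "used" view reflects all samples recorded since the last flush, and only then is the mark consulted and cleared.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Illustrative sketch of merge-before-sweep for histograms.
struct ChildHistogram {
  uint64_t sample_count = 0;  // samples recorded by one worker thread
  void recordValue(uint64_t) { ++sample_count; }
};

struct ParentHistogram {
  std::vector<ChildHistogram*> children;
  uint64_t merged_count = 0;
  bool used = false;

  // Flush step 1: merge to observe usage across all workers.
  void merge() {
    for (auto* child : children) {
      if (child->sample_count > 0) used = true;
      merged_count += child->sample_count;
      child->sample_count = 0;
    }
  }

  // Flush step 2: clear the mark only after merging, so a sample recorded on
  // any worker during the interval keeps the histogram alive.
  bool sweepAndClear() {
    bool keep = used;
    used = false;
    return keep;
  }
};
```

Skipping the merge (or running it after the sweep) would drop histograms that were only used on worker threads, which is why flush always merges first.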

Risk Level: low, requires explicit usage
Testing: unit and a load test with Istio Proxy
Docs Changes: none
Release Notes: none

Change-Id: If3a45283b13cfda7d4f9a7bb661a1573f552ed7e
Signed-off-by: Kuat Yessenov <kuat@google.com>
@kyessenov kyessenov requested a review from paul-r-gall July 23, 2025 23:46
@kyessenov (Contributor, Author)

Not sure who is the best person to review stats these days.
/assign-from @envoyproxy/senior-maintainers

@repokitteh-read-only (Bot)

@envoyproxy/senior-maintainers assignee is @mattklein123

🐱

Caused by: a #40395 (comment) was created by @kyessenov.


Change-Id: I6748662507d4b540076381379a26f53a924cb815
Signed-off-by: Kuat Yessenov <kuat@google.com>
@wbpcode (Member) commented Jul 24, 2025

I have a nit question: the mechanism that Istio stats already uses should work well, so why is this new PR necessary?

/wait-any

@kyessenov (Contributor, Author) commented Jul 24, 2025

@wbpcode Several issues with Istio scope rotation:

  1. It uses an unsafe pointer when switching scopes; there is an inherent race, even if it never occurs in practice.
  2. It busts the fast-data-path thread-local cache on a timer. Re-creating these fast-path stats causes more CPU spikes and usage than necessary.
  3. It is only effective when combined with the cardinality limiter, which we would have to add directly to Scope (an overflow metric). It makes sense to do both expiry and limiting in the same place.

I plan to switch Istio to per-metric expiry with a single shared bounded scope rather than per-scope expiry. That will give better control over cardinality. If each filter holds a separate scope, we can't properly bound the overall metric size.

Change-Id: Ib8d17e94db3a92c211a00e506cf6ef2bf9066c5b
@wbpcode wbpcode self-assigned this Jul 25, 2025
@wbpcode (Member) commented Jul 25, 2025

> @wbpcode Several issues with Istio scope rotation:
>
>   1. It uses an unsafe pointer when switching scopes, there is some inherent race that never occurs.
>   2. It busts the fast data path thread local cache on a timer. Re-creation of these fast path stats causes more CPU spike and use than necessary.
>   3. It is only effective when combined with the cardinality limiter, which we would have to add directly to Scope (an overflow metric). It makes sense to do both expiry and limiter in the same place.
>
> I plan to switch Istio to per-metric expiry with a single shared bounded scope rather than a per-scope expiry. That will give better controls over cardinality. If each filter holds a separate scope, we can't properly bound overall metric size.

Hmm, that's reasonable. I was initially a little against adding per-metric lifetime management because I thought it was complex and the benefit wasn't that big, but after checking your code it seems fine. Will take a deeper look at the code after a while.

cc @jmarantz cc @paul-r-gall any suggestion as the stats expert?

@jmarantz (Contributor) commented Jul 25, 2025 via email

@kyessenov (Contributor, Author)

@jmarantz I do want to use scope to control the lifetime of metrics in it. But we also don't want to lose the benefit of thread local cached stats. This is meant to be used in a shared proxy (backends are applications and labels are derived from them, applications change through the life of the proxy). In a steady state, we don't want a forced penalty caused by the whole scope deletion, so recently used stats should remain in caches. I don't see another way to do it except extending the core system. We could utilize this approach in many places, e.g. Wasm (#14070), or Istio-like metrics (#30619).

From my reading of OpenCensus, they also track usage per stat and expire them. It seems necessary for any dynamic plugin that does not have a predefined set of metrics.

Change-Id: I9c50b09b748e3164f3520bbb0d506b4f0b4916e2
Signed-off-by: Kuat Yessenov <kuat@google.com>
Change-Id: I30ecee110600b394995aadecd4548107d283739e
@mattklein123 (Member)

I don't see any reason why this won't work, but FWIW it seems pretty fragile to me in terms of how it relates to flushing. Why not just handle this logic during flushing? You can see if a metric has been used between the last flush and the current flush and then just get rid of it at that point? I've implemented something similar here: https://github.com/bitdriftlabs/shared-core/blob/main/bd-client-stats-store/src/lib.rs


@kyessenov (Contributor, Author) commented Jul 29, 2025 via email

@jmarantz (Contributor)

That change suggested by @mattklein123 seems to clarify things -- I was trying to (without reading the code) reason about a state where we've decided we might want to evict a stat, but want to leave it in the thread-local caches so it could be revived.

This also feels a bit like what @stevenzzzz did for lazy stats (#23921 and #27899 are a good starting point). He may want to look at this stuff.

I still have questions about the desired semantics. Do you want some stats that have not been recently updated to be removed at some point? I assume all stats are looked at via sinks or the Prometheus endpoint; in other words nothing will remain unobserved.

@kyessenov (Contributor, Author)

> I still have questions about desired semantics. Do you want some stats that have not been recently updated to be removed at some point? I assume all stats are looked via sinks or the prom endpoint; in other words nothing will remain unobserved.

The desired semantics are to remove stats that have not been used in the last O(minutes) from memory. We are not going to use Prometheus since it does not support delta observations, but with OTLP you can send only the difference from the last flush, so we don't need to keep aggregate counters indefinitely.
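The delta behavior described here can be sketched as follows (an illustrative model, not Envoy's API): each flush reports only the change since the previous flush, so no cumulative state needs to outlive the metric in the proxy.

```cpp
#include <cassert>
#include <cstdint>

// Illustrative sketch of why delta sinks pair well with eviction: the sink
// aggregates outside the proxy, so an evicted-and-recreated counter only
// loses a running total the backend never needed.
struct DeltaCounter {
  uint64_t value = 0;         // running total since creation
  uint64_t last_flushed = 0;  // total as of the previous flush

  void inc(uint64_t n) { value += n; }

  // Called by a delta stats sink (e.g. OTLP) at flush time.
  uint64_t flush_delta() {
    uint64_t delta = value - last_flushed;
    last_flushed = value;
    return delta;
  }
};
```

An idle interval flushes a delta of zero, which is exactly the signal the expiry logic uses to evict the metric.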

@kyessenov (Contributor, Author)

> This also feels a bit like what @stevenzzzz did for lazy stats (#23921 and #27899 are a good starting point). He may want to look at this stuff.

I think lazy creation of stats is somewhat orthogonal. We would always create stats on-demand based on traffic.

@jmarantz (Contributor)

When you say a stat has not been used you mean modified? Or in the OTLP flow is there also a way of registering observer interest in a subset of stats?

@kyessenov (Contributor, Author)

> When you say a stat has not been used you mean modified? Or in the OTLP flow is there also a way of registering observer interest in a subset of stats?

For counters, no increment happened, and for histograms, no samples were produced. I'm working with the managed OTLP implementation, where Google backends will convert back to cumulative Monarch time series (fake deltas). The key is that the state is aggregated outside the proxy.

@mattklein123 (Member)

@kyessenov one other thing that you might consider is creating some entirely new dynamic_stats thing that somehow uses shared_ptr internally so then it would look pretty much exactly like the Rust code I have. Not sure how hard that would be to bolt on top but might be better in the long run. I'm sure there are many places that might use something like this.
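The dynamic-stats idea above could look roughly like this hypothetical sketch (not Envoy's actual API, and only in the spirit of the Rust code referenced): callers hold shared_ptr handles, the store keeps only weak references, and flush-time cleanup drops entries whose handles have all been released.

```cpp
#include <cassert>
#include <cstdint>
#include <memory>
#include <string>
#include <unordered_map>

// Illustrative sketch of a shared_ptr-based dynamic stat store.
struct DynCounter {
  uint64_t value = 0;
  void inc() { ++value; }
};

class DynamicStats {
public:
  std::shared_ptr<DynCounter> counter(const std::string& name) {
    auto& weak = counters_[name];
    if (auto existing = weak.lock()) return existing;
    auto fresh = std::make_shared<DynCounter>();
    weak = fresh;
    return fresh;
  }

  // Flush-time cleanup: drop entries whose handles have all been released.
  size_t prune() {
    size_t removed = 0;
    for (auto it = counters_.begin(); it != counters_.end();) {
      if (it->second.expired()) {
        it = counters_.erase(it);
        ++removed;
      } else {
        ++it;
      }
    }
    return removed;
  }

  size_t size() const { return counters_.size(); }

private:
  // Weak references only, so the store never extends a metric's lifetime.
  std::unordered_map<std::string, std::weak_ptr<DynCounter>> counters_;
};
```

Here lifetime is tied to the caller's handle rather than to observed usage, which is the design trade-off against the mark-and-sweep approach in this PR.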

Change-Id: Iccf33fd3a693118560a7adb359dbc340fff69d06
Change-Id: Iacf3f56f29baa7e09fe443001066ce27114f812b
Change-Id: I8ee694725cd151baf80b6760cf687117c502546a
Signed-off-by: Kuat Yessenov <kuat@google.com>
@kyessenov (Contributor, Author)

> @kyessenov one other thing that you might consider is creating some entirely new dynamic_stats thing that somehow uses shared_ptr internally so then it would look pretty much exactly like the Rust code I have. Not sure how hard that would be to bolt on top but might be better in the long run. I'm sure there are many places that might use something like this.

It's fairly difficult because of the pervasive use of stats macros. I think we can gradually transition scope-by-scope though. Updated the code to evict during flush.

Change-Id: Ia79d583992e9760b0ae14e691019249c27c7528c
@mattklein123 (Member) left a comment

lgtm

/wait

Comment thread on source/common/stats/allocator_impl.cc (outdated):

      void sub(uint64_t amount) override {
        ASSERT(child_value_ >= amount);
    -   ASSERT(used() || amount == 0);
    +   // ASSERT(used() || amount == 0);
@mattklein123 (Member):
?

@kyessenov (Contributor, Author):

Reverted. I changed it so that only evictable scope metrics are marked as unused.

Change-Id: Iec9a7db939baed48cfcc52119d331002c4d662fe
Signed-off-by: Kuat Yessenov <kuat@google.com>
@repokitteh-read-only repokitteh-read-only Bot added api and removed waiting labels Aug 5, 2025
@repokitteh-read-only (Bot)

CC @envoyproxy/api-shepherds: Your approval is needed for changes made to (api/envoy/|docs/root/api-docs/).
envoyproxy/api-shepherds assignee is @mattklein123
CC @envoyproxy/api-watchers: FYI only for changes made to (api/envoy/|docs/root/api-docs/).

🐱

Caused by: #40395 was synchronize by kyessenov.


Change-Id: I11d3b854d1197d90482c7f8ef81768555edcd337
Signed-off-by: Kuat Yessenov <kuat@google.com>
@kyessenov (Contributor, Author)

Added a changelog and an API to enable this feature. Tested with a modified Istio Proxy that puts a random number into the Istio mesh metrics. Ran it for 10 minutes with 64 cores and a load of 300 qps; allocated memory use during the load is ~70MB and drops to ~15MB after the load ends.

Change-Id: Ie035c25d3900eab69b4317649e28f12c32189daa
Signed-off-by: Kuat Yessenov <kuat@google.com>
@kyessenov (Contributor, Author)

@wbpcode are you good with merging this?

@jmarantz (Contributor)

I just wanted to say that I looked over this and it looks really well done, thanks @kyessenov !

I'm just wondering about expected usage model and avoiding holding stats by reference.

@kyessenov (Contributor, Author)

> I just wanted to say that I looked over this and it looks really well done, thanks @kyessenov !
>
> I'm just wondering about expected usage model and avoiding holding stats by reference.

The filter implementation would immediately apply a counter operation, e.g. https://github.com/istio/proxy/blob/master/source/extensions/filters/http/istio_stats/istio_stats.cc#L752. We can add a macro or special convenience functions, but we can't hide existing interfaces unfortunately.
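The pattern of applying the operation immediately (instead of holding a stat by reference across requests) can be sketched like this; the names below are hypothetical, not the actual Istio/Envoy API:

```cpp
#include <cassert>
#include <cstdint>
#include <memory>
#include <string>
#include <unordered_map>

// Illustrative sketch: resolve the counter by name and increment in one
// step on every request. A cached Counter& held across requests could
// dangle once the metric is evicted from the scope.
struct Counter {
  uint64_t value = 0;
  void add(uint64_t n) { value += n; }
};

struct Scope {
  std::unordered_map<std::string, std::shared_ptr<Counter>> counters;
  Counter& counterFromString(const std::string& name) {
    auto& c = counters[name];
    if (!c) c = std::make_shared<Counter>();
    return *c;
  }
};

// Safe usage: look up and apply immediately, never store the reference.
void recordRequest(Scope& scope, const std::string& metric_name) {
  scope.counterFromString(metric_name).add(1);
}
```

The lookup cost per request is the trade-off for eviction safety, which is why the thread-local caches in this PR matter for keeping the fast path fast.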

@wbpcode wbpcode merged commit 45c3745 into envoyproxy:main Aug 15, 2025
26 checks passed
@ringerc commented Dec 3, 2025

Looks like this landed in v1.36.0
