Merged

stats: metric expiry #40395

CI (Envoy) / Mobile/ASAN skipped Aug 5, 2025 in 0s

Check was skipped

This check was not triggered in this CI run

Details

Request (pr/40395/main@6726d83)

kyessenov @kyessenov 6726d83 #40395 merge main@413c63f

stats: metric expiry

Change-Id: If3a45283b13cfda7d4f9a7bb661a1573f552ed7e
Commit Message: Introduce mark and sweep eviction of stale metrics in a stats scope.

Additional Description: The intended use case is the high cardinality metrics generated from the request data (e.g. Istio standard metrics). This in combination with the cardinality bounds (future PR) would ensure bounded metric resource usage. The algorithm works as follows:

  1. An "evictable" scope is allocated by a filter.
  2. A delta stats sink is configured, e.g. OTLP.
  3. At every flush interval, a scope metric that has been used since the previous flush (e.g. has observed a data point) is reset to unused. A metric that has not been used since the previous flush is deleted from the central caches.
  4. A notification is sent to all workers to purge the scope's stale metrics from their thread-local caches.
  5. Once all workers complete, the unused metrics are purged from the allocator.
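
The steps above can be illustrated with a small single-threaded model. This is a sketch only, not Envoy's C++ implementation: `Counter`, `EvictableScope`, `worker_caches`, and `flush` are hypothetical names chosen for the example, and worker notification is modeled as a plain loop rather than a cross-thread post.

```python
class Counter:
    """Hypothetical counter that sets a 'used' mark on every increment."""

    def __init__(self, name):
        self.name = name
        self.value = 0
        self.used = False

    def inc(self, n=1):
        self.value += n
        self.used = True  # mark: a data point was observed


class EvictableScope:
    """Toy stand-in for a scope with a central cache and per-worker caches."""

    def __init__(self, num_workers):
        self.central = {}  # name -> Counter
        self.worker_caches = [{} for _ in range(num_workers)]

    def counter(self, worker, name):
        cache = self.worker_caches[worker]
        if name not in cache:
            # Cache miss: look up or create the metric in the central cache.
            cache[name] = self.central.setdefault(name, Counter(name))
        return cache[name]

    def flush(self):
        """One flush interval, covering steps 3 to 5 of the description."""
        stale = []
        for name, metric in list(self.central.items()):
            if metric.used:
                metric.used = False     # used since last flush: clear the mark
            else:
                del self.central[name]  # unused: evict from the central cache
                stale.append(name)
        # Step 4: every worker drops its cached entries for stale metrics.
        for cache in self.worker_caches:
            for name in stale:
                cache.pop(name, None)
        # Step 5: only at this point would the allocator free the storage.
        return stale


scope = EvictableScope(num_workers=2)
scope.counter(0, "requests.a").inc()
scope.counter(1, "requests.b").inc()
first = scope.flush()   # both metrics were used: marks cleared, nothing evicted
scope.counter(0, "requests.a").inc()
second = scope.flush()  # "requests.b" was idle for a full interval
```

In this model a metric survives at least one full flush interval after its last use, because the first flush only clears the mark and the second flush performs the eviction.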

There are several edge conditions that need to be explained to validate correctness of this algorithm:

  1. A worker attempting to use a stale metric after (3) but before (4) might have its data lost. The data will not be lost if 1) the same metric is recreated in the central cache by another worker, since all metrics are uniquely indexed in the allocators; or 2) we implement deferred allocator deletions to await the flush operation.

  2. A worker should not use a stored stale metric after (4). This requires that workers not store the metrics by reference (hence, this solution will not work for most xDS metrics). Thread-local cache references are always deleted before the storage is deleted.

  3. Histograms are handled slightly differently because the parent histogram needs to be "merged" to observe usage, and clearing the usage requires updating all "child" histograms. Because we do this during flush, merging is always done first.

  4. A metric that is re-created after eviction would keep the start time of the original metric. This is a limitation of Envoy, which does not store metric start times, but it is not an issue with delta aggregation in OTLP. Delta is the recommended protocol for handling high cardinality or sparse metric data. We could add start_time in a follow-up.
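
The histogram case in (3) can be sketched the same way. Again this is a toy model under stated assumptions, not Envoy's API: `ParentHistogram`, `ChildHistogram`, `merge`, and `clear_usage` are hypothetical names, and the point is only the ordering, where the merge runs before usage is checked and clearing usage touches every child.

```python
class ChildHistogram:
    """Hypothetical per-worker histogram that buffers samples until merged."""

    def __init__(self):
        self.samples = []
        self.used = False

    def record(self, value):
        self.samples.append(value)
        self.used = True


class ParentHistogram:
    """Hypothetical parent that only observes usage after merging children."""

    def __init__(self, num_workers):
        self.children = [ChildHistogram() for _ in range(num_workers)]
        self.interval = []  # samples merged during the current flush
        self.used = False

    def merge(self):
        self.interval = []
        for child in self.children:
            self.interval.extend(child.samples)
            child.samples = []
            self.used = self.used or child.used

    def clear_usage(self):
        # Clearing usage must reset the parent and every child.
        self.used = False
        for child in self.children:
            child.used = False


def flush(histograms):
    """Merge first, then mark and sweep; returns the names evicted."""
    evicted = []
    for name, hist in list(histograms.items()):
        hist.merge()  # must happen before usage is checked
        if hist.used:
            hist.clear_usage()
        else:
            del histograms[name]
            evicted.append(name)
    return evicted


histograms = {"latency": ParentHistogram(num_workers=2)}
histograms["latency"].children[1].record(12.5)
kept = flush(histograms)     # the sample is found via merge: histogram survives
dropped = flush(histograms)  # no samples for a full interval: evicted
```

If the usage check ran before the merge, the parent in this model would look unused even though a child had recorded a sample, which is why the ordering matters.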

Risk Level: low, requires explicit usage
Testing: unit and a load test with Istio Proxy
Docs Changes: none
Release Notes: none

Environment

Request variables

Key            Value
ref            daf6694
sha            6726d83
pr             40395
base-sha       413c63f
actor          kyessenov @kyessenov
message        stats: metric expiry...
started        1754433981.816183
target-branch  main
trusted        false

Build image

Container image/s (as used in this CI run)

Key      Value
default  envoyproxy/envoy-build-ubuntu:f4a881a1205e8e6db1a57162faf3df7aed88eae8
mobile   envoyproxy/envoy-build-ubuntu:mobile-f4a881a1205e8e6db1a57162faf3df7aed88eae8

Version

Envoy version (as used in this CI run)

Key    Value
major  1
minor  36
patch  0
dev    true