stats: metric expiry #40395
Check was skipped
This check was not triggered in this CI run
Request (pr/40395/main@6726d83)
@kyessenov
6726d83 #40395
merge main@413c63f
stats: metric expiry
Change-Id: If3a45283b13cfda7d4f9a7bb661a1573f552ed7e
Commit Message: Introduce mark-and-sweep eviction of stale metrics in a stats scope.
Additional Description: The intended use case is high-cardinality metrics generated from request data (e.g. Istio standard metrics). In combination with cardinality bounds (future PR), this would ensure bounded metric resource usage. The algorithm works as follows:
1. An "evictable" scope is allocated by a filter.
2. A delta stats sink is configured, e.g. OTLP.
3. At every flush interval, a scope metric that has been used (e.g. has observed a data point) is marked as unused. A metric that has not been used since the previous flush is deleted from the central caches.
4. A notification is sent to all workers to purge the scope's stale metrics from their thread-local caches.
5. Once all workers complete, the unused metrics are purged from the allocator.
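
The steps above can be sketched in a few lines of C++. This is a minimal, single-threaded illustration under stated assumptions, not Envoy's implementation: `Counter`, `EvictableScope`, and `WorkerCache` are hypothetical names, the "allocator" is modelled by `std::shared_ptr` ownership, and the cross-thread notification is reduced to a direct call to `purge()`.

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <memory>
#include <string>
#include <unordered_map>
#include <vector>

// Hypothetical counter: a value plus a "used since last flush" mark.
struct Counter {
  std::atomic<uint64_t> value{0};
  std::atomic<bool> used{false};
  void inc() {
    value.fetch_add(1, std::memory_order_relaxed);
    used.store(true, std::memory_order_relaxed);
  }
};
using CounterPtr = std::shared_ptr<Counter>;

// Hypothetical central cache of one evictable scope.
class EvictableScope {
public:
  // Looks up a counter by name, creating it in the central cache if missing.
  CounterPtr counter(const std::string& name) {
    CounterPtr& slot = central_[name];
    if (!slot) slot = std::make_shared<Counter>();
    return slot;
  }

  // Runs at each flush: a used metric has its mark cleared and is kept;
  // an unused metric is removed from the central cache. Returns the
  // evicted names so that workers can purge their local caches.
  std::vector<std::string> flush() {
    std::vector<std::string> evicted;
    for (auto it = central_.begin(); it != central_.end();) {
      assert(it->second != nullptr);
      if (it->second->used.exchange(false)) {
        ++it;
      } else {
        evicted.push_back(it->first);
        it = central_.erase(it);
      }
    }
    return evicted;
  }

  std::size_t size() const { return central_.size(); }

private:
  std::unordered_map<std::string, CounterPtr> central_;
};

// Hypothetical per-worker cache. Callers use the returned reference
// immediately and never store it across a flush.
class WorkerCache {
public:
  explicit WorkerCache(EvictableScope& scope) : scope_(scope) {}

  Counter& counter(const std::string& name) {
    CounterPtr& slot = local_[name];
    if (!slot) slot = scope_.counter(name);
    return *slot;
  }

  // Drops local references; the storage is freed once the last
  // shared_ptr owner (central or local) is gone.
  void purge(const std::vector<std::string>& names) {
    for (const std::string& name : names) local_.erase(name);
  }

  std::size_t size() const { return local_.size(); }

private:
  EvictableScope& scope_;
  std::unordered_map<std::string, CounterPtr> local_;
};
```

In this sketch a metric survives a flush only if it was incremented since the previous flush, so a metric must be idle for one full interval before it is evicted. Because the worker holds its own `shared_ptr`, the storage outlives the central-cache deletion until `purge()` runs, which mirrors the ordering requirement that thread-local references are dropped before the storage is deleted.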
There are several edge conditions that need to be explained to validate correctness of this algorithm:
- A worker attempting to use a stale metric after the central-cache deletion (3) but before the thread-local purge (4) might have its data lost. The data will not be lost if 1) the same metric is recreated in the central cache by another worker, since all metrics are uniquely indexed in the allocators; or 2) we implement deferred allocator deletions that await the flush operation.
- A worker should not use a stored stale metric after the thread-local purge (4). This requires that workers not store metrics by reference (hence, this solution will not work for most xDS metrics). Thread-local cache references are always deleted before the storage is deleted.
- Histograms are handled slightly differently because the parent histogram needs to be "merged" to observe usage, and clearing the usage requires updating all "children" histograms. Because we do this during flush, merging is always done first.
- A metric that is re-created after eviction would continue to have its start time set to that of the original metric. This is a limitation of Envoy, since it does not store metric start times, but it is not an issue with delta aggregation in OTLP. Delta is the recommended protocol for handling high-cardinality or sparse metric data. We could add start_time in a follow-up.
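
The histogram edge condition (merge first, then decide on eviction) can be illustrated with a small sketch. The types below are hypothetical and reduce a histogram to a per-interval sample count; they are not Envoy's histogram classes, and the sketch assumes that merging is what resets each child's interval data.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical per-worker histogram: tracks only how many samples
// were recorded in the current flush interval.
struct ChildHistogram {
  uint64_t interval_samples = 0;
  void record(uint64_t /*value*/) { ++interval_samples; }
};

// Hypothetical parent histogram that aggregates its children.
struct ParentHistogram {
  std::vector<ChildHistogram*> children;
  uint64_t merged_interval_samples = 0;

  // Folds every child's interval into the parent and resets the child,
  // which also clears the per-child usage for the next interval.
  void merge() {
    merged_interval_samples = 0;
    for (ChildHistogram* child : children) {
      assert(child != nullptr);
      merged_interval_samples += child->interval_samples;
      child->interval_samples = 0;
    }
  }

  bool usedInLastInterval() const { return merged_interval_samples > 0; }
};

// Flush step: usage is only observable on the parent after a merge,
// so the merge runs before the eviction decision.
inline bool shouldEvictAtFlush(ParentHistogram& parent) {
  parent.merge();
  return !parent.usedInLastInterval();
}
```

If the eviction check ran before the merge, a histogram whose samples were still sitting in a child would look unused and be evicted with data pending.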
Risk Level: low, requires explicit usage
Testing: unit and a load test with Istio Proxy
Docs Changes: none
Release Notes: none
Environment
Request variables
| Key | Value |
|---|---|
| ref | daf6694 |
| sha | 6726d83 |
| pr | 40395 |
| base-sha | 413c63f |
| actor | |
| message | stats: metric expiry... |
| started | 1754433981.816183 |
| target-branch | main |
| trusted | false |
Build image
Container image/s (as used in this CI run)
| Key | Value |
|---|---|
| default | envoyproxy/envoy-build-ubuntu:f4a881a1205e8e6db1a57162faf3df7aed88eae8 |
| mobile | envoyproxy/envoy-build-ubuntu:mobile-f4a881a1205e8e6db1a57162faf3df7aed88eae8 |
Version
Envoy version (as used in this CI run)
| Key | Value |
|---|---|
| major | 1 |
| minor | 36 |
| patch | 0 |
| dev | true |