StatsAccessLogger: fixes connection gauge underflow crashes when decrementing metrics after Scope evictions. by TAOXUY · Pull Request #43812 · envoyproxy/envoy

TAOXUY · 2026-03-06T05:38:15Z

Description: Fixes connection gauge underflow crashes in the Stats Access Logger when decrementing metrics after Scope evictions.

The original code correctly attempted to prevent "zombie" gauges by re-resolving metrics against the central store (via scope_->gaugeFromStatNameWithTags) during request destruction. However, it tried to reconstruct the gauge's identity using gauge_->tagExtractedStatName(). This failed because dynamic access-log tags (like %REQUEST_HEADER(...)%) are not registered with Envoy's global extractors. The extraction process returned a mangled base name and empty tags, forcing Scope to create a new 0-valued gauge. Subtracting 1 from that new gauge immediately crashed Envoy with a gauge underflow.
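The failure mode described above can be sketched with a toy model. This is a minimal illustration, not Envoy's code: ToyGauge, ToyStore, gaugeFromNameWithTags and trySub are hypothetical stand-ins, and the real gauge asserts on underflow instead of returning false.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <map>
#include <memory>
#include <string>
#include <utility>
#include <vector>

// Toy stand-ins for illustration only; these are NOT Envoy's classes.
using Tags = std::vector<std::pair<std::string, std::string>>;

struct ToyGauge {
  std::uint64_t value = 0;

  // Returns false instead of crashing when the subtraction would underflow.
  bool trySub(std::uint64_t amount) {
    if (amount > value) {
      return false;
    }
    value -= amount;
    return true;
  }
};

class ToyStore {
public:
  // Looks up a gauge by (name, tags), creating a zero-valued one if absent.
  std::shared_ptr<ToyGauge> gaugeFromNameWithTags(const std::string& name, const Tags& tags) {
    assert(!name.empty()); // a gauge needs a base name
    std::shared_ptr<ToyGauge>& slot = gauges_[std::make_pair(name, tags)];
    if (!slot) {
      slot = std::make_shared<ToyGauge>();
    }
    return slot;
  }

  std::size_t size() const { return gauges_.size(); }

private:
  std::map<std::pair<std::string, Tags>, std::shared_ptr<ToyGauge>> gauges_;
};
```

If a request increments the gauge under its real name and tags but later looks it up under a flattened name with empty tags, the toy store hands back a second, zero-valued gauge, and trySub(1) on it fails while the first gauge keeps its count.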

Fix: keep the gauge in the scope cache while its value is non-zero.

Risk Level: Low

Testing: Added StatsAccessLogIntegrationTest.ActiveRequestsGaugeScopeEviction, which synthetically forces an asynchronous scope eviction while a connection is still inflight. Verified that the gauge successfully decrements to 0 in the central store identically to a normal request finish.

Docs: NA

Release: NA

Platform Specific Features: no

Signed-off-by: Xuyang Tao <taoxuy@google.com>

ggreenway

I don't think your fix is quite right.

I ran the integration test you added without your code changes, and it fails in an assertion ASSERT(used() || amount == 0); in sub(). I think either the assertion is no longer valid in the case of evicted stats, or the stat is being set to unused incorrectly.

      if (scope->evictable_) {
        MetricBag metrics(scope->scope_id_);
        CentralCacheEntrySharedPtr& central_cache = scope->centralCacheMutableNoThreadAnalysis();
        auto filter_unused = []<typename T>(StatNameHashMap<T>& unused_metrics) {
          return [&unused_metrics](std::pair<StatName, T> kv) {
            const auto& [name, metric] = kv;
            if (metric->used()) {
              metric->markUnused();
              return false;
            } else {
              unused_metrics.try_emplace(name, metric);
              return true;
            }
          };
        };

The above code assumes that a stat is only ever held by a single scope (or other holder of a reference), which isn't correct. cc @kyessenov .

I think the use of std::min around all the sub() calls means that it's likely the counter could be incorrect. Even if this change prevents it from going negative, I think it is still an incorrect count.

/wait
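The concern about std::min can be illustrated with a small sketch. It assumes the scenario from the description, where increments land on one gauge object and decrements land on a freshly created one; clampedSub and SplitGauges are hypothetical names, not code from this PR.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>

// Hypothetical helper mirroring the std::min pattern under discussion:
// it subtracts at most `value`, so the result never wraps below zero.
inline std::uint64_t clampedSub(std::uint64_t value, std::uint64_t amount) {
  const std::uint64_t result = value - std::min(value, amount);
  assert(result <= value); // clamping guarantees no wrap-around
  return result;
}

// Models two gauge objects that should have been one: `original` received the
// increments, `reresolved` is the fresh zero-valued gauge receiving decrements.
struct SplitGauges {
  std::uint64_t original = 0;
  std::uint64_t reresolved = 0;

  void onRequestStart() { ++original; }
  void onRequestEnd() { reresolved = clampedSub(reresolved, 1); }
};
```

After two requests start and finish, `original` still reads 2 and `reresolved` reads 0. Nothing crashes, but the reported count no longer matches the number of in-flight requests, which is the incorrectness the comment points at.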

When evicting unused stats from the central cache, we need to ensure that gauges actively referenced by components like AccessLogState are not evicted. The use_count() > 1 check is meant to guarantee this, but a previous bug in evictUnused defeated it: the lambda parameter std::pair<StatName, T> kv was taken by value, and the resulting copy artificially inflated use_count(). This broke eviction entirely across the codebase.

This commit fixes evictUnused by taking const auto& kv by reference, which avoids the copy and lets the use_count() > 1 safeguard apply correctly. Furthermore, AccessLogState now holds a GaugeSharedPtr in its State struct, so its active references prevent premature eviction by evictUnused. The erroneous std::min safeguard during gauge subtractions is also removed, since AccessLogState gauges will no longer be cleared while still in use.

Signed-off-by: Xuyang Tao <taoxuy@google.com>
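The use_count() inflation mentioned in the commit message can be reproduced with standard containers. The sketch below uses std::map and std::shared_ptr<int> as stand-ins (an assumption; the PR's code uses StatNameHashMap and Envoy's own metric pointer types), and the helper names are hypothetical.

```cpp
#include <cassert>
#include <map>
#include <memory>
#include <string>
#include <utility>

using Metric = std::shared_ptr<int>;
using MetricMap = std::map<std::string, Metric>;

// The lambda takes its argument by value. Converting the map's
// pair<const std::string, Metric> into pair<std::string, Metric> copies the
// shared_ptr, so use_count() is one higher inside the lambda.
inline long useCountSeenByValue(const MetricMap& metrics, const std::string& name) {
  const auto it = metrics.find(name);
  assert(it != metrics.end());
  auto observe = [](std::pair<std::string, Metric> kv) { return kv.second.use_count(); };
  return observe(*it);
}

// The lambda takes its argument by const reference: no copy is made, so
// use_count() reflects only the real owners.
inline long useCountSeenByRef(const MetricMap& metrics, const std::string& name) {
  const auto it = metrics.find(name);
  assert(it != metrics.end());
  auto observe = [](const auto& kv) { return kv.second.use_count(); };
  return observe(*it);
}
```

With a single owner (the map), the by-value version reports 2 and the by-reference version reports 1, so a `use_count() > 1` test gives different answers depending only on how the parameter is declared.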

TAOXUY · 2026-03-08T19:47:29Z

Updated with an interface to disable eviction per metric. We need to keep the gauge from being evicted from the scope so that it can be looked up again and the inc/dec applied to the same gauge. @kyessenov

TAOXUY · 2026-03-09T20:51:02Z

/retest

source/common/stats/allocator_impl.cc

ggreenway · 2026-03-10T22:30:36Z

Here's an idea for another approach: add a new method to a scope to add a stat to the scope by its GaugeSharedPtr. Then in the destructor of the FilterState, you can just directly re-add the existing gauge into the scope, without needing its name/tag components.

TAOXUY · 2026-03-11T02:56:01Z

CMIIW, if gauge is evictable, it cannot be dec/inc. We need the central_cache in scope to hold the gauge for concurrent access.

Imagine a gauge is incremented and then evicted before it is decremented, while another access log accesses the same gauge by the same name and does inc/dec: the value would be corrupted.

@kyessenov

ggreenway · 2026-03-11T16:50:22Z

CMIIW, if gauge is evictable, it cannot be dec/inc. We need the central_cache in scope to hold the gauge for concurrent access.

The central store is the Store, and all scopes reference the same store. Anytime you get a metric from the scope, if the scope does not already have it, it looks in the store, so it is not possible for two scopes to have different metrics with the same name/tags.

That's why holding a reference to the stat in the FilterState makes this work: it keeps the metric and its current value from being removed from the Store.

Imagine a gauge is incremented and then evicted before it is decremented, while another access log accesses the same gauge by the same name and does inc/dec: the value would be corrupted.

In this case, because the FilterState holds a reference, both would be using the same stat for inc/dec, so the value will not be corrupted.
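The argument above can be sketched with a toy model in which the store only tracks a gauge for as long as something holds a reference to it. This is an assumption made for illustration: WeakStore, CachingScope and SharedGauge are hypothetical types, and Envoy's allocator and scopes are implemented differently.

```cpp
#include <cassert>
#include <cstdint>
#include <map>
#include <memory>
#include <string>

struct SharedGauge {
  std::uint64_t value = 0;
};
using SharedGaugePtr = std::shared_ptr<SharedGauge>;

// Toy central store: it tracks gauges weakly, so a gauge stays findable only
// while some scope or per-request state still owns it.
class WeakStore {
public:
  SharedGaugePtr getOrCreate(const std::string& key) {
    std::weak_ptr<SharedGauge>& slot = gauges_[key];
    if (SharedGaugePtr existing = slot.lock()) {
      return existing;
    }
    SharedGaugePtr created = std::make_shared<SharedGauge>();
    slot = created;
    return created;
  }

private:
  std::map<std::string, std::weak_ptr<SharedGauge>> gauges_;
};

// Toy scope: caches owning pointers and falls back to the store on a miss.
class CachingScope {
public:
  explicit CachingScope(WeakStore& store) : store_(store) {}

  SharedGaugePtr gauge(const std::string& key) {
    SharedGaugePtr& slot = cache_[key];
    if (!slot) {
      slot = store_.getOrCreate(key);
    }
    return slot;
  }

  void evictAll() { cache_.clear(); }

private:
  WeakStore& store_;
  std::map<std::string, SharedGaugePtr> cache_;
};

// Scenario from the discussion: per-request state holds the gauge across an
// eviction, and a second lookup still resolves to the same object.
inline void demoHeldReferenceSurvivesEviction() {
  WeakStore store;
  CachingScope first(store);
  CachingScope second(store);

  SharedGaugePtr held = first.gauge("active"); // kept by the request's state
  held->value += 1;                            // first request starts
  first.evictAll();                            // scope drops its cached copy

  SharedGaugePtr again = second.gauge("active");
  assert(again == held); // same gauge object, not a fresh zero-valued one
  again->value += 1;     // second request starts
  held->value -= 1;      // first request ends
  assert(again->value == 1);
}
```

In this model, if nothing holds the gauge when the scope evicts it, the next lookup creates a new gauge starting at zero, which is the situation the held reference avoids.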

TAOXUY · 2026-03-11T20:11:18Z

/retest

source/common/stats/allocator.cc

ggreenway

I like this solution; it's much cleaner and clearer.

@kyessenov can you also review this, especially the change to eviction logic?

source/extensions/access_loggers/stats/stats.cc

source/extensions/access_loggers/stats/stats.h

jmarantz · 2026-04-01T18:17:18Z

Note: the joiner is used to create the stat-name that's used for ThreadLocalStore's maps, so that is the behavior you are getting whether you use it here or not :)

Looking at it most recently, I think it should be made a little bit better; I think it will treat these three gauges the same:

Gauge 1: name="foo.bar", tags={"a":"b", "c":"d"} ---> flattened name = "foo.bar.a.b.c.d"
Gauge 2: name="foo.bar.a.b", tags={"c":"d"} ---> flattened name = "foo.bar.a.b.c.d" (the same as gauge 1)
Gauge 3: name="foo.bar.a.b.c.d", tags={} ---> same thing

I think we should probably fix that in the joiner somehow. But if you have those 3 gauges in your current system, even if GaugeKey distinguishes them, they will be the same Gauge because you are using ThreadLocalStore.

Note that this issue is not common because mostly tags are not specified when creating stats; mostly they are extracted from the name using tag-extraction regexes.
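The collision described above can be reproduced with a toy flattening function. This is a sketch only: `flatten` is a hypothetical helper that joins name, tag keys and tag values with dots the way the comment describes, and it is not Envoy's actual joiner.

```cpp
#include <cassert>
#include <string>
#include <utility>
#include <vector>

using Tags = std::vector<std::pair<std::string, std::string>>;

// Appends ".key.value" for every tag, in order, to the base name.
inline std::string flatten(const std::string& name, const Tags& tags) {
  assert(!name.empty()); // a stat needs a base name
  std::string flattened = name;
  for (const auto& tag : tags) {
    flattened += "." + tag.first + "." + tag.second;
  }
  return flattened;
}
```

All three gauges from the comment flatten to "foo.bar.a.b.c.d" under this scheme, so a map keyed on the flattened string cannot tell them apart.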

ggreenway · 2026-04-01T18:43:33Z

Oh, I understand what you're saying now. Iff the stat already exists, there's no difference because we'll find it by name, and the existing stat still has the tags it was created with.

My objection is that this is a very tight coupling between a detail of the stats system and this access logger. I think the code is more robust to stat system changes if it always calls gaugeFromStatNameWithTags instead of gaugeFromStatName.

Alternatively, in the destructor case, we could call findGauge and do nothing if it wasn't found. But I don't think we should ever call gaugeFromStatName, because the tags are essential to this working.

jmarantz · 2026-04-01T18:58:52Z

I was not suggesting we use gaugeFromStatName with the pre-concatenated name - agree that would be too tight a coupling. I was just suggesting we use the same effective hashing mechanism in the new map because (a) it is there (b) bug-for-bug compatibility seems good (c) apparently it is faster though that was not the main motivation.

ggreenway · 2026-04-01T20:59:30Z

Got it. I don't think it's actually important in this case whether this hash table uses an equivalent key hashing to the thread local stats store. In this case, it just needs to pass identical arguments to gaugeFromStatNameWithTags every time for the same logical stat (from the access logger's point of view).

Unless there's any clear bugs, I think I'd rather move forward with the GaugeKey. It seems sufficient for this purpose.

I think to use the joiner in the hash map key, we'd need to store some additional data in the value to ensure identical arguments to gaugeFromStatNameWithTags (because the output of the joiner is flattened, and we still need the unflattened version also).
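For illustration, a key that keeps the name and tags unflattened might look like the sketch below. The PR's real GaugeKey is not shown in this conversation, so the struct layout, the hash, and the recordAdd/consumeAdd helpers are all assumptions.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <functional>
#include <string>
#include <unordered_map>
#include <utility>
#include <vector>

using Tags = std::vector<std::pair<std::string, std::string>>;

// Hypothetical key: the unflattened (name, tags) pair, so the same arguments
// can be replayed on every lookup for the same logical stat.
struct GaugeKey {
  std::string name;
  Tags tags;

  bool operator==(const GaugeKey& other) const {
    return name == other.name && tags == other.tags;
  }
};

struct GaugeKeyHash {
  std::size_t operator()(const GaugeKey& key) const {
    const std::hash<std::string> hasher{};
    std::size_t seed = hasher(key.name);
    for (const auto& tag : key.tags) {
      seed = seed * 31 + hasher(tag.first);
      seed = seed * 31 + hasher(tag.second);
    }
    return seed;
  }
};

// Per-request bookkeeping of adds that still await their paired subtract.
using InflightMap = std::unordered_map<GaugeKey, std::uint64_t, GaugeKeyHash>;

inline void recordAdd(InflightMap& inflight, const GaugeKey& key) { ++inflight[key]; }

// Returns true if a matching add existed and was consumed; false means the
// caller should skip the subtract instead of underflowing the gauge.
inline bool consumeAdd(InflightMap& inflight, const GaugeKey& key) {
  const auto it = inflight.find(key);
  if (it == inflight.end()) {
    return false;
  }
  assert(it->second > 0); // entries are erased when they reach zero
  if (--it->second == 0) {
    inflight.erase(it);
  }
  return true;
}
```

Equality on the key compares name and tags separately, so keys whose flattened names would coincide remain distinct entries in this map, even if the underlying store later resolves them to one gauge.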

jmarantz · 2026-04-01T21:03:07Z

OK I don't feel that strongly about it.

ggreenway

/wait

source/extensions/access_loggers/stats/stats.cc

TAOXUY · 2026-04-01T22:45:24Z

/retest

TAOXUY · 2026-04-01T22:53:59Z

/retest

ggreenway

Mostly looks good, just a couple small things

/wait

test/extensions/access_loggers/stats/integration_test.cc

test/extensions/access_loggers/stats/stats_test.cc

source/extensions/access_loggers/stats/stats.cc

ggreenway

/wait

source/extensions/access_loggers/stats/stats.cc

ggreenway · 2026-04-02T17:49:08Z

test/extensions/access_loggers/stats/integration_test.cc

-        codec_client_ = makeHttpConnection(lookupPort("http"));
-        IntegrationStreamDecoderPtr response =
-            codec_client_->makeHeaderOnlyRequest(request_headers);
+  init(config_yaml, /*autonomous_upstream=*/false,


Use EXPECT_LOG_CONTAINS in this test

ggreenway · 2026-04-02T23:08:47Z

source/extensions/access_loggers/stats/stats.cc

+      inflight_gauges_.erase(it);
+    }
+  } else {
+    ENVOY_LOG_MISC(error, "Stats access logger gauge paired subtract was skipped due to no "


You made this unlimited (not periodic). I'm guessing you did this to make tests easier, but I think this needs to be limited. If that means some tests can't expect the log message, that's ok, as long as it's tested at least once somewhere. I'd prefer having the name in the message also, otherwise the user doesn't know which gauge config to look at.
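A rate-limited version of that log line could be structured along the lines of the sketch below. It is a standalone illustration rather than Envoy's logging macros: PeriodicLogGate and skippedSubtractMessage are hypothetical names, and the exact text after "gauge" is an assumption.

```cpp
#include <cassert>
#include <chrono>
#include <string>

// Lets a message through at most once per interval. The caller supplies
// `now`, so the gate can be driven by a fake clock in tests.
class PeriodicLogGate {
public:
  using Clock = std::chrono::steady_clock;

  explicit PeriodicLogGate(std::chrono::milliseconds interval) : interval_(interval) {
    assert(interval.count() > 0);
  }

  bool shouldLog(Clock::time_point now) {
    if (has_logged_ && now - last_logged_ < interval_) {
      return false; // still inside the quiet period
    }
    has_logged_ = true;
    last_logged_ = now;
    return true;
  }

private:
  const std::chrono::milliseconds interval_;
  bool has_logged_ = false;
  Clock::time_point last_logged_{};
};

// Builds the message with the gauge name appended, so the operator can tell
// which gauge configuration to inspect.
inline std::string skippedSubtractMessage(const std::string& gauge_name) {
  return "Stats access logger gauge paired subtract was skipped due to no "
         "corresponding add, possibly due to misconfigured events: gauge " +
         gauge_name;
}
```

A caller would check `shouldLog(PeriodicLogGate::Clock::now())` before emitting `skippedSubtractMessage(name)`. Because suppressed calls are silent, a test that expects the message has to be the first to trigger it within the interval, which matches the test-side trade-off raised in the comment.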

ggreenway

/wait

ggreenway · 2026-04-03T21:55:02Z

test/extensions/access_loggers/stats/stats_test.cc

+  // Subtract without add -> logs instead of crashing
+  EXPECT_LOG_CONTAINS("error",
+                      "Stats access logger gauge paired subtract was skipped due to no "
+                      "corresponding add, possibly due to misconfigured events",


Suggested change

-                      "corresponding add, possibly due to misconfigured events",
+                      "corresponding add, possibly due to misconfigured events: gauge",

ggreenway · 2026-04-03T21:56:12Z

test/extensions/access_loggers/stats/integration_test.cc

+  if (GetParam() == Network::Address::IpVersion::v6) {
+    return; // Skip for IPv6 due to log throttling in periodic logs as IPv4 and IPv6 run in the same
+            // process.
+  }


This isn't a good pattern. Anyone running only the ipv6 versions will not have this test run at all

ggreenway · 2026-04-03T21:56:49Z

test/extensions/access_loggers/stats/integration_test.cc

+  // In debug mode, this should assert because the subtraction is attempted for a gauge that wasn't
+  // added and DownstreamEnd evaluates access logs upon stream destruction. We wrap the entire
+  // connection flow in the death test so the parent process doesn't create a mock connection that
+  // would crash during test teardown.


This comment is outdated. It's not a death test anymore.

ggreenway · 2026-04-03T21:58:53Z

test/extensions/access_loggers/stats/integration_test.cc

+  // added and DownstreamEnd evaluates access logs upon stream destruction. We wrap the entire
+  // connection flow in the death test so the parent process doesn't create a mock connection that
+  // would crash during test teardown.
+  EXPECT_LOG_CONTAINS("error",


Given only 1 test variant will see the log line, I think it's fine to remove the EXPECT_LOG_CONTAINS here, and comment on that near the bottom of the test. stats_test.cc has an EXPECT_LOG_CONTAINS for this situation already.

fix (b8fd92e)

TAOXUY requested review from ggreenway, kyessenov and wbpcode as code owners March 6, 2026 05:38

iterate (4dd73b0)

TAOXUY changed the title ~~StatsAccessLogger:~~ StatsAccessLogger: fixes connection gauge underflow crashes when decrementing metrics after Scope evictions. Mar 6, 2026

ggreenway self-assigned this Mar 6, 2026

ggreenway requested changes Mar 6, 2026

repokitteh-read-only bot added the waiting label Mar 6, 2026

TAOXUY added 2 commits March 8, 2026 18:09
format (f44561b)

repokitteh-read-only bot removed the waiting label Mar 8, 2026

TAOXUY added 4 commits March 8, 2026 21:12
fix: restore evictionDisabled and fix test (32cc03e)
fix test (b9deaef)
fix test (8ae0f97)
fix (4e530a7)

ggreenway requested changes Mar 10, 2026
source/common/stats/allocator_impl.cc (outdated, resolved)

fix (d5f87d2)

ggreenway reviewed Mar 11, 2026
source/common/stats/allocator.cc (resolved)

ggreenway requested changes Mar 11, 2026
source/extensions/access_loggers/stats/stats.cc (outdated, resolved)
source/extensions/access_loggers/stats/stats.cc (outdated, resolved)

ggreenway assigned kyessenov Mar 11, 2026

TAOXUY added 3 commits March 11, 2026 22:22
fix (e5259b5)
fix (7e9176a)
fix (5628f8f)

iterate (06e95f5)

jmarantz reviewed Apr 1, 2026
source/extensions/access_loggers/stats/stats.h (resolved)

iterate (b1dbefb)

ggreenway requested changes Apr 1, 2026
source/extensions/access_loggers/stats/stats.cc (outdated, resolved)

repokitteh-read-only bot added the waiting label Apr 1, 2026

Revert to using gaugeKey (e7cde73)

repokitteh-read-only bot removed the waiting label Apr 1, 2026

ggreenway requested changes Apr 2, 2026
test/extensions/access_loggers/stats/integration_test.cc (outdated, resolved)
test/extensions/access_loggers/stats/stats_test.cc (outdated, resolved)
source/extensions/access_loggers/stats/stats.cc (outdated, resolved)

repokitteh-read-only bot added the waiting label Apr 2, 2026

fix assertion (9370560)

repokitteh-read-only bot removed the waiting label Apr 2, 2026

ggreenway requested changes Apr 2, 2026

repokitteh-read-only bot added the waiting label Apr 2, 2026

fix assertion (41615d5)

repokitteh-read-only bot removed the waiting label Apr 2, 2026

ggreenway requested changes Apr 2, 2026

TAOXUY added 2 commits April 2, 2026 23:36
iterate (9312b56)
fix test format (200f136)

ggreenway requested changes Apr 3, 2026

repokitteh-read-only bot added the waiting label Apr 3, 2026

iterate (b8cae91)

repokitteh-read-only bot removed the waiting label Apr 3, 2026

fix comment (9308935)
