Skip to content

StatsAccessLogger: fixes connection gauge underflow crashes when decrementing metrics after Scope evictions.#43812

Open
TAOXUY wants to merge 77 commits intoenvoyproxy:mainfrom
TAOXUY:fixStatDestructor
Open

StatsAccessLogger: fixes connection gauge underflow crashes when decrementing metrics after Scope evictions.#43812
TAOXUY wants to merge 77 commits intoenvoyproxy:mainfrom
TAOXUY:fixStatDestructor

Conversation

@TAOXUY
Copy link
Copy Markdown
Contributor

@TAOXUY TAOXUY commented Mar 6, 2026

Description: Fixes connection gauge underflow crashes in the Stats Access Logger when decrementing metrics after Scope evictions.

The original code correctly attempted to prevent "zombie" gauges by re-resolving metrics against the central store (via scope_->gaugeFromStatNameWithTags) during request destruction. However, it tried to reconstruct the gauge's identity using gauge_->tagExtractedStatName(). This failed because dynamic access-log tags (like %REQUEST_HEADER(...)%) are not registered with Envoy's global extractors. The extraction process returned a mangled base name and empty tags, forcing Scope to create a new 0-valued gauge. Subtracting 1 from it immediately crashed Envoy with a counter underflow.

Fix: we will keep the gauge in the scope cache if it is non-zero

Risk Level: Low

Testing: Added StatsAccessLogIntegrationTest.ActiveRequestsGaugeScopeEviction, which synthetically forces an asynchronous scope eviction while a connection is still inflight. Verified that the gauge successfully decrements to 0 in the central store identically to a normal request finish.

Docs: NA

Release: NA

Platform Specific Features: no

Signed-off-by: Xuyang Tao <taoxuy@google.com>
Signed-off-by: Xuyang Tao <taoxuy@google.com>
@TAOXUY TAOXUY changed the title StatsAccessLogger: StatsAccessLogger: fixes connection gauge underflow crashes when decrementing metrics after Scope evictions. Mar 6, 2026
@ggreenway ggreenway self-assigned this Mar 6, 2026
Copy link
Copy Markdown
Member

@ggreenway ggreenway left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think your fix is quite right.

I ran the integration test you added without your code changes, and it fails in an assertion ASSERT(used() || amount == 0); in sub(). I think either the assertion is no longer valid in the case of evicted stats, or the stat is being set to unused incorrectly.

      if (scope->evictable_) {
        MetricBag metrics(scope->scope_id_);
        CentralCacheEntrySharedPtr& central_cache = scope->centralCacheMutableNoThreadAnalysis();
        auto filter_unused = []<typename T>(StatNameHashMap<T>& unused_metrics) {
          return [&unused_metrics](std::pair<StatName, T> kv) {
            const auto& [name, metric] = kv;
            if (metric->used()) {
              metric->markUnused();
              return false;
            } else {
              unused_metrics.try_emplace(name, metric);
              return true;
            }
          };
        };

The above code assumes that a stat is only ever held by a single scope (or other holder of a reference), which isn't correct. cc @kyessenov .

I think the use of std::min around all the sub() calls means that it's likely the counter could be incorrect. Even if this change prevents it from going negative, I think it is still an incorrect count.

/wait

TAOXUY added 2 commits March 8, 2026 18:09
When evicting unused stats from the central cache, we need to ensure that
gauges actively referenced by components like AccessLogState are not evicted.
The use_count() > 1 check prevents this, but a previous bug in evictUnused
where the lambda parameter std::pair<StatName, T> kv was captured by value
caused artificial inflation of the use_count due to the deep copy. This broke
eviction entirely across the codebase.

This commit fixes evictUnused by taking const auto& kv by reference, avoiding
the deep copy and correctly applying the use_count() > 1 safeguard.

Furthermore, AccessLogState now properly holds a GaugeSharedPtr in its State
struct so its active references prevent premature eviction by evictUnused. The
erroneous std::min safeguard during gauge subtractions is also removed as
AccessLogState gauges will no longer be unfairly cleared.

Signed-off-by: Xuyang Tao <taoxuy@google.com>
Signed-off-by: Xuyang Tao <taoxuy@google.com>
@TAOXUY
Copy link
Copy Markdown
Contributor Author

TAOXUY commented Mar 8, 2026

I don't think your fix is quite right.

I ran the integration test you added without your code changes, and it fails in an assertion ASSERT(used() || amount == 0); in sub(). I think either the assertion is no longer valid in the case of evicted stats, or the stat is being set to unused incorrectly.

      if (scope->evictable_) {
        MetricBag metrics(scope->scope_id_);
        CentralCacheEntrySharedPtr& central_cache = scope->centralCacheMutableNoThreadAnalysis();
        auto filter_unused = []<typename T>(StatNameHashMap<T>& unused_metrics) {
          return [&unused_metrics](std::pair<StatName, T> kv) {
            const auto& [name, metric] = kv;
            if (metric->used()) {
              metric->markUnused();
              return false;
            } else {
              unused_metrics.try_emplace(name, metric);
              return true;
            }
          };
        };

The above code assumes that a stat is only ever held by a single scope (or other holder of a reference), which isn't correct. cc @kyessenov .

I think the use of std::min around all the sub() calls means that it's likely the counter could be incorrect. Even if this change prevents it from going negative, I think it is still an incorrect count.

/wait

Updated with a interface to not evict per metric. We need to keep gauge not evicted in the scope as that it can be looked-up and then dec/inc on the same gauge. @kyessenov

TAOXUY added 4 commits March 8, 2026 21:12
Signed-off-by: Xuyang Tao <taoxuy@google.com>
Signed-off-by: Xuyang Tao <taoxuy@google.com>
Signed-off-by: Xuyang Tao <taoxuy@google.com>
Signed-off-by: Xuyang Tao <taoxuy@google.com>
@TAOXUY
Copy link
Copy Markdown
Contributor Author

TAOXUY commented Mar 9, 2026

/retest

@ggreenway
Copy link
Copy Markdown
Member

Here's an idea for another approach: add a new method to a scope to add a stat to the scope by it's GaugeSharedPtr. Then in the destructor of the FilterState, you can just directly re-add the existing gauge into the scope, without needing it's name/tag components.

@TAOXUY
Copy link
Copy Markdown
Contributor Author

TAOXUY commented Mar 11, 2026

Here's an idea for another approach: add a new method to a scope to add a stat to the scope by it's GaugeSharedPtr. Then in the destructor of the FilterState, you can just directly re-add the existing gauge into the scope, without needing it's name/tag components.

CMIIW, if gauge is evictable, it cannot be dec/inc. We need the central_cache in scope to hold the gauge for concurrent access.

Imagine when a gauge is incremented and then evicted before decremented, there is another
there is another accesslog accessing the same gauge using the same name and doing inc/dec, the value would be corrupted.

@kyessenov

@ggreenway
Copy link
Copy Markdown
Member

CMIIW, if gauge is evictable, it cannot be dec/inc. We need the central_cache in scope to hold the gauge for concurrent access.

The central store is the Store, and all scopes reference the same store. Anytime you get a metric from the scope, if the scope does not already have it, it looks in the store, so it is not possible for two scopes to have different metrics with the same name/tags.

That's why holding a reference to the stat in the FilterState makes this work: it keeps the metric and it's current value from being removed from the Store.

Imagine when a gauge is incremented and then evicted before decremented, there is another there is another accesslog accessing the same gauge using the same name and doing inc/dec, the value would be corrupted.

In this case, because the FilterState holds a reference, both would be using the same stat for inc/dec, so the value will not be corrupted.

Signed-off-by: Xuyang Tao <taoxuy@google.com>
@TAOXUY
Copy link
Copy Markdown
Contributor Author

TAOXUY commented Mar 11, 2026

/retest

Copy link
Copy Markdown
Member

@ggreenway ggreenway left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I like this solution; it's much cleaner and clearer.

@kyessenov can you also review this, especially the change to eviction logic?

TAOXUY added 3 commits March 11, 2026 22:22
Signed-off-by: Xuyang Tao <taoxuy@google.com>
Signed-off-by: Xuyang Tao <taoxuy@google.com>
Signed-off-by: Xuyang Tao <taoxuy@google.com>
Signed-off-by: Xuyang Tao <taoxuy@google.com>
Signed-off-by: Xuyang Tao <taoxuy@google.com>
@jmarantz
Copy link
Copy Markdown
Contributor

jmarantz commented Apr 1, 2026

Note: the joiner is used to create the stat-name that's used for ThreadLocalStore's maps, so that is the behavior you are getting whether you use it here or not :)

Looking at it most recently I think it should be a made little bit better, I think it will treat these three gauges the same:

Gauge 1:   name="foo.bar", tags={"a":"b", "c":d"} ---> flattened name = "foo.bar.a.b.c.d"
Gauge 2:  name="foo.bar.a.b", tags={"c:d"} --> flattned name = "foo.bar.a.b.c.d" -- will be the same as gauge 1
Gauge 3:   name="foo.bar.a.b.c.d", tags={} --> same thing

I think we should probably fix that in the joiner somehow. But if you have those 3 gauges in your current system, even if GaugeKey distinguishes them, they will be the same Gauge because you are using ThreadLocalStore.

Note that this issue is not common because mostly tags are not specified when creatin stats, mostly they are extracted from the name using tag-extraction regexes.

@ggreenway
Copy link
Copy Markdown
Member

Oh, I understand what you're saying now. Iff the stat already exists, there's no difference because we'll find it by name, and the existing stat still has the tags it was created with.

My objection is that this is a very tight coupling between a detail of the stats system and this access logger. I think the code is more robust to stat system changes if it always calls gaugeFromStatNameWithTags instead of gaugeFromStatName.

Alternatively, in the destructor case, we could call findGauge and do nothing if it wasn't found. But I don't think we should ever call gaugeFromStatName, because the tags are essential to this working.

@jmarantz
Copy link
Copy Markdown
Contributor

jmarantz commented Apr 1, 2026

I was not suggesting we use gaugeFromStatName with the pre-concatenated name - agree that would be too tight a coupling. I was just suggesting we use the same effective hashing mechanism in the new map because (a) it is there (b) bug-for-bug compatibility seems good (c) apparently it is faster though that was not the main motivation.

@ggreenway
Copy link
Copy Markdown
Member

Got it. I don't think it's actually important in this case whether this hash table uses an equivalent key hashing to the thread local stats store. In this case, it just needs to pass identical arguments to gaugeFromStatNameWithTags everytime for the same logical stat (from the access logger's point of view).

Unless there's any clear bugs, I think I'd rather move forward with the GaugeKey. It seems sufficient for this purpose.

I think to use the joiner in the hash map key, we'd need to store some additional data in the value to ensure identical arguments to gaugeFromStatNameWithTags (because the output of the joiner is flattened, and we still need the unflattened version also).

@jmarantz
Copy link
Copy Markdown
Contributor

jmarantz commented Apr 1, 2026

OK I don't feel that strongly about it.

Copy link
Copy Markdown
Member

@ggreenway ggreenway left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

/wait

Signed-off-by: Xuyang Tao <taoxuy@google.com>
@TAOXUY
Copy link
Copy Markdown
Contributor Author

TAOXUY commented Apr 1, 2026

/retest

1 similar comment
@TAOXUY
Copy link
Copy Markdown
Contributor Author

TAOXUY commented Apr 1, 2026

/retest

Copy link
Copy Markdown
Member

@ggreenway ggreenway left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Mostly looks good, just a couple small things

/wait

Signed-off-by: Xuyang Tao <taoxuy@google.com>
Copy link
Copy Markdown
Member

@ggreenway ggreenway left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

/wait

codec_client_ = makeHttpConnection(lookupPort("http"));
IntegrationStreamDecoderPtr response =
codec_client_->makeHeaderOnlyRequest(request_headers);
init(config_yaml, /*autonomous_upstream=*/false,
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Use EXPECT_LOG_CONTAINS in this test

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

Signed-off-by: Xuyang Tao <taoxuy@google.com>
inflight_gauges_.erase(it);
}
} else {
ENVOY_LOG_MISC(error, "Stats access logger gauge paired subtract was skipped due to no "
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You made this unlimited (not periodic). I'm guessing you did this to make tests easier, but I think this needs to be limited. If that means some tests can't expect the log message, that's ok, as long as it's tested at least once somewhere. I'd prefer having the name in the message also, otherwise the user doesn't know which gauge config to look at.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

TAOXUY added 2 commits April 2, 2026 23:36
Signed-off-by: Xuyang Tao <taoxuy@google.com>
Signed-off-by: Xuyang Tao <taoxuy@google.com>
Copy link
Copy Markdown
Member

@ggreenway ggreenway left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

/wait

// Subtract without add -> logs instead of crashing
EXPECT_LOG_CONTAINS("error",
"Stats access logger gauge paired subtract was skipped due to no "
"corresponding add, possibly due to misconfigured events",
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
"corresponding add, possibly due to misconfigured events",
"corresponding add, possibly due to misconfigured events: gauge",

Comment on lines +233 to +236
if (GetParam() == Network::Address::IpVersion::v6) {
return; // Skip for IPv6 due to log throttling in periodic logs as IPv4 and IPv6 run in the same
// process.
}
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This isn't a good pattern. Anyone running only the ipv6 versions will not have this test run at all

Comment on lines +264 to +267
// In debug mode, this should assert because the subtraction is attempted for a gauge that wasn't
// added and DownstreamEnd evaluates access logs upon stream destruction. We wrap the entire
// connection flow in the death test so the parent process doesn't create a mock connection that
// would crash during test teardown.
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This comment is outdated. It's not a death test anymore.

// added and DownstreamEnd evaluates access logs upon stream destruction. We wrap the entire
// connection flow in the death test so the parent process doesn't create a mock connection that
// would crash during test teardown.
EXPECT_LOG_CONTAINS("error",
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Given only 1 test variant will see the log line, I think it's fine to remove the EXPECT_LOG_CONTAINS here, and comment on that near the bottom of the test. stats_test.cc has an EXPECT_LOG_CONTAINS for this situation already.

Signed-off-by: Xuyang Tao <taoxuy@google.com>
Signed-off-by: Xuyang Tao <taoxuy@google.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants