StatsAccessLogger: fixes connection gauge underflow crashes when decrementing metrics after Scope evictions.#43812
StatsAccessLogger: fixes connection gauge underflow crashes when decrementing metrics after Scope evictions.#43812TAOXUY wants to merge 77 commits intoenvoyproxy:mainfrom
Conversation
ggreenway
left a comment
There was a problem hiding this comment.
I don't think your fix is quite right.
I ran the integration test you added without your code changes, and it fails in an assertion ASSERT(used() || amount == 0); in sub(). I think either the assertion is no longer valid in the case of evicted stats, or the stat is being set to unused incorrectly.
if (scope->evictable_) {
MetricBag metrics(scope->scope_id_);
CentralCacheEntrySharedPtr& central_cache = scope->centralCacheMutableNoThreadAnalysis();
auto filter_unused = []<typename T>(StatNameHashMap<T>& unused_metrics) {
return [&unused_metrics](std::pair<StatName, T> kv) {
const auto& [name, metric] = kv;
if (metric->used()) {
metric->markUnused();
return false;
} else {
unused_metrics.try_emplace(name, metric);
return true;
}
};
};
The above code assumes that a stat is only ever held by a single scope (or other holder of a reference), which isn't correct. cc @kyessenov .
I think the use of std::min around all the sub() calls means that it's likely the counter could be incorrect. Even if this change prevents it from going negative, I think it is still an incorrect count.
/wait
When evicting unused stats from the central cache, we need to ensure that gauges actively referenced by components like AccessLogState are not evicted. The use_count() > 1 check prevents this, but a previous bug in evictUnused where the lambda parameter std::pair<StatName, T> kv was captured by value caused artificial inflation of the use_count due to the deep copy. This broke eviction entirely across the codebase. This commit fixes evictUnused by taking const auto& kv by reference, avoiding the deep copy and correctly applying the use_count() > 1 safeguard. Furthermore, AccessLogState now properly holds a GaugeSharedPtr in its State struct so its active references prevent premature eviction by evictUnused. The erroneous std::min safeguard during gauge subtractions is also removed as AccessLogState gauges will no longer be unfairly cleared. Signed-off-by: Xuyang Tao <taoxuy@google.com>
Updated with a interface to not evict per metric. We need to keep gauge not evicted in the scope as that it can be looked-up and then dec/inc on the same gauge. @kyessenov |
|
/retest |
|
Here's an idea for another approach: add a new method to a scope to add a stat to the scope by it's GaugeSharedPtr. Then in the destructor of the FilterState, you can just directly re-add the existing gauge into the scope, without needing it's name/tag components. |
CMIIW, if gauge is evictable, it cannot be Imagine when a gauge is incremented and then evicted before decremented, there is another |
The central store is the That's why holding a reference to the stat in the FilterState makes this work: it keeps the metric and it's current value from being removed from the
In this case, because the FilterState holds a reference, both would be using the same stat for inc/dec, so the value will not be corrupted. |
|
/retest |
ggreenway
left a comment
There was a problem hiding this comment.
I like this solution; it's much cleaner and clearer.
@kyessenov can you also review this, especially the change to eviction logic?
|
Note: the joiner is used to create the stat-name that's used for ThreadLocalStore's maps, so that is the behavior you are getting whether you use it here or not :) Looking at it most recently I think it should be a made little bit better, I think it will treat these three gauges the same: I think we should probably fix that in the joiner somehow. But if you have those 3 gauges in your current system, even if GaugeKey distinguishes them, they will be the same Gauge because you are using ThreadLocalStore. Note that this issue is not common because mostly tags are not specified when creatin stats, mostly they are extracted from the name using tag-extraction regexes. |
|
Oh, I understand what you're saying now. Iff the stat already exists, there's no difference because we'll find it by name, and the existing stat still has the tags it was created with. My objection is that this is a very tight coupling between a detail of the stats system and this access logger. I think the code is more robust to stat system changes if it always calls Alternatively, in the destructor case, we could call |
|
I was not suggesting we use gaugeFromStatName with the pre-concatenated name - agree that would be too tight a coupling. I was just suggesting we use the same effective hashing mechanism in the new map because (a) it is there (b) bug-for-bug compatibility seems good (c) apparently it is faster though that was not the main motivation. |
|
Got it. I don't think it's actually important in this case whether this hash table uses an equivalent key hashing to the thread local stats store. In this case, it just needs to pass identical arguments to Unless there's any clear bugs, I think I'd rather move forward with the GaugeKey. It seems sufficient for this purpose. I think to use the joiner in the hash map key, we'd need to store some additional data in the value to ensure identical arguments to |
|
OK I don't feel that strongly about it. |
Signed-off-by: Xuyang Tao <taoxuy@google.com>
|
/retest |
1 similar comment
|
/retest |
ggreenway
left a comment
There was a problem hiding this comment.
Mostly looks good, just a couple small things
/wait
Signed-off-by: Xuyang Tao <taoxuy@google.com>
| codec_client_ = makeHttpConnection(lookupPort("http")); | ||
| IntegrationStreamDecoderPtr response = | ||
| codec_client_->makeHeaderOnlyRequest(request_headers); | ||
| init(config_yaml, /*autonomous_upstream=*/false, |
There was a problem hiding this comment.
Use EXPECT_LOG_CONTAINS in this test
Signed-off-by: Xuyang Tao <taoxuy@google.com>
| inflight_gauges_.erase(it); | ||
| } | ||
| } else { | ||
| ENVOY_LOG_MISC(error, "Stats access logger gauge paired subtract was skipped due to no " |
There was a problem hiding this comment.
You made this unlimited (not periodic). I'm guessing you did this to make tests easier, but I think this needs to be limited. If that means some tests can't expect the log message, that's ok, as long as it's tested at least once somewhere. I'd prefer having the name in the message also, otherwise the user doesn't know which gauge config to look at.
Signed-off-by: Xuyang Tao <taoxuy@google.com>
| // Subtract without add -> logs instead of crashing | ||
| EXPECT_LOG_CONTAINS("error", | ||
| "Stats access logger gauge paired subtract was skipped due to no " | ||
| "corresponding add, possibly due to misconfigured events", |
There was a problem hiding this comment.
| "corresponding add, possibly due to misconfigured events", | |
| "corresponding add, possibly due to misconfigured events: gauge", |
| if (GetParam() == Network::Address::IpVersion::v6) { | ||
| return; // Skip for IPv6 due to log throttling in periodic logs as IPv4 and IPv6 run in the same | ||
| // process. | ||
| } |
There was a problem hiding this comment.
This isn't a good pattern. Anyone running only the ipv6 versions will not have this test run at all
| // In debug mode, this should assert because the subtraction is attempted for a gauge that wasn't | ||
| // added and DownstreamEnd evaluates access logs upon stream destruction. We wrap the entire | ||
| // connection flow in the death test so the parent process doesn't create a mock connection that | ||
| // would crash during test teardown. |
There was a problem hiding this comment.
This comment is outdated. It's not a death test anymore.
| // added and DownstreamEnd evaluates access logs upon stream destruction. We wrap the entire | ||
| // connection flow in the death test so the parent process doesn't create a mock connection that | ||
| // would crash during test teardown. | ||
| EXPECT_LOG_CONTAINS("error", |
There was a problem hiding this comment.
Given only 1 test variant will see the log line, I think it's fine to remove the EXPECT_LOG_CONTAINS here, and comment on that near the bottom of the test. stats_test.cc has an EXPECT_LOG_CONTAINS for this situation already.
Signed-off-by: Xuyang Tao <taoxuy@google.com>
Description: Fixes connection gauge underflow crashes in the Stats Access Logger when decrementing metrics after Scope evictions.
The original code correctly attempted to prevent "zombie" gauges by re-resolving metrics against the central store (via scope_->gaugeFromStatNameWithTags) during request destruction. However, it tried to reconstruct the gauge's identity using gauge_->tagExtractedStatName(). This failed because dynamic access-log tags (like %REQUEST_HEADER(...)%) are not registered with Envoy's global extractors. The extraction process returned a mangled base name and empty tags, forcing Scope to create a new 0-valued gauge. Subtracting 1 from it immediately crashed Envoy with a counter underflow.
Fix: we will keep the gauge in the scope cache if it is non-zero
Risk Level: Low
Testing: Added StatsAccessLogIntegrationTest.ActiveRequestsGaugeScopeEviction, which synthetically forces an asynchronous scope eviction while a connection is still inflight. Verified that the gauge successfully decrements to 0 in the central store identically to a normal request finish.
Docs: NA
Release: NA
Platform Specific Features: no