server: avoid flushing while a flush is in progress #16370
snowp merged 6 commits into envoyproxy:main
Signed-off-by: Snow Pettersen <snowp@lyft.com>
@jmarantz This is a fairly naive solution to the problem, so I wanted to put it up for feedback before I started writing tests. I'm not sure if it makes sense to support other ways of handling this (e.g. enqueue a flush instead of dropping it), or if it makes more sense to make this part of the stats store (e.g. have mergeHistograms do nothing if merging is already happening).
I like this approach. I think if stat-flushing is slow you don't want to queue flushes. You want to drop a flush if the previous one is not done. I would suggest adding a stat counting the dropped flushes. I think that would be more useful to us than info-logs.
Signed-off-by: Snow Pettersen <snowp@lyft.com>
Signed-off-by: Snow Pettersen <snowp@lyft.com>
How responsive does stat-flushing need to be? If it could wait, you could queue up a flush and cancel the timer until the flush is complete...
I think we already pause the timer until we complete the flush (i.e. it's scheduled after the flush is complete). The issue this PR is trying to fix is allowing out-of-band flushes to not conflict with scheduled ones. /retest
jmarantz left a comment
Looks great with one minor comment.
test/integration/server.h
-  void mergeHistograms(PostMergeCb) override {}
+  void mergeHistograms(PostMergeCb cb) override { merge_cb_ = cb; }
+
+  PostMergeCb merge_cb_;
nit: expose runMergeCallback() as a public function and leave merge_cb_ private.
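One way the suggestion above could look, as a minimal sketch rather than the code that was merged: the stored callback stays private and the test drives it through a public `runMergeCallback()`. The class name `TestStore`, the `std::function<void()>` signature for `PostMergeCb`, and the boolean return value are assumptions made for illustration; the real test store overrides a base-class method, which is omitted here so the example stands alone.

```cpp
#include <functional>
#include <utility>

// Assumed callback signature; the real PostMergeCb type may differ.
using PostMergeCb = std::function<void()>;

// Hypothetical test double for the stats store (no base class in this sketch).
class TestStore {
public:
  // Store the callback instead of running it, so the test decides when the
  // "async" merge completes.
  void mergeHistograms(PostMergeCb cb) { merge_cb_ = std::move(cb); }

  // Runs the pending merge callback, if any. Returns false when no merge is
  // pending (the return value is an addition for this sketch).
  bool runMergeCallback() {
    if (!merge_cb_) {
      return false;
    }
    // Move the callback out first so it can safely request another merge.
    PostMergeCb cb = std::move(merge_cb_);
    merge_cb_ = nullptr;
    cb();
    return true;
  }

private:
  PostMergeCb merge_cb_;
};
```

A test would call `mergeHistograms(...)` indirectly through the code under test, assert on the intermediate state, and then call `runMergeCallback()` to finish the merge.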
Signed-off-by: Snow Pettersen <snowp@lyft.com>
TSAN failure in //test/integration:multiplexed_integration_test, 2/5 times. Not sure whether that's related.
I see the same failures listed in #test-flaky, so I believe these are also happening on main.
To support calls to Server::Instance::flushStats (let's call this an "external flush") while a flush timer is also active, we need to handle an external flush being requested while a periodic flush is in progress. As it stands, this runs the risk of triggering an ASSERT because histogram merging is async. See envoyproxy/envoy-mobile#748 for an example of this happening.
This PR proposes a simple solution: ignore calls to flushStats while we're still in the process of merging histograms.
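The idea can be sketched roughly as follows. This is an illustration under stated assumptions, not the change itself: `FakeStore`, `Flusher`, `completeMerge()`, and the two counters are hypothetical names, and the dropped-flush counter follows the suggestion made in review rather than any confirmed stat name.

```cpp
#include <cstdint>
#include <functional>
#include <utility>

// Hypothetical stand-in for the stats store. Merging is asynchronous, so the
// completion callback is stored and run later instead of being invoked inline.
class FakeStore {
public:
  using PostMergeCb = std::function<void()>;

  void mergeHistograms(PostMergeCb cb) { pending_cb_ = std::move(cb); }

  // Simulates the asynchronous merge finishing.
  void completeMerge() {
    if (pending_cb_) {
      PostMergeCb cb = std::move(pending_cb_);
      pending_cb_ = nullptr;
      cb();
    }
  }

private:
  PostMergeCb pending_cb_;
};

// Hypothetical stand-in for the server's flush logic.
class Flusher {
public:
  explicit Flusher(FakeStore& store) : store_(store) {}

  // Called by both the periodic timer and external callers.
  void flushStats() {
    if (flush_in_progress_) {
      // A merge is still running: drop this request instead of queueing it.
      ++dropped_flushes_;
      return;
    }
    flush_in_progress_ = true;
    store_.mergeHistograms([this] {
      // Merge finished: the flush would be written to sinks here.
      ++completed_flushes_;
      flush_in_progress_ = false;
    });
  }

  uint64_t droppedFlushes() const { return dropped_flushes_; }
  uint64_t completedFlushes() const { return completed_flushes_; }

private:
  FakeStore& store_;
  bool flush_in_progress_{false};
  uint64_t dropped_flushes_{0};
  uint64_t completed_flushes_{0};
};
```

With this shape, a second `flushStats()` issued before `completeMerge()` only increments the dropped counter, and a flush requested after the merge completes proceeds normally. The sketch uses a plain `bool`, which assumes all calls happen on one thread.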
Signed-off-by: Snow Pettersen <snowp@lyft.com>
Risk Level: Medium
Testing: UTs
Docs Changes: n/a
Release Notes: n/a