Skip to content

server: avoid flushing while a flush is in progress#16370

Merged
snowp merged 6 commits intoenvoyproxy:mainfrom
snowp:flush-during-flush
May 12, 2021
Merged

server: avoid flushing while a flush is in progress#16370
snowp merged 6 commits intoenvoyproxy:mainfrom
snowp:flush-during-flush

Conversation

@snowp
Copy link
Copy Markdown
Contributor

@snowp snowp commented May 6, 2021

In order to support calls to Server::Instance::flushStats (let's call this an "external flush") while also having a flush timer activated we need to handle an external flush being requested while in the process of doing a periodic flush. As it stands,
this currently runs the risk of triggering an ASSERT due to the histogram merging being async. See envoyproxy/envoy-mobile#748 for an example of this happening.

This PR proposes a simple solution where we simply ignore calls to flushStats when we're still in the process of merging
histograms.

Signed-off-by: Snow Pettersen snowp@lyft.com

Risk Level: Medium
Testing: UTs
Docs Changes: n/a
Release Notes: n/a

Signed-off-by: Snow Pettersen <snowp@lyft.com>
@snowp
Copy link
Copy Markdown
Contributor Author

snowp commented May 6, 2021

@jmarantz This is a fairly naive solution to the problem so wanted to put this up for feedback before I started writing tests. I'm not sure if it makes sense to support other ways of handling this (e.g. enqueue a flush instead of dropping it), or if it makes more sense to make this part of the stats store (e.g. have mergeHistogram do nothing if merging is already happening).

@jmarantz
Copy link
Copy Markdown
Contributor

jmarantz commented May 6, 2021

I like this approach. I think if stat-flushing is slow you don't want to queue them. You want to drop flushes if they are not done.

I would suggest adding a state counting the dropped flushes. I think that would be more useful to us than info-logs.

Snow Pettersen added 3 commits May 7, 2021 17:32
Signed-off-by: Snow Pettersen <snowp@lyft.com>
Signed-off-by: Snow Pettersen <snowp@lyft.com>
Signed-off-by: Snow Pettersen <snowp@lyft.com>
@snowp snowp marked this pull request as ready for review May 7, 2021 18:12
@dmitri-d
Copy link
Copy Markdown
Contributor

dmitri-d commented May 7, 2021

How responsive does stat-flushing need to be? If it could wait you could queue up a flush and cancel the timer until the flush is complete...

@snowp
Copy link
Copy Markdown
Contributor Author

snowp commented May 8, 2021

How responsive does stat-flushing need to be? If it could wait you could queue up a flush and cancel the timer until the flush is complete...

I think we already pause the timer until we complete the flush (ie its scheduled after the flush is complete). The issue this PR is trying to fix is allowing out of band flushes to not conflict with scheduled ones.

/retest

@repokitteh-read-only
Copy link
Copy Markdown

Retrying Azure Pipelines:
Retried failed jobs in: envoy-presubmit

🐱

Caused by: a #16370 (comment) was created by @snowp.

see: more, trace.

Copy link
Copy Markdown
Contributor

@jmarantz jmarantz left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks great with one minor comment.

void mergeHistograms(PostMergeCb) override {}
void mergeHistograms(PostMergeCb cb) override { merge_cb_ = cb; }

PostMergeCb merge_cb_;
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: expose runMergeCallback() as a public function and leave merge_cb_ private.

Snow Pettersen added 2 commits May 11, 2021 18:39
Signed-off-by: Snow Pettersen <snowp@lyft.com>
Signed-off-by: Snow Pettersen <snowp@lyft.com>
Copy link
Copy Markdown
Contributor

@jmarantz jmarantz left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

thanks!

@jmarantz
Copy link
Copy Markdown
Contributor

tsan failure in //test/integration:multiplexed_integration_test , 2/5 times. Not sure whether that's related.

@snowp
Copy link
Copy Markdown
Contributor Author

snowp commented May 12, 2021

I see the same failures listed in #test-flaky, so I believe these are also happening on main

@snowp snowp merged commit c784d89 into envoyproxy:main May 12, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants