Skip to content

event: update fd registration mask even if it hasn't changed.#16389

Merged
wrowe merged 7 commits intoenvoyproxy:mainfrom
antoniovicente:lost_events
May 24, 2021
Merged

event: update fd registration mask even if it hasn't changed.#16389
wrowe merged 7 commits intoenvoyproxy:mainfrom
antoniovicente:lost_events

Conversation

@antoniovicente
Copy link
Copy Markdown
Contributor

@antoniovicente antoniovicente commented May 8, 2021

Commit Message:
event: update fd registration mask even if it hasn't changed.

Updates to the fd mask can result in new events when operating in EDGE trigger mode.
Doing this update unconditionally is specially important in cases where there was a synthetic event scheduled
and setEnabled ends up clearing it before the call to updateEvents. By skipping the update we prevent the
generation of a new real event to replace the lost synthetic event.

Without this change, calling close(Flush) a socket that is readDisabled and had a pending synthetic write event
can result in th write event never being delivered so the final flush will fail due to timeout.

Additional Description:
The issue was introduced by this optimization request from me which works for EmulatedEdge but not for Edge triggered sockets: https://github.com/envoyproxy/envoy/pull/13787/files#r520315847

I don't know if there are current situations where this bug would trigger in the proxy. I ran into this while trying to change the order of operations when generating HTTP/1.0 responses framed by connection close; before my prototype changes there was a call to readDisable(false) before Connection::close(Flush) which resulted in existing tests to work.

Risk Level: low
Testing: unit
Docs Changes: n/a
Release Notes: n/a
Platform Specific Features: n/a

Updates to the fd mask can result in new events when operating in EDGE trigger mode.
Doing this update unconditionally is specially important in cases where there was a synthetic event scheduled
and setEnabled ends up clearing it before the call to updateEvents since by skipping the update we prevent the
generation of a new real event to replace the lost synthetic event.

Without this change, calling close(Flush) a socket that is readDisabled and had a pending synthetic write event
can result in th write event never being delivered so the final flush will fail due to timeout.

Signed-off-by: Antonio Vicente <avd@google.com>
Signed-off-by: Antonio Vicente <avd@google.com>
Signed-off-by: Antonio Vicente <avd@google.com>
@antoniovicente
Copy link
Copy Markdown
Contributor Author

/assign-from @envoyproxy/first-pass-reviewers

@repokitteh-read-only
Copy link
Copy Markdown

@envoyproxy/first-pass-reviewers assignee is @dio

🐱

Caused by: a #16389 (comment) was created by @antoniovicente.

see: more, trace.

@davinci26
Copy link
Copy Markdown
Member

davinci26 commented May 12, 2021

I have this PR under my radar, but I haven't got time to understand the edge case exactly. This is why I haven't commented on it

davinci26
davinci26 previously approved these changes May 12, 2021
Copy link
Copy Markdown
Member

@davinci26 davinci26 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for fixing a Windows issue!

@antoniovicente
Copy link
Copy Markdown
Contributor Author

Thanks for fixing a Windows issue!

Windows works fine AFAIK, it's linux that has the issues.

Thanks for the review!

@antoniovicente
Copy link
Copy Markdown
Contributor Author

/assign-from @envoyproxy/senior-maintainers

@repokitteh-read-only
Copy link
Copy Markdown

@envoyproxy/senior-maintainers assignee is @alyssawilk

🐱

Caused by: a #16389 (comment) was created by @antoniovicente.

see: more, trace.

Copy link
Copy Markdown
Contributor

@alyssawilk alyssawilk left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the thorough explanation (and sorry for review delay - second shot hit me hard)
LGTM modulo some explanatory comments :-)

if (trigger_ == FileTriggerType::EmulatedEdge) {
auto new_event_mask = enabled_events_ & ~event;
updateEvents(new_event_mask);
if (new_event_mask != enabled_events_) {
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

comment here and below why this case doesn't need the update?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Writing the comment for updateEvents made me rethink what we should do here. I reverted these changes and instead added a trigger_mode_ check to updateEvents so the update is skipped in modes where it is truly a no-op.

@@ -85,9 +85,6 @@ void FileEventImpl::assignEvents(uint32_t events, event_base* base) {

void FileEventImpl::updateEvents(uint32_t events) {
ASSERT(dispatcher_.isThreadSafe());
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd like either a comment here on why this is important, or a comment in the test calling out that it's regression testing [info from PR description] just to ensure no clever person decides to improve perf by undoing this PR :-P

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Comment added. Thanks for asking for further info in the code, there's a lot of subtle behavior that can be accidentally missed.

Signed-off-by: Antonio Vicente <avd@google.com>
Signed-off-by: Antonio Vicente <avd@google.com>
@antoniovicente
Copy link
Copy Markdown
Contributor Author

Thanks for the thorough explanation (and sorry for review delay - second shot hit me hard)
LGTM modulo some explanatory comments :-)

I know the feeling. The second shot is rough but so worth it.

alyssawilk
alyssawilk previously approved these changes May 20, 2021
Copy link
Copy Markdown
Contributor

@alyssawilk alyssawilk left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM!

@alyssawilk
Copy link
Copy Markdown
Contributor

Coverage flaked. I'll kick off another run but feel free to merge once CI is happy

@antoniovicente
Copy link
Copy Markdown
Contributor Author

Coverage flaked. I'll kick off another run but feel free to merge once CI is happy

The coverage failure is real, I'll fix it by adding a test case for Level trigger events.

/wait

Signed-off-by: Antonio Vicente <avd@google.com>
Signed-off-by: Antonio Vicente <avd@google.com>
@wrowe
Copy link
Copy Markdown
Contributor

wrowe commented May 21, 2021

//test/integration:integration_test is a known flake in CI, investigating, don't let that stop you if the rest of CI passes.

@antoniovicente
Copy link
Copy Markdown
Contributor Author

The issues I see right now are bazel RPC failure while building ASAN and TSAN failure related to the issue I'm trying to address in #16590

/retest

@repokitteh-read-only
Copy link
Copy Markdown

Retrying Azure Pipelines:
Retried failed jobs in: envoy-presubmit

🐱

Caused by: a #16389 (comment) was created by @antoniovicente.

see: more, trace.

@wrowe wrowe merged commit fb274d6 into envoyproxy:main May 24, 2021
leyao-daily pushed a commit to leyao-daily/envoy that referenced this pull request Sep 30, 2021
…roxy#16389)

* event: update fd registration mask even if it hasn't changed.

Updates to the fd mask can result in new events when operating in EDGE trigger mode.
Doing this update unconditionally is specially important in cases where there was a synthetic event scheduled
and setEnabled ends up clearing it before the call to updateEvents since by skipping the update we prevent the
generation of a new real event to replace the lost synthetic event.

Without this change, calling close(Flush) a socket that is readDisabled and had a pending synthetic write event
can result in th write event never being delivered so the final flush will fail due to timeout.

* Remove call to dispatcher exit that is not needed

Signed-off-by: Antonio Vicente <avd@google.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants