
listener: listen socket factory can be cloned from listener draining filter chain #18686

Merged: mattklein123 merged 2 commits into envoyproxy:main from lambdai:fixsetnewordraining on Oct 20, 2021

Conversation

@lambdai (Contributor) commented Oct 19, 2021

Commit Message: listener: listen socket factory can be cloned from listener draining filter chain

Currently the in-place updated old listener holds the listen fd during the drain timeout.

If the corresponding active listener is removed and then added back, the added-back listener creates fresh passive sockets with new inodes. The kernel spreads accepted requests across the old listen fd and the new listen fd.

Sockets accepted on the old listen fd will be lost. Even if the kernel/eBPF somehow requeues these sockets, they will only be handled by Envoy after the drain timeout is triggered.

This PR duplicates the listen fd from the old listener. The new listen fd shares the same inode, so the newly created listener will consume the accepted sockets, if any.

Signed-off-by: Yuchen Dai silentdai@gmail.com

Additional Description:
Risk Level:
Testing: Unittest and a bunch of istio e2e tests
Docs Changes:
Release Notes:
Platform Specific Features:
[Optional Runtime guard:]
Fix #18616
[Optional Fixes commit #PR or SHA]
[Optional Deprecated:]
[Optional API Considerations:]

@lambdai (Contributor, Author) commented Oct 19, 2021

/assign @mattklein123

@lambdai (Contributor, Author) commented Oct 19, 2021

I would argue that #18677 is also needed.

In the test case, we add the listener 0.0.0.0:8080. Happily, the requests that arrived prior to the add are queued.

However, the timing of the add-back is pretty much unpredictable. I would expect stopping the listener 0.0.0.0:8080 to mean that Envoy rejects new requests ASAP.

@lambdai (Contributor, Author) commented Oct 20, 2021

/retest

@repokitteh-read-only (bot) commented:

Retrying Azure Pipelines:
Retried failed jobs in: envoy-presubmit

🐱

Caused by: a #18686 (comment) was created by @lambdai.

@mattklein123 (Member) commented:

> I would argue that #18677 is also needed.

Can you explain this more please? From our lengthy offline discussion I now don't understand how #18677 helps. As long as the sockets are duplicated/cloned, there should be no issue?

@mattklein123 (Member) left a review comment:

Thanks, makes sense to me with a small comment.

/wait

@mattklein123 (Member) left a review comment:

Thanks! We should backport this, obviously.

@lambdai (Contributor, Author) commented Oct 20, 2021

> > I would argue that #18677 is also needed.
>
> Can you explain this more please? From our lengthy offline discussion I now don't understand how #18677 helps. As long as the sockets are duplicated/cloned, there should be no issue?

There is no issue if we add a fresh listener after we stop the listener on that address.

Think about why the added-back fresh listener has to duplicate from the draining-filter-chain listener: it is because the old in-place updated listener doesn't close the listen socket during the drain_timeout after the listener is told to be removed.

The state of the system is described here:
https://github.com/envoyproxy/envoy/pull/18686/files#diff-f8e00b71778057a3747a30c4fc3f3ad13e551cadebaf0eca9bdb476b42cadbcaR5112-R5117

@lambdai (Contributor, Author) commented Oct 20, 2021

> Thanks! We should backport this, obviously.

Thank you for the offline guidance and the quick approval!

@lambdai (Contributor, Author) commented Oct 20, 2021

/backport

We only need to backport to 1.20.

@repokitteh-read-only (bot) added the backport/review (Request to backport to stable releases) label on Oct 20, 2021
@mattklein123 (Member) commented:

> Think about why the added-back fresh listener has to duplicate from the draining-filter-chain listener: it is because the old in-place updated listener doesn't close the listen socket during the drain_timeout after the listener is told to be removed.

Sorry, I still don't understand. If you want to continue the discussion on the other PR we can talk there, but I don't see any scenario in which that PR will help. The real fix is cloning if we want to reuse a socket and not drop anything.


Labels: backport/approved (Approved backports to stable releases)

Linked issue (may be closed by this PR): Envoy v1.20.0 slower to re-initialize dynamic listeners

3 participants