udp: prevent crashing the envoy if udpListener is empty#11914
udp: prevent crashing the envoy if udpListener is empty#11914mattklein123 merged 16 commits intoenvoyproxy:masterfrom
Conversation
508033b to
b1d65fe
Compare
include/envoy/network/filter.h
Outdated
There was a problem hiding this comment.
Did you consider not to add interface but to change udpListener to return optional reference?
There was a problem hiding this comment.
Did you consider not to add interface but to change
udpListenerto return optional reference?
Yes. I also considered about that but there are some pros and cons.
Followings are options that I considered.
-
Change the udpListener's signature to return a pointer not reference.
pros : It is easy to check the return value of the method is null or not so added code is less than current patch set.
cons : Need to change every code that call the udpListener and need to change the api's signature that can cause compatibility issues.
The udpListener ensure that the returned value does not have null(I think that is an intention but actually dose not. :( ) so I need to add some code for nullity checks where the udpListener is called. -
Change the udpListener's signature to return a optional reference.
Same as first one. -
Add a new interface to check the nullity.
pros : It does not touch any previous behaviors so there are no compatibility issues and integration issues.
As I mentioned that the udpListener returns reference so that it ensures returned value is not a null but in the some cases does not. I think that the returned value of udpListener almost has valid value. So, it is better to check if returned value is null or not in special cases, I think.
cons : As I mentioned, a new method should be added in special case.
So, I choose the third option.
Please give your opinions if there are any good options and fix my assumption if it is wrong. :)
There was a problem hiding this comment.
An optional reference isn't allowed (by the language) unfortunately. I think option 1 is the correct choice. There's not that much code to change, and I'm not worried about compatbility.
There was a problem hiding this comment.
An optional reference isn't allowed (by the language) unfortunately. I think option 1 is the correct choice. There's not that much code to change, and I'm not worried about compatbility.
Ok. I got your point. I will revise my patch. :)
|
/retest |
|
🤷♀️ nothing to rebuild. |
094be5d to
49a51a0
Compare
|
By my fault, more reviewers were added in my PR. :( |
fixed |
|
/wait |
alyssawilk
left a comment
There was a problem hiding this comment.
Hey @chadr123
Thanks for picking up this bug! We should definitely handle the case where upstream packets are sent after a listener is torn down, but I think we're going to either
- keep the listener open for a draining period (maybe not listening for new packets, but make sure the socket is valid for writing) or
- tear down sessions using the listener when the listener is torn down. If we can't receive downstream packets and we can't forward upstream packets, we should just tear down the lingering filters.
WDYT?
cc @danzh2010
@alyssawilk Actually, I think that this bug is caused by design issue so I don't know I should redesign the envoy. :( |
|
I don't think this is a core-vs-extension issue so much as a TCP vs UDP issue. For TCP, when the listening socket closes, the individual file descriptors that connections are reading and writing from remain valid. For UDP, the "listening socket" is also the reading/writing socket. I think the simplest way to fix this that's consistent with the rest of Envoy APIs is to extend UdpListenerCallbacks to include an onListenerDestroyed or some such, and tear down state then. |
Ok. There are no core-vs-extension issues if I add a new interface for Udp case. |
Ok. I have tested as you mentioned and it works fine. There is no crash. :) |
Signed-off-by: DongRyeol Cha <dr83.cha@samsung.com>
cool! Plz add a regression test for this fix! |
Signed-off-by: DongRyeol Cha <dr83.cha@samsung.com>
@danzh2010 |
| namespace Server { | ||
| namespace { | ||
|
|
||
| class MockUpstreamUdpFilter : public Network::UdpListenerReadFilter { |
There was a problem hiding this comment.
No interface is mocked. rename it to: UnimplementedUpstreamUdpFilter?
There was a problem hiding this comment.
Ok. It is better to rename it to other one.
| void shutdownListener() override { | ||
| // The read_filter_ should be deleted before the udp_listener_ is deleted. | ||
| if (read_filter_.get() != nullptr && udp_listener_.get() == nullptr) { | ||
| throw Envoy::EnvoyException( |
There was a problem hiding this comment.
Does it have to be an exception? If it's just added for the sake of test, I think it's better to written the test in a way that without read_filter_.reset() the test would crash.
There was a problem hiding this comment.
Ok. I will change that code and the test to as you mentioned. :)
Signed-off-by: DongRyeol Cha <dr83.cha@samsung.com>
Signed-off-by: DongRyeol Cha <dr83.cha@samsung.com>
|
@danzh2010 |
Signed-off-by: DongRyeol Cha <dr83.cha@samsung.com>
| EXPECT_CALL(*listener, onDestroy()); | ||
| } | ||
|
|
||
| // The read_filter should be deleted before the udp_listener is deleted. |
There was a problem hiding this comment.
Can you add a UDP proxy filter test to reproduce the crash you encountered and verify that the crash is gone with the fix?
There was a problem hiding this comment.
I think that the read filter is reset so that any packets cannot be received that send by upstream by the my fix.
I can add a UDP proxy filter test to reproduce the crash I encountered but it is hard to verify that the crash is gone on UDP proxy filter test because there is no framework or method to simulate it.
As you know that it is simulated by call the file_event_cb_ directly.
But if the read filter is reset, the file event callback that the read filter have is never cannot be called.
Do you have any ideas to simulate without call the file_event_cb_ directly on the UDP proxy filter test?
I have no ideas for that. :(
There was a problem hiding this comment.
How about any integration test which don't simulate file_event_cb_ but actually sends packet to that socket? @mattklein123 about ideas of how to write regression test for this crash as Matt knows better about that test code.
There was a problem hiding this comment.
I checked the integration test but I cannot find a way to test my patch.
As you know that the root cause was race condition between read filter and udp listener.
So, if we want to test the before and after, we should control the deletion of read filter and udp listener.
But I cannot find the framework or a way for that. :(
That is way I check the order of read filter and udp listener deletions.
Do I have to really add more tests for this patch?
|
Dan, I'm headed out for vacation tomorrow - can you /assign matt when you've signed off? |
|
Sorry for the delay. I will take a look at the PR and some testing ideas tomorrow. |
| void resumeListening() override { udp_listener_->enable(); } | ||
| void shutdownListener() override { udp_listener_.reset(); } | ||
| void shutdownListener() override { | ||
| // The read_filter_ should be deleted before the udp_listener_ is deleted. |
There was a problem hiding this comment.
@danzh2010 @chadr123 sorry I haven't been tracking this PR. Can you describe in more detail what the actual problem is? I would like to understand this better, which will help build a test, but also make this comment better. My very quick understanding is that this is the line that is crashing? Is that right?
So the issue is a race condition in which the listener is removed, but before we destroy the filter during deferred deletion. Is that right? So the fix here is to make sure that when the listener is shutdown/destroyed the filter is also destroyed?
Yeah it might not be possible to have an integration test for this case but I will determine that once we make this comment more complete in terms of what is going on here. Thank you!
/wait
There was a problem hiding this comment.
Yes. All your understanding is right.
So, the filter should be deleted before listener deletion because the filter refers to the listener.
There was a problem hiding this comment.
Ok thanks. I agree I don't see how it will be possible to make an integration test for this. Please make the comment more robust and we can ship. Thank you!
There was a problem hiding this comment.
How about this?
| // The read_filter_ should be deleted before the udp_listener_ is deleted. | |
| // The read filter should be deleted before the udp listener is deleted. | |
| // The read filter refers to udp listener to send packets to downstream that comes from upstream. | |
| // So, if udp listener is deleted first than read filter, the read filter try to send packets to downstream but there is no udp listener. | |
| // In that case, null pointer referencing will be occurred. | |
| // This is serious in case of hot restart because if previous envoy process is crashed by this null pointer referencing, new envoy process that started for hot restart will be also stopped because it cannot signal to previous envoy process. |
There was a problem hiding this comment.
Let's do this (reflow as appropriate):
// The read filter should be deleted before the UDP listener is deleted.
// The read filter refers to the UDP listener to send packets to downstream.
// If the UDP listener is deleted before the read filter, the read filter may try to use it after deletion.
Thank you!
There was a problem hiding this comment.
Ok. I have changed the comments now. :)
Signed-off-by: DongRyeol Cha <dr83.cha@samsung.com>
…1914) This change delete the read filter before udp listener deletion. Signed-off-by: DongRyeol Cha <dr83.cha@samsung.com>
…1914) This change delete the read filter before udp listener deletion. Signed-off-by: DongRyeol Cha <dr83.cha@samsung.com> Signed-off-by: chaoqinli <chaoqinli@google.com>
This bug fix release note should be added on envoyproxy#11914 but is was missed. Signed-off-by: DongRyeol Cha <dr83.cha@samsung.com>
Commit Message: I have observed that the envoy sometimes crashed if I configure the envoy
to use the udp proxy.
It is occurred when the hot restarting case.
Let's suppose that there are three upstreams so that envoy has three upstreams.
For some reason, the first upstream is crashed so the k8s removed the dead pod.
In that case, I removed the dead pod from envoy and triggered hot restart without dead pod.
But the dead pod actually not dead.
So, it sent some datagram packets to envoy.
In that case, the envoy does not have that upstream so there is no udp listener.
But the envoy willing to send the datagram packets to downstream even if there is
no proper udp listener so the NPE occurred.
This change delete the read filter before udp listener deletion.
Signed-off-by: DongRyeol Cha dr83.cha@samsung.com
Risk Level: Low
Testing: bazel test
Docs Changes: N/A
Release Notes: N/A