Skip to content

udp: prevent crashing the envoy if udpListener is empty#11914

Merged
mattklein123 merged 16 commits intoenvoyproxy:masterfrom
chadr123:fix_udp_proxy_crash
Aug 6, 2020
Merged

udp: prevent crashing the envoy if udpListener is empty#11914
mattklein123 merged 16 commits intoenvoyproxy:masterfrom
chadr123:fix_udp_proxy_crash

Conversation

@chadr123
Copy link
Contributor

@chadr123 chadr123 commented Jul 7, 2020

Commit Message: I have observed that the envoy sometimes crashed if I configure the envoy
to use the udp proxy.

It is occurred when the hot restarting case.
Let's suppose that there are three upstreams so that envoy has three upstreams.

For some reason, the first upstream is crashed so the k8s removed the dead pod.
In that case, I removed the dead pod from envoy and triggered hot restart without dead pod.

But the dead pod actually not dead.
So, it sent some datagram packets to envoy.

In that case, the envoy does not have that upstream so there is no udp listener.
But the envoy willing to send the datagram packets to downstream even if there is
no proper udp listener so the NPE occurred.

This change delete the read filter before udp listener deletion.

Signed-off-by: DongRyeol Cha dr83.cha@samsung.com

Risk Level: Low
Testing: bazel test
Docs Changes: N/A
Release Notes: N/A

@chadr123 chadr123 requested a review from mattklein123 as a code owner July 7, 2020 04:44
@chadr123 chadr123 force-pushed the fix_udp_proxy_crash branch from 508033b to b1d65fe Compare July 7, 2020 08:04
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Did you consider not to add interface but to change udpListener to return optional reference?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Did you consider not to add interface but to change udpListener to return optional reference?

Yes. I also considered about that but there are some pros and cons.
Followings are options that I considered.

  1. Change the udpListener's signature to return a pointer not reference.
    pros : It is easy to check the return value of the method is null or not so added code is less than current patch set.
    cons : Need to change every code that call the udpListener and need to change the api's signature that can cause compatibility issues.
    The udpListener ensure that the returned value does not have null(I think that is an intention but actually dose not. :( ) so I need to add some code for nullity checks where the udpListener is called.

  2. Change the udpListener's signature to return a optional reference.
    Same as first one.

  3. Add a new interface to check the nullity.
    pros : It does not touch any previous behaviors so there are no compatibility issues and integration issues.
    As I mentioned that the udpListener returns reference so that it ensures returned value is not a null but in the some cases does not. I think that the returned value of udpListener almost has valid value. So, it is better to check if returned value is null or not in special cases, I think.
    cons : As I mentioned, a new method should be added in special case.

So, I choose the third option.
Please give your opinions if there are any good options and fix my assumption if it is wrong. :)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Got it clearly. thanks! 🙏

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

An optional reference isn't allowed (by the language) unfortunately. I think option 1 is the correct choice. There's not that much code to change, and I'm not worried about compatbility.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

An optional reference isn't allowed (by the language) unfortunately. I think option 1 is the correct choice. There's not that much code to change, and I'm not worried about compatbility.

Ok. I got your point. I will revise my patch. :)

@ggreenway ggreenway self-assigned this Jul 7, 2020
@chadr123
Copy link
Contributor Author

chadr123 commented Jul 8, 2020

/retest

@repokitteh-read-only
Copy link

🤷‍♀️ nothing to rebuild.

🐱

Caused by: a #11914 (comment) was created by @chadr123.

see: more, trace.

@chadr123
Copy link
Contributor Author

chadr123 commented Jul 8, 2020

By my fault, more reviewers were added in my PR. :(
How can I remove reviewers except mandatory reviewers?

@ggreenway
Copy link
Member

By my fault, more reviewers were added in my PR. :(
How can I remove reviewers except mandatory reviewers?

fixed

@ggreenway
Copy link
Member

/wait

Copy link
Contributor

@alyssawilk alyssawilk left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hey @chadr123
Thanks for picking up this bug! We should definitely handle the case where upstream packets are sent after a listener is torn down, but I think we're going to either

  1. keep the listener open for a draining period (maybe not listening for new packets, but make sure the socket is valid for writing) or
  2. tear down sessions using the listener when the listener is torn down. If we can't receive downstream packets and we can't forward upstream packets, we should just tear down the lingering filters.
    WDYT?
    cc @danzh2010

@chadr123
Copy link
Contributor Author

chadr123 commented Jul 10, 2020

Hey @chadr123
Thanks for picking up this bug! We should definitely handle the case where upstream packets are sent after a listener is torn down, but I think we're going to either

  1. keep the listener open for a draining period (maybe not listening for new packets, but make sure the socket is valid for writing) or
  2. tear down sessions using the listener when the listener is torn down. If we can't receive downstream packets and we can't forward upstream packets, we should just tear down the lingering filters.
    WDYT?
    cc @danzh2010

@alyssawilk
I think your suggestion also makes sense but to do that, the ActiveRawUdpListener should know the ActiveSession that is an extension not core module. The ActiveRawUdpListener is belonging to Envoy::Server namespace but the ActiveSession is belonging to Envoy::Extensions::UdpFilters::UdpProxy namespace.
So, I think that it is not a good design.

Actually, I think that this bug is caused by design issue so I don't know I should redesign the envoy. :(
Please give your opinions. :)

@ggreenway ggreenway assigned alyssawilk and unassigned ggreenway Jul 10, 2020
@alyssawilk
Copy link
Contributor

I don't think this is a core-vs-extension issue so much as a TCP vs UDP issue.

For TCP, when the listening socket closes, the individual file descriptors that connections are reading and writing from remain valid. For UDP, the "listening socket" is also the reading/writing socket.

I think the simplest way to fix this that's consistent with the rest of Envoy APIs is to extend UdpListenerCallbacks to include an onListenerDestroyed or some such, and tear down state then.

@chadr123
Copy link
Contributor Author

I don't think this is a core-vs-extension issue so much as a TCP vs UDP issue.

For TCP, when the listening socket closes, the individual file descriptors that connections are reading and writing from remain valid. For UDP, the "listening socket" is also the reading/writing socket.

I think the simplest way to fix this that's consistent with the rest of Envoy APIs is to extend UdpListenerCallbacks to include an onListenerDestroyed or some such, and tear down state then.

Ok. There are no core-vs-extension issues if I add a new interface for Udp case.
I will revise my patch soon. Thanks!!

@chadr123
Copy link
Contributor Author

I personally don't like the idea of tracking all the ActiveSessions as UdpPacketProcessors in UdpListener. The UdpPacketProcessor interface of UdpProxyFilter::ActiveSession is used for upstream IO. While UdpListener shouldn't have to know the details about its ListenerFilter instances.

Can you instead destroy ActiveRawUdpListener:: read_filter_ when UdpListener is destructed? Maybe in shutdownListener() call read_filter_.reset() and then udp_listener_.reset()

Ok. I have tested as you mentioned and it works fine. There is no crash. :)
I will revise my patch soon!!

Signed-off-by: DongRyeol Cha <dr83.cha@samsung.com>
@danzh2010
Copy link
Contributor

I personally don't like the idea of tracking all the ActiveSessions as UdpPacketProcessors in UdpListener. The UdpPacketProcessor interface of UdpProxyFilter::ActiveSession is used for upstream IO. While UdpListener shouldn't have to know the details about its ListenerFilter instances.
Can you instead destroy ActiveRawUdpListener:: read_filter_ when UdpListener is destructed? Maybe in shutdownListener() call read_filter_.reset() and then udp_listener_.reset()

Ok. I have tested as you mentioned and it works fine. There is no crash. :)
I will revise my patch soon!!

cool! Plz add a regression test for this fix!

DongRyeol Cha added 2 commits July 27, 2020 20:31
Signed-off-by: DongRyeol Cha <dr83.cha@samsung.com>
Signed-off-by: DongRyeol Cha <dr83.cha@samsung.com>
@chadr123
Copy link
Contributor Author

I personally don't like the idea of tracking all the ActiveSessions as UdpPacketProcessors in UdpListener. The UdpPacketProcessor interface of UdpProxyFilter::ActiveSession is used for upstream IO. While UdpListener shouldn't have to know the details about its ListenerFilter instances.
Can you instead destroy ActiveRawUdpListener:: read_filter_ when UdpListener is destructed? Maybe in shutdownListener() call read_filter_.reset() and then udp_listener_.reset()

Ok. I have tested as you mentioned and it works fine. There is no crash. :)
I will revise my patch soon!!

cool! Plz add a regression test for this fix!

@danzh2010
Hi, I have added a regression test for that. :)

namespace Server {
namespace {

class MockUpstreamUdpFilter : public Network::UdpListenerReadFilter {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No interface is mocked. rename it to: UnimplementedUpstreamUdpFilter?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok. It is better to rename it to other one.

void shutdownListener() override {
// The read_filter_ should be deleted before the udp_listener_ is deleted.
if (read_filter_.get() != nullptr && udp_listener_.get() == nullptr) {
throw Envoy::EnvoyException(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does it have to be an exception? If it's just added for the sake of test, I think it's better to written the test in a way that without read_filter_.reset() the test would crash.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok. I will change that code and the test to as you mentioned. :)

DongRyeol Cha added 3 commits July 29, 2020 14:08
Signed-off-by: DongRyeol Cha <dr83.cha@samsung.com>
Signed-off-by: DongRyeol Cha <dr83.cha@samsung.com>
Signed-off-by: DongRyeol Cha <dr83.cha@samsung.com>
@chadr123
Copy link
Contributor Author

@danzh2010
Hi, I have revised my patch again. :)

Signed-off-by: DongRyeol Cha <dr83.cha@samsung.com>
EXPECT_CALL(*listener, onDestroy());
}

// The read_filter should be deleted before the udp_listener is deleted.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you add a UDP proxy filter test to reproduce the crash you encountered and verify that the crash is gone with the fix?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think that the read filter is reset so that any packets cannot be received that send by upstream by the my fix.
I can add a UDP proxy filter test to reproduce the crash I encountered but it is hard to verify that the crash is gone on UDP proxy filter test because there is no framework or method to simulate it.
As you know that it is simulated by call the file_event_cb_ directly.
But if the read filter is reset, the file event callback that the read filter have is never cannot be called.

Do you have any ideas to simulate without call the file_event_cb_ directly on the UDP proxy filter test?
I have no ideas for that. :(

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How about any integration test which don't simulate file_event_cb_ but actually sends packet to that socket? @mattklein123 about ideas of how to write regression test for this crash as Matt knows better about that test code.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I checked the integration test but I cannot find a way to test my patch.
As you know that the root cause was race condition between read filter and udp listener.
So, if we want to test the before and after, we should control the deletion of read filter and udp listener.
But I cannot find the framework or a way for that. :(
That is way I check the order of read filter and udp listener deletions.
Do I have to really add more tests for this patch?

@alyssawilk
Copy link
Contributor

Dan, I'm headed out for vacation tomorrow - can you /assign matt when you've signed off?

@mattklein123 mattklein123 self-assigned this Aug 5, 2020
@mattklein123
Copy link
Member

Sorry for the delay. I will take a look at the PR and some testing ideas tomorrow.

void resumeListening() override { udp_listener_->enable(); }
void shutdownListener() override { udp_listener_.reset(); }
void shutdownListener() override {
// The read_filter_ should be deleted before the udp_listener_ is deleted.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@danzh2010 @chadr123 sorry I haven't been tracking this PR. Can you describe in more detail what the actual problem is? I would like to understand this better, which will help build a test, but also make this comment better. My very quick understanding is that this is the line that is crashing? Is that right?

const Api::IoCallUint64Result rc = cluster_.filter_.read_callbacks_->udpListener().send(data);

So the issue is a race condition in which the listener is removed, but before we destroy the filter during deferred deletion. Is that right? So the fix here is to make sure that when the listener is shutdown/destroyed the filter is also destroyed?

Yeah it might not be possible to have an integration test for this case but I will determine that once we make this comment more complete in terms of what is going on here. Thank you!

/wait

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes. All your understanding is right.
So, the filter should be deleted before listener deletion because the filter refers to the listener.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok thanks. I agree I don't see how it will be possible to make an integration test for this. Please make the comment more robust and we can ship. Thank you!

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How about this?

Suggested change
// The read_filter_ should be deleted before the udp_listener_ is deleted.
// The read filter should be deleted before the udp listener is deleted.
// The read filter refers to udp listener to send packets to downstream that comes from upstream.
// So, if udp listener is deleted first than read filter, the read filter try to send packets to downstream but there is no udp listener.
// In that case, null pointer referencing will be occurred.
// This is serious in case of hot restart because if previous envoy process is crashed by this null pointer referencing, new envoy process that started for hot restart will be also stopped because it cannot signal to previous envoy process.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's do this (reflow as appropriate):

// The read filter should be deleted before the UDP listener is deleted.
// The read filter refers to the UDP listener to send packets to downstream.
// If the UDP listener is deleted before the read filter, the read filter may try to use it after deletion.

Thank you!

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok. I have changed the comments now. :)

Signed-off-by: DongRyeol Cha <dr83.cha@samsung.com>
Copy link
Member

@mattklein123 mattklein123 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks!

@mattklein123 mattklein123 merged commit 8a7ba1f into envoyproxy:master Aug 6, 2020
@chadr123 chadr123 deleted the fix_udp_proxy_crash branch August 6, 2020 23:19
chaoqin-li1123 pushed a commit to chaoqin-li1123/envoy that referenced this pull request Aug 7, 2020
…1914)

This change delete the read filter before udp listener deletion.

Signed-off-by: DongRyeol Cha <dr83.cha@samsung.com>
chaoqin-li1123 pushed a commit to chaoqin-li1123/envoy that referenced this pull request Aug 7, 2020
…1914)

This change delete the read filter before udp listener deletion.

Signed-off-by: DongRyeol Cha <dr83.cha@samsung.com>
Signed-off-by: chaoqinli <chaoqinli@google.com>
chadr123 pushed a commit to chadr123/envoy that referenced this pull request Sep 5, 2020
This bug fix release note should be added on envoyproxy#11914 but is was missed.

Signed-off-by: DongRyeol Cha <dr83.cha@samsung.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

6 participants