
thrift_proxy: fix crash when remote closes the connection#6549

Merged
zuercher merged 6 commits into envoyproxy:master from rgs1:issue-6496
Apr 12, 2019

Conversation

@rgs1
Member

@rgs1 rgs1 commented Apr 10, 2019

There are a few paths within the Thrift Proxy where we should ensure
the connection is not closed before trying to write. This change
ensures that sendLocalReply() will return early if the connection
is gone.

It also adds a check for transformEnd(), which gets called from
upstreamData().

Risk Level: low
Testing: unit tests added
Fixes: #6496

Signed-off-by: Raul Gutierrez Segales <rgs@pinterest.com>

Signed-off-by: Raul Gutierrez Segales <rgs@pinterest.com>
@rgs1 rgs1 requested a review from zuercher as a code owner April 10, 2019 22:45
@rgs1
Member Author

rgs1 commented Apr 10, 2019

Will follow up with tests in a bit.

Signed-off-by: Raul Gutierrez Segales <rgs@pinterest.com>
@zuercher zuercher changed the title Fix crash when remote closes the connection thrift_proxy: fix crash when remote closes the connection Apr 11, 2019
@zuercher
Member

@rgs1 this looks correct to me -- are you going to add a test for the ConnectionManager::ResponseDecoder::transportEnd case as well?

@rgs1
Member Author

rgs1 commented Apr 11, 2019

@rgs1 this looks correct to me -- are you going to add a test for the ConnectionManager::ResponseDecoder::transportEnd case as well?

Yup, yup. For the first unit test, though, I am still figuring out why cx_destroy_remote_with_active_rq isn't incremented... I'll dig in a bit more.

@zuercher
Member

zuercher commented Apr 11, 2019

I believe cx_destroy_remote_with_active_rq isn't incremented because ConnectionManager receives a full request and thinks it has transmitted a response:

  • ActiveRpc::sendLocalReply sets local_response_sent_, indicating that we shouldn't apply the filter chain to the remainder of the downstream request.
  • ActiveRpc::transportEnd is invoked when the end of the downstream request is reached.
  • ActiveRpc::finalizeRequest is called by transportEnd; it detects that local_response_sent_ is true and sets up deferred destruction of the active RPC.
  • ConnectionManager::onData invokes resetAllRpcs because of the end of stream, but there are no active RPCs, so it doesn't increment the counter.

If the downstream request were partial (e.g. if the buffer didn't contain a complete RPC message), ActiveRpc::transportEnd would not be called and ConnectionManager::resetAllRpcs would find incomplete RPCs and increment the counter.

But small requests will fit in a buffer, so this isn't just an artifact of the test. If you really want to count these as cx_destroy_remote_with_active_rq, I think you'd want to switch ActiveRpc::local_response_sent_ to an enum with three values (not sent, sent successfully, send failed) and arrange for ConnectionManager::sendLocalReply to report send success or failure to ActiveRpc::sendLocalReply so it can set local_response_sent_ correctly. Then, in ActiveRpc::applyDecoderFilters, you could return FilterStatus::Continue if local_response_sent_ indicates success and FilterStatus::StopIteration on failure. I believe that would cause ConnectionManager::onData to detect an incomplete RPC with a closed connection and trigger the counter increment.

This ensures we correctly handle the case when the remote downstream
already closed the connection.

Signed-off-by: Raul Gutierrez Segales <rgs@pinterest.com>
@rgs1
Member Author

rgs1 commented Apr 11, 2019

@zuercher thanks for the detailed explanation. I am not that crazy about another counter for this special-ish case... In the meantime, I'll remove my comment about that counter not being incremented.

@fishcakez @derekargueta thoughts?

Raul Gutierrez Segales added 2 commits April 11, 2019 14:52
Signed-off-by: Raul Gutierrez Segales <rgs@pinterest.com>
Signed-off-by: Raul Gutierrez Segales <rgs@pinterest.com>
Contributor

@fishcakez fishcakez left a comment


I'm not sure we need to differentiate at that level if it adds complexity, and the current behavior around having seen transportEnd makes good sense. I left a couple of nits for other metrics though.

Signed-off-by: Raul Gutierrez Segales <rgs@pinterest.com>
Contributor

@fishcakez fishcakez left a comment


This looks good to me, can we merge it @zuercher ?

Member

@zuercher zuercher left a comment


Looks good. Thanks for picking this up.

@zuercher zuercher merged commit d324731 into envoyproxy:master Apr 12, 2019
rgs1 pushed a commit to rgs1/envoy that referenced this pull request Jan 31, 2020
We've been using this in prod for a couple of months and for
a few different services.

A few critical bugs have been fixed (envoyproxy#9089 and envoyproxy#6549) and some
necessary stats/features have been added (envoyproxy#9203 and envoyproxy#8994), so
it's probably a good time to graduate this filter.

Signed-off-by: Raul Gutierrez Segales <rgs@pinterest.com>
rgs1 pushed a commit to rgs1/envoy that referenced this pull request Feb 1, 2020
We've been using this in prod for a couple of months and for
a few different services.

A few critical bugs have been fixed (envoyproxy#9089 and envoyproxy#6549) and some
necessary stats/features have been added (envoyproxy#9203 and envoyproxy#8994), so
it's probably a good time to graduate at least the proxy
and the router filter.

Signed-off-by: Raul Gutierrez Segales <rgs@pinterest.com>
mattklein123 pushed a commit that referenced this pull request Feb 1, 2020
We've been using this in prod for a couple of months and for
a few different services.

A few critical bugs have been fixed (#9089 and #6549) and some
necessary stats/features have been added (#9203 and #8994), so
it's probably a good time to graduate at least the proxy
and the router filter.

Signed-off-by: Raul Gutierrez Segales <rgs@pinterest.com>


Development

Successfully merging this pull request may close these issues.

Segfault in thrift proxy

3 participants