Skip to content

alts: Fix TsiSocket doWrite on outstanding handshake bytes handling#16488

Merged
antoniovicente merged 13 commits intoenvoyproxy:mainfrom
yihuazhang:tsi_write_fix_new
Jun 16, 2021
Merged

alts: Fix TsiSocket doWrite on outstanding handshake bytes handling#16488
antoniovicente merged 13 commits intoenvoyproxy:mainfrom
yihuazhang:tsi_write_fix_new

Conversation

@yihuazhang
Copy link
Copy Markdown
Contributor

@yihuazhang yihuazhang commented May 13, 2021

TsiSocket doWrite() incorrectly handles outstanding handshake bytes produced after handshake is marked as complete. This PR fixes the issue.

Details: when the handshake is complete in doHandshakeNextDone(), it calls raiseEvent(Network::ConnectionEvent::Connected) which triggers TsiSocket::doWrite(). The doWrite() call then enters repeatProtectAndWrite() with outstanding handshake bytes in raw_write_buffer_, which violates the invariant.

Signed-off-by: yihuaz <yihuaz@google.com>
@yihuazhang yihuazhang force-pushed the tsi_write_fix_new branch from 8f2cad7 to 4dd39ad Compare May 13, 2021 19:45
Copy link
Copy Markdown
Contributor

@antoniovicente antoniovicente left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the quick debug and fix.

// Check if we need to flush outstanding handshake bytes.
if (write_buffer_contains_handshake_bytes_) {
ASSERT(raw_write_buffer_.length() > 0);
if (raw_write_buffer_.length() > 0 && prev_bytes_to_drain_ == 0) {
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It would be helpful to add tests to cover the case where doHandshakeNextDone adds bytes and how this version behaves better than the case where we only set "write_buffer_contains_handshake_bytes_ = true" in one special case.

Signed-off-by: yihuaz <yihuaz@google.com>
@yihuazhang yihuazhang force-pushed the tsi_write_fix_new branch from 1f00a25 to c62dfa2 Compare May 17, 2021 22:44
@yihuazhang
Copy link
Copy Markdown
Contributor Author

Regarding test, we have DoWriteOutstandingHandshakeData that covers the case when we finish a handshake with outstanding handshake bytes in raw_write_buffer_.

@antoniovicente
Copy link
Copy Markdown
Contributor

Regarding test, we have DoWriteOutstandingHandshakeData that covers the case when we finish a handshake with outstanding handshake bytes in raw_write_buffer_.

The existing test didn't catch the regression introduced by #15962, so I think we have some gaps in test coverage that we should address in this PR.

Network::IoResult result = raw_buffer_socket_->doWrite(raw_write_buffer_, false);
if (handshake_complete_ && raw_write_buffer_.length() > 0) {
write_buffer_contains_handshake_bytes_ = true;
if (handshake_complete_) {
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

An error could be detected on doWrite which could cause the connection to be closed. I think that this raiseEvent could then cause a crash if the connection is already closed.

Similarly, the call to doWrite after raiseEvent(Connected) needs to check the status of the connection. If the raiseEvent(Connected) closes the connection the doWrite may run into some trouble. IIRC close on raiseEvent(Connected) does happen in the case of SSL sockets.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Which doWrite do you mean in "Similarly, the call to doWrite after raiseEvent(Connected) needs to check the status of the connection"? We do not call doWrite after raiseEvent in this PR.

Signed-off-by: yihuaz <yihuaz@google.com>
@yihuazhang yihuazhang force-pushed the tsi_write_fix_new branch from d667d33 to ec665cd Compare May 17, 2021 23:27
@yihuazhang
Copy link
Copy Markdown
Contributor Author

Regarding test, we have DoWriteOutstandingHandshakeData that covers the case when we finish a handshake with outstanding handshake bytes in raw_write_buffer_.

The existing test didn't catch the regression introduced by #15962, so I think we have some gaps in test coverage that we should address in this PR.

The reason why we did not catch the regression is raiseEvent is mocked in MockTransportSocketCallbacks and I do not know if there is an easy way to unmock it. WDYT?

@antoniovicente
Copy link
Copy Markdown
Contributor

Regarding test, we have DoWriteOutstandingHandshakeData that covers the case when we finish a handshake with outstanding handshake bytes in raw_write_buffer_.

The existing test didn't catch the regression introduced by #15962, so I think we have some gaps in test coverage that we should address in this PR.

The reason why we did not catch the regression is raiseEvent is mocked in MockTransportSocketCallbacks and I do not know if there is an easy way to unmock it. WDYT?

I recommend an integration test that uses tsi socket from the test client. A write down this socket before handshake is complete may be able to repo the failure scenario that we saw on our internal extensions when tsi socket was updated.

@yihuazhang
Copy link
Copy Markdown
Contributor Author

Regarding test, we have DoWriteOutstandingHandshakeData that covers the case when we finish a handshake with outstanding handshake bytes in raw_write_buffer_.

The existing test didn't catch the regression introduced by #15962, so I think we have some gaps in test coverage that we should address in this PR.

The reason why we did not catch the regression is raiseEvent is mocked in MockTransportSocketCallbacks and I do not know if there is an easy way to unmock it. WDYT?

I recommend an integration test that uses tsi socket from the test client. A write down this socket before handshake is complete may be able to repo the failure scenario that we saw on our internal extensions when tsi socket was updated.

Nice suggestion! Let me play around with the integration test to see if we can reproduce it.

// There should be no handshake bytes in raw_write_buffer_.
ASSERT(!(raw_write_buffer_.length() > 0 && prev_bytes_to_drain_ == 0));
while (true) {
uint64_t bytes_to_drain_this_iteration =
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

See https://github.com/envoyproxy/envoy/pull/16514/files

Could you merge upstream/main and consider changing the "std::min<size_t>(" to "std::min<uint64_t>("

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure thing. Done.

@yihuazhang yihuazhang force-pushed the tsi_write_fix_new branch from 2c71305 to 713d4c4 Compare May 19, 2021 21:16
@yihuazhang
Copy link
Copy Markdown
Contributor Author

yihuazhang commented May 19, 2021

Regarding test, we have DoWriteOutstandingHandshakeData that covers the case when we finish a handshake with outstanding handshake bytes in raw_write_buffer_.

The existing test didn't catch the regression introduced by #15962, so I think we have some gaps in test coverage that we should address in this PR.

The reason why we did not catch the regression is raiseEvent is mocked in MockTransportSocketCallbacks and I do not know if there is an easy way to unmock it. WDYT?

I recommend an integration test that uses tsi socket from the test client. A write down this socket before handshake is complete may be able to repo the failure scenario that we saw on our internal extensions when tsi socket was updated.

Nice suggestion! Let me play around with the integration test to see if we can reproduce it.

@antoniovicente, I tried to add some data to client's buffer before starting an ALTS connection, and it seems one place where we can add write() API is before this API returns in makeRawHttpConnection() in http_integration.cc

Then, I got stuck on how to add a partial HTTP request via the above write API because it seems an HTTP request is encoded via an RequestEncoder which does not provide an API that returns the final string to be sent to the peer. In other words, it does not provide an API that we can use to manipulate the HTTP request. Ideally, if there are APIs that can send/receive a raw string over the ALTS connection, we can easily use them to test the above regression. Could you please provide some guidance on it?

@antoniovicente
Copy link
Copy Markdown
Contributor

Regarding test, we have DoWriteOutstandingHandshakeData that covers the case when we finish a handshake with outstanding handshake bytes in raw_write_buffer_.

The existing test didn't catch the regression introduced by #15962, so I think we have some gaps in test coverage that we should address in this PR.

The reason why we did not catch the regression is raiseEvent is mocked in MockTransportSocketCallbacks and I do not know if there is an easy way to unmock it. WDYT?

I recommend an integration test that uses tsi socket from the test client. A write down this socket before handshake is complete may be able to repo the failure scenario that we saw on our internal extensions when tsi socket was updated.

Nice suggestion! Let me play around with the integration test to see if we can reproduce it.

@antoniovicente, I tried to add some data to client's buffer before starting an ALTS connection, and it seems one place where we can add write() API is before this API returns in makeRawHttpConnection() in http_integration.cc

Then, I got stuck on how to add a partial HTTP request via the above write API because it seems an HTTP request is encoded via an RequestEncoder which does not provide an API that returns the final string to be sent to the peer. In other words, it does not provide an API that we can use to manipulate the HTTP request. Ideally, if there are APIs that can send/receive a raw string over the ALTS connection, we can easily use them to test the above regression. Could you please provide some guidance on it?

BaseIntegrationTest::sendRawHttpAndWaitForResponse seems like a possible starting point. The RawConnectionDriver does a write when onEvent(Network::ConnectionEvent::Connected) is delivered by the transport.

Signed-off-by: yihuaz <yihuaz@google.com>
@yihuazhang yihuazhang force-pushed the tsi_write_fix_new branch from e905808 to b3917a1 Compare May 25, 2021 16:22
@yihuazhang
Copy link
Copy Markdown
Contributor Author

The path used when sending raw bytes on the connection doesn't use the codec. I think that your prior attempt ended up creating 2 connections to the proxy, but only 1 of the two used Alts, the other was a raw TCP connection so we ended up sending an HTTP request instead of an ALTS handshake to the ALTS listener.

Ah it makes sense. It could have taken me a while to figure this out. +1 for adding the check to ensure codecs are not created when using raw connections.

@antoniovicente
Copy link
Copy Markdown
Contributor

/retest

@repokitteh-read-only
Copy link
Copy Markdown

Retrying Azure Pipelines:

🐱

Caused by: a #16488 (comment) was created by @antoniovicente.

see: more, trace.

@antoniovicente
Copy link
Copy Markdown
Contributor

I'm not sure why tests don't seem to be running. Could you merge upstream/main ?

@yihuazhang yihuazhang force-pushed the tsi_write_fix_new branch from 9580721 to 58e626a Compare May 26, 2021 18:07
Signed-off-by: yihuaz <yihuaz@google.com>
@yihuazhang yihuazhang force-pushed the tsi_write_fix_new branch from 48d72ea to d3aa25f Compare May 26, 2021 19:27
@antoniovicente
Copy link
Copy Markdown
Contributor

Please look at the TSAN/ASAN CI results. There seems to be a test-only bug in AltsIntegrationTestValidPeer.RouterRequestAndResponseWithBodyRawHttp which results in use-after-free.

@yihuazhang
Copy link
Copy Markdown
Contributor Author

yihuazhang commented May 27, 2021

The use-after-free error happened when invoking verifyActualFrameSizeToUse on client_tsi_socket_, which could be nullptr as we already transferred its ownership to createClientConnection().

@yihuazhang
Copy link
Copy Markdown
Contributor Author

@antoniovicente Do you know the version of gRPC envoy currently depends on? I wonder if it includes grpc/grpc#25621. If it does, we can verify the correctness of max_frame_size in the same way we verify other fields (i.e., rpc_versions) in the client_start and server_start handshake messages. Although it will be a little different in a sense that we currently verify the negotiated frame size (i.e., actual_frame_size_to_use_) while the aforementioned test verifies max_frame_size provided in a handshake message.

I can not think of a better way of verifying actual_frame_size_to_use_ other than creating a static member variable in tsi_socket class that stores actual_frame_size_to_use_ and verify its value in the test (need to reset it for each test case). It is pretty ugly, and I prefer not do so.

So if the current gRPC envoy depends on does not include grpc/grpc#25621, I prefer to add a TODO and add the aforementioned test once the PR gets included. We should also remove the use of verifyActualFrameSizeToUse() in the tests since it has the problem of use-after-free.

PLMK what you think.

@antoniovicente
Copy link
Copy Markdown
Contributor

You can see the current grpc dependency here. We're currently building against a grpc library that is 6 months old. I'm trying to get the dependency updated.

@yihuazhang
Copy link
Copy Markdown
Contributor Author

SG. PLMK when the update is complete, and I will update the test.

@antoniovicente
Copy link
Copy Markdown
Contributor

grpc version PR: #16687

@antoniovicente
Copy link
Copy Markdown
Contributor

The PR to update grpc deps just landed, please git merge upstream/main and look into addressing the remaining tsan/asan issues.

Signed-off-by: yihuaz <yihuaz@google.com>
Signed-off-by: yihuaz <yihuaz@google.com>
@yihuazhang
Copy link
Copy Markdown
Contributor Author

/retest

@repokitteh-read-only
Copy link
Copy Markdown

Retrying Azure Pipelines:
Check envoy-presubmit isn't fully completed, but will still attempt retrying.
Retried failed jobs in: envoy-presubmit

🐱

Caused by: a #16488 (comment) was created by @yihuazhang.

see: more, trace.

@antoniovicente
Copy link
Copy Markdown
Contributor

/retest

since this PR doesn't change docs inputs.

@repokitteh-read-only
Copy link
Copy Markdown

Retrying Azure Pipelines:
Retried failed jobs in: envoy-presubmit

🐱

Caused by: a #16488 (comment) was created by @antoniovicente.

see: more, trace.

Copy link
Copy Markdown
Contributor

@antoniovicente antoniovicente left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks!

@antoniovicente antoniovicente merged commit f531442 into envoyproxy:main Jun 16, 2021
leyao-daily pushed a commit to leyao-daily/envoy that referenced this pull request Sep 30, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants