Skip to content

hcm: handle stream reset during downstream encoding#19144

Merged
alyssawilk merged 9 commits intoenvoyproxy:mainfrom
danzh2010:fixhcm
Dec 7, 2021
Merged

hcm: handle stream reset during downstream encoding#19144
alyssawilk merged 9 commits intoenvoyproxy:mainfrom
danzh2010:fixhcm

Conversation

@danzh2010
Copy link
Copy Markdown
Contributor

@danzh2010 danzh2010 commented Nov 30, 2021

Signed-off-by: Dan Zhang danzh@google.com

Commit Message: when the downstream codec encodes response header/body/trailers in the same call stack as upstream decoding, which is the case for Http3, a connection could gets closed which will cause the stream to be reset. In such case, FilterManagerImpl shouldn't continue process the rest of the response, especially end_stream.

Additional Description: this contributes to the crash in #19006. This fix stops the crash, but didn't make the test pass.

Risk Level: low
Testing: new unit tests
Release Notes: guarded by envoy.reloadable_features.handle_stream_reset_during_hcm_encoding

Signed-off-by: Dan Zhang <danzh@google.com>
Signed-off-by: Dan Zhang <danzh@google.com>
Signed-off-by: Dan Zhang <danzh@google.com>
@danzh2010
Copy link
Copy Markdown
Contributor Author

/retest

@repokitteh-read-only
Copy link
Copy Markdown

Retrying Azure Pipelines:
Retried failed jobs in: envoy-presubmit

🐱

Caused by: a #19144 (comment) was created by @danzh2010.

see: more, trace.

Signed-off-by: Dan Zhang <danzh@google.com>
Copy link
Copy Markdown
Contributor

@alyssawilk alyssawilk left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd definitely like to see this integration tested.
I believe we can enable DISABLED_AdminDrainDrainsListeners with this change, but also with some potential improvements to get the new HandleSocketFail test to work for HTTP/3?

Copy link
Copy Markdown
Contributor

@alyssawilk alyssawilk left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also one high level question for @mattklein123

The problem here is we're writing to a closed socket, which causes us to close the connection, and reset the stream under the stack of a write, and then we were processing end stream.

I wonder if instead of immediate connection close, we'd be better off scheduling the close for the next loop to avoid closing the downstream connection under the stack of the upstream connection?

@danzh2010
Copy link
Copy Markdown
Contributor Author

Also one high level question for @mattklein123

The problem here is we're writing to a closed socket, which causes us to close the connection, and reset the stream under the stack of a write, and then we were processing end stream.

I wonder if instead of immediate connection close, we'd be better off scheduling the close for the next loop to avoid closing the downstream connection under the stack of the upstream connection?

In this case, the connection was closed by QUICHE in its write call stack. I don't think we can delay the close. What we can delay is HCM's reset callback.

Signed-off-by: Dan Zhang <danzh@google.com>
@mattklein123
Copy link
Copy Markdown
Member

I wonder if instead of immediate connection close, we'd be better off scheduling the close for the next loop to avoid closing the downstream connection under the stack of the upstream connection?

No objection from me at a high level if we need to do deferred processing, though I haven't spent the time to look at this in detail. If you want me to look at this change in more detail I can spend some time tomorrow.

Signed-off-by: Dan Zhang <danzh@google.com>
Signed-off-by: Dan Zhang <danzh@google.com>
@alyssawilk
Copy link
Copy Markdown
Contributor

nah I buy dan's argument that it's not worth the refactor
/wait on the integration test now my PR landed

Signed-off-by: Dan Zhang <danzh@google.com>
@danzh2010
Copy link
Copy Markdown
Contributor Author

Integration test is added. PTAL

Copy link
Copy Markdown
Contributor

@alyssawilk alyssawilk left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lgtm modulo one remaining nit

@@ -3557,7 +3557,7 @@ TEST_P(DownstreamProtocolIntegrationTest, HandleSocketFail) {
socket_swap.write_matcher_->setSourcePort(lookupPort("http"));
socket_swap.write_matcher_->setWriteOverride(ebadf);
// TODO(danzh) set to true.
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

remove TODO?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

@alyssawilk alyssawilk assigned alyssawilk and snowp and unassigned alyssawilk Dec 6, 2021
Signed-off-by: Dan Zhang <danzh@google.com>
Copy link
Copy Markdown
Contributor

@snowp snowp left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, thanks!

@danzh2010
Copy link
Copy Markdown
Contributor Author

Can it be merged?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants