http: fixing a watermark bug#4553
Conversation
Signed-off-by: Alyssa Wilk <alyssar@chromium.org>
ee65310 to
1aa1041
Compare
1aa1041 to
7226203
Compare
|
Thanks @alyssawilk for jumping on this. I'm not going to get a chance to review until this weekend or Monday and will have to page all of this code back in. @polivbr in the meantime can you test this change? It looks like there might be some test issue but seeing if the fix solves your issue would be great. |
Signed-off-by: Alyssa Wilk <alyssar@chromium.org>
7226203 to
13174b9
Compare
|
I'd be happy to test these changes, and should be able to confirm in the next few hours. Thanks for the quick turnaround! |
|
Looks good from my end. My test which used to fail within seconds has been running for 30 minutes without issue. |
|
Thanks @polivbr! We will get this reviewed and merged early next week. |
mattklein123
left a comment
There was a problem hiding this comment.
Thanks for fixing. I had some initial questions. Will take a look at tests once I understand the change better.
test/integration/http_integration.cc
Outdated
|
|
||
| void HttpIntegrationTest::testTwoRequests() { | ||
| // This filter exists to synthetically test network backup by faking TCP | ||
| // conneciton back-up when an encode is finished. |
| // If the network connection is backed up, the stream should be made aware of | ||
| // it on creation. Both HTTP/1.x and HTTP/2 codecs handle this. | ||
| ASSERT(read_callbacks_->connection().aboveHighWatermark() == false || | ||
| new_stream->high_watermark_count_ > 0); |
There was a problem hiding this comment.
I'm confused when/how high_watermark_count_ gets incremented in this call chain in order to make this assert true. From looking at the HTTP/2 codec it looks like we run high watermark callbacks in ServerConnectionImpl::onBeginHeaders before we call newStream. Wouldn't we have to run the callbacks instead when we call addCallbacks for this to work properly? I'm probably missing something here. If so can you add more comments?
There was a problem hiding this comment.
If I can interject, the callback are in fact run in addCallbacks.
There was a problem hiding this comment.
Ah yes, I see it, thanks. Can we potentially clarify the comment? This makes sense. I will take a look at the tests now.
| const std::string StreamEncoderImpl::CRLF = "\r\n"; | ||
| const std::string StreamEncoderImpl::LAST_CHUNK = "0\r\n\r\n"; | ||
|
|
||
| StreamEncoderImpl::StreamEncoderImpl(ConnectionImpl& connection) : connection_(connection) { |
There was a problem hiding this comment.
Same question as above about when this call happens in relation to the newStream() call.
mattklein123
left a comment
There was a problem hiding this comment.
Some test comments. Impressively hacky. :)
test/integration/http_integration.cc
Outdated
| } | ||
|
|
||
| Network::ConnectionImpl* connection() { | ||
| // As long as wie're doing horrible things let's do *all* the horrible things. |
test/integration/http_integration.cc
Outdated
| // | ||
| // It is up to the users of this filter to make sure the connection_impl | ||
| // will outlive the timer. | ||
| SelfOwningTimer* timer_container = new SelfOwningTimer; |
There was a problem hiding this comment.
nit: in C++14 you can actually make this a unique_ptr and capture it into the lambda below using "generalized lambda capture". Then you can avoid the raw new/delete which is kind of nice.
|
|
||
| Http::FilterFactoryCb createFilter(const std::string&, Server::Configuration::FactoryContext&) { | ||
| return [&](Http::FilterChainFactoryCallbacks& callbacks) -> void { | ||
| absl::WriterMutexLock m(&encode_lock_); |
There was a problem hiding this comment.
Q: What is the acquiring the lock here required for? Can you add a comment?
test/integration/http_integration.cc
Outdated
| connection_impl->onLowWatermark(); | ||
| delete timer_container; | ||
| }); | ||
| timer_container->timer_->enableTimer(std::chrono::milliseconds(50)); |
There was a problem hiding this comment.
This seems surely destined to cause flakes. Any thoughts on how to make this less flaky? Is there anything we can do with @jmarantz time source stuff here?
There was a problem hiding this comment.
FWIW this should work fine when HttpIntegrationTest is constructed with a simTime().
If this path is exercised when HttpIntegrationTest is constructed with realTIme() I would expect the behavior to vary.
There was a problem hiding this comment.
Howso? The whole point is to semi-randomly block and semi-randomly unblock the socket. AFIK it'll only cause flakes if we actually block the network socket which again I was completely unable to do on all of our linux builds, with large responses and artificially small buffers.
Edit: I guess it could cause flakes if we had timer lifetime problems. I think we shouldn't though I'll think on it more. I think if I tied the timer to the lifetime of the connection we could fix any potential issues there.
There was a problem hiding this comment.
It might be fine, but just from my quick look it seems like it could flake if the 2nd stream comes in either before or after we unblock? Or is the test not meant to be deterministic?
There was a problem hiding this comment.
Yeah, it would not guarantee a repro if the second stream arrived unexpectedly. In practice it happened to fail all the time but I agree on a slow system the repro would not be deterministic.
If we prefer deterministic testing I can fix that by tracking the number of decodes, and unblocking after the second stream is created. I kind of like fuzz testing timing but I wouldn't be super sad to see the self-owning timer and lifetime issues go away
There was a problem hiding this comment.
IMO having a deterministic test would be better if we had one test, but perhaps if it's possible to do determinism we can also leave this test as is? Will defer to you on how to handle.
There was a problem hiding this comment.
Actually I think I like this - I can do "random" encode/decode logic when I generalize for other tess.
Signed-off-by: Alyssa Wilk <alyssar@chromium.org>
Signed-off-by: Alyssa Wilk <alyssar@chromium.org>
As documented in envoyproxy#4541 it appears that in the H2 case both the codec, in [Client|Server]ConnectionImpl::newStream, and the http connection manager in ConnectionManagerImpl::newStream call high watermark callbacks when a new stream is created. This results in double counting for tcp connection level blocks in the H2 path and connection stalls. This PR removes the watermark callbacks from the http connection manager, adds the to the http/1.x codec for consistency, then adds an assert in the http connection manager to theoretically regression test other codecs. Risk Level: High Testing: new integration test Fixes: envoyproxy#4541 Signed-off-by: Alyssa Wilk <alyssar@chromium.org> Signed-off-by: Aaltan Ahmad <aa@stripe.com>
As documented in #4541 it appears that in the H2 case both the codec, in [Client|Server]ConnectionImpl::newStream, and the http connection manager in ConnectionManagerImpl::newStream call high watermark callbacks when a new stream is created. This results in double counting for tcp connection level blocks in the H2 path and connection stalls.
This PR removes the watermark callbacks from the http connection manager, adds the to the http/1.x codec for consistency, then adds an assert in the http connection manager to theoretically regression test other codecs.
This has the ugliest integration tests I have yet written for Envoy. I'm open to suggestions...
Risk Level: High
Testing: yes, unfortunately
Fixes: #4541