http: fixing a watermark bug#4553

Merged
alyssawilk merged 5 commits into envoyproxy:master from alyssawilk:watermark
Oct 2, 2018

alyssawilk (Contributor) commented Sep 27, 2018

As documented in #4541, it appears that in the H2 case both the codec, in [Client|Server]ConnectionImpl::newStream, and the HTTP connection manager, in ConnectionManagerImpl::newStream, call high watermark callbacks when a new stream is created. This results in double counting of TCP connection-level blocks in the H2 path, and in connection stalls.

This PR removes the watermark callbacks from the HTTP connection manager, adds them to the HTTP/1.x codec for consistency, then adds an assert in the HTTP connection manager to theoretically regression test other codecs.

This has the ugliest integration tests I have yet written for Envoy. I'm open to suggestions...

Risk Level: High
Testing: yes, unfortunately
Fixes: #4541

Signed-off-by: Alyssa Wilk <alyssar@chromium.org>
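
A minimal sketch of the double-counting failure described above, assuming the stream tracks connection-level backup with a simple counter. Every name here (ToyStream, stalledAfterDrain) is hypothetical and is not taken from Envoy; the sketch only shows why two "high" notifications paired with a single "low" notification leave a stream blocked.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical stand-in for a per-stream watermark counter; not Envoy's class.
struct ToyStream {
  uint32_t high_watermark_count = 0;

  void onAboveHighWatermark() { ++high_watermark_count; }
  void onBelowLowWatermark() {
    if (high_watermark_count > 0) {
      --high_watermark_count;
    }
  }
  // The stream stays blocked while any high watermark is outstanding.
  bool blocked() const { return high_watermark_count > 0; }
};

// Creates a stream on a backed-up connection, then drains the connection once.
// `notifications_on_create` models how many layers (codec, connection manager)
// each announce the connection-level high watermark to the new stream.
inline bool stalledAfterDrain(int notifications_on_create) {
  ToyStream stream;
  for (int i = 0; i < notifications_on_create; ++i) {
    stream.onAboveHighWatermark();
  }
  // The connection drains and reports the low watermark exactly once.
  stream.onBelowLowWatermark();
  return stream.blocked();
}
```

With one notifying layer, `stalledAfterDrain(1)` reports an unblocked stream; with two, `stalledAfterDrain(2)` leaves the counter at one and the stream never resumes.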
Signed-off-by: Alyssa Wilk <alyssar@chromium.org>
alyssawilk changed the title from "WIP: http: fixing a watermark bug" to "http: fixing a watermark bug" Sep 28, 2018
mattklein123 (Member) commented

Thanks @alyssawilk for jumping on this. I'm not going to get a chance to review until this weekend or Monday and will have to page all of this code back in. @polivbr in the meantime can you test this change? It looks like there might be some test issue but seeing if the fix solves your issue would be great.

Signed-off-by: Alyssa Wilk <alyssar@chromium.org>
polivbr commented Sep 28, 2018

I'd be happy to test these changes, and should be able to confirm in the next few hours. Thanks for the quick turnaround!

polivbr commented Sep 28, 2018

Looks good from my end. My test which used to fail within seconds has been running for 30 minutes without issue.

mattklein123 (Member) commented

Thanks @polivbr! We will get this reviewed and merged early next week.

mattklein123 (Member) left a comment

Thanks for fixing. I had some initial questions. Will take a look at tests once I understand the change better.


void HttpIntegrationTest::testTwoRequests() {
// This filter exists to synthetically test network backup by faking TCP
// conneciton back-up when an encode is finished.

Member:

typo "conneciton"

// If the network connection is backed up, the stream should be made aware of
// it on creation. Both HTTP/1.x and HTTP/2 codecs handle this.
ASSERT(read_callbacks_->connection().aboveHighWatermark() == false ||
new_stream->high_watermark_count_ > 0);

Member:

I'm confused when/how high_watermark_count_ gets incremented in this call chain in order to make this assert true. From looking at the HTTP/2 codec it looks like we run high watermark callbacks in ServerConnectionImpl::onBeginHeaders before we call newStream. Wouldn't we have to run the callbacks instead when we call addCallbacks for this to work properly? I'm probably missing something here. If so can you add more comments?

If I can interject, the callbacks are in fact run in addCallbacks.

Member:

Ah yes, I see it, thanks. Can we potentially clarify the comment? This makes sense. I will take a look at the tests now.
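
To make the point of this thread concrete, here is a rough sketch of a codec stream that replays the connection's high watermark state to callbacks as they are registered, which is what lets the assert hold. The types and method names (ToyCodecStream, ToyStreamCallbacks, invariantHolds) are illustrative assumptions and do not reproduce Envoy's interfaces.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical callback interface; names are illustrative, not Envoy's API.
struct ToyStreamCallbacks {
  uint32_t high_watermark_count = 0;
  void onAboveWriteBufferHighWatermark() { ++high_watermark_count; }
};

struct ToyConnection {
  bool above_high_watermark = false;
};

// Toy codec stream: registering callbacks on a backed-up connection
// immediately replays the high watermark event to the new listener.
class ToyCodecStream {
public:
  explicit ToyCodecStream(const ToyConnection& connection) : connection_(connection) {}

  void addCallbacks(ToyStreamCallbacks& callbacks) {
    callbacks_.push_back(&callbacks);
    if (connection_.above_high_watermark) {
      callbacks.onAboveWriteBufferHighWatermark();
    }
  }

private:
  const ToyConnection& connection_;
  std::vector<ToyStreamCallbacks*> callbacks_;
};

// Mirrors the shape of the assert under review: either the connection is not
// backed up, or the new stream has already been told about it.
inline bool invariantHolds(bool connection_backed_up) {
  ToyConnection connection;
  connection.above_high_watermark = connection_backed_up;
  ToyCodecStream stream(connection);
  ToyStreamCallbacks new_stream;
  stream.addCallbacks(new_stream);
  return !connection.above_high_watermark || new_stream.high_watermark_count > 0;
}
```

In this model the notification happens at registration time rather than at stream construction, so the check passes whether or not the connection is backed up.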

const std::string StreamEncoderImpl::CRLF = "\r\n";
const std::string StreamEncoderImpl::LAST_CHUNK = "0\r\n\r\n";

StreamEncoderImpl::StreamEncoderImpl(ConnectionImpl& connection) : connection_(connection) {

Member:

Same question as above about when this call happens in relation to the newStream() call.

mattklein123 (Member) left a comment

Some test comments. Impressively hacky. :)

}

Network::ConnectionImpl* connection() {
// As long as wie're doing horrible things let's do *all* the horrible things.

Member:

typo "wie're"

//
// It is up to the users of this filter to make sure the connection_impl
// will outlive the timer.
SelfOwningTimer* timer_container = new SelfOwningTimer;

Member:

nit: in C++14 you can actually make this a unique_ptr and capture it into the lambda below using "generalized lambda capture". Then you can avoid the raw new/delete which is kind of nice.

Contributor Author:

TIL :-)
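
For readers unfamiliar with the C++14 feature mentioned above, a small sketch of generalized lambda capture (init-capture) moving a unique_ptr into a closure. ToyTimerContainer and the helper functions are hypothetical and stand in for the test's timer container; this is not the code that was merged. One caveat worth hedging: a closure that owns a unique_ptr is move-only, and std::function requires a copy-constructible callable, so if a timer API takes a std::function the closure cannot be stored in it as-is.

```cpp
#include <cassert>
#include <memory>
#include <utility>

// Hypothetical stand-in for the timer container in the test filter.
struct ToyTimerContainer {
  ToyTimerContainer(int* fired, int* destroyed) : fired_(fired), destroyed_(destroyed) {}
  ~ToyTimerContainer() { ++*destroyed_; }
  void onFire() { ++*fired_; }

  int* fired_;
  int* destroyed_;
};

// Builds a callback that owns its container through a C++14 init-capture.
// The container is freed when the closure is destroyed, with no raw delete.
inline auto makeOwningCallback(int* fired, int* destroyed) {
  auto container = std::make_unique<ToyTimerContainer>(fired, destroyed);
  return [owned = std::move(container)]() { owned->onFire(); };
}

// Runs the callback once and reports how many containers were destroyed
// after the closure went out of scope.
inline int runOnce() {
  int fired = 0;
  int destroyed = 0;
  {
    auto callback = makeOwningCallback(&fired, &destroyed);
    callback();
    assert(fired == 1 && destroyed == 0);
  }
  return destroyed;
}
```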


Http::FilterFactoryCb createFilter(const std::string&, Server::Configuration::FactoryContext&) {
return [&](Http::FilterChainFactoryCallbacks& callbacks) -> void {
absl::WriterMutexLock m(&encode_lock_);

Member:

Q: What is acquiring the lock here required for? Can you add a comment?

connection_impl->onLowWatermark();
delete timer_container;
});
timer_container->timer_->enableTimer(std::chrono::milliseconds(50));

Member:

This seems surely destined to cause flakes. Any thoughts on how to make this less flaky? Is there anything we can do with @jmarantz time source stuff here?

Contributor:

FWIW this should work fine when HttpIntegrationTest is constructed with a simTime().

If this path is exercised when HttpIntegrationTest is constructed with realTime() I would expect the behavior to vary.

alyssawilk (Contributor, Author) Oct 1, 2018

How so? The whole point is to semi-randomly block and semi-randomly unblock the socket. AFAIK it'll only cause flakes if we actually block the network socket, which again I was completely unable to do on all of our Linux builds, with large responses and artificially small buffers.

Edit: I guess it could cause flakes if we had timer lifetime problems. I think we shouldn't, though I'll think on it more. I think if I tied the timer to the lifetime of the connection we could fix any potential issues there.

Member:

It might be fine, but just from my quick look it seems like it could flake if the 2nd stream comes in either before or after we unblock? Or is the test not meant to be deterministic?

Contributor Author:

Yeah, it would not guarantee a repro if the second stream arrived unexpectedly. In practice it happened to fail all the time but I agree on a slow system the repro would not be deterministic.

If we prefer deterministic testing, I can fix that by tracking the number of decodes and unblocking after the second stream is created. I kind of like fuzz testing timing, but I wouldn't be super sad to see the self-owning timer and lifetime issues go away.

Member:

IMO having a deterministic test would be better if we had one test, but perhaps if it's possible to do determinism we can also leave this test as is? Will defer to you on how to handle.

Contributor Author:

Actually I think I like this - I can do "random" encode/decode logic when I generalize for other tests.
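
The deterministic variant discussed in this thread (count decodes, unblock once the second stream exists) could look roughly like the sketch below. It is a toy model under stated assumptions: ToyConnection, ToyBackupFilter, and runTwoRequests are hypothetical names, and no timer is involved, so the ordering does not depend on how fast the machine is.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical connection double; only tracks whether it reports backup.
struct ToyConnection {
  bool above_high_watermark = false;
  void onHighWatermark() { above_high_watermark = true; }
  void onLowWatermark() { above_high_watermark = false; }
};

// Toy test filter: fakes backup when an encode finishes and releases it only
// once a chosen number of streams have started decoding.
class ToyBackupFilter {
public:
  ToyBackupFilter(ToyConnection& connection, uint32_t unblock_after_decodes)
      : connection_(connection), unblock_after_decodes_(unblock_after_decodes) {}

  void onEncodeComplete() { connection_.onHighWatermark(); }

  void onDecodeStart() {
    ++decodes_seen_;
    if (connection_.above_high_watermark && decodes_seen_ >= unblock_after_decodes_) {
      connection_.onLowWatermark();
    }
  }

private:
  ToyConnection& connection_;
  const uint32_t unblock_after_decodes_;
  uint32_t decodes_seen_ = 0;
};

struct Outcome {
  bool second_stream_saw_backup;
  bool unblocked_at_end;
};

// Drives two requests in a fixed order and records what the second one saw.
inline Outcome runTwoRequests() {
  ToyConnection connection;
  ToyBackupFilter filter(connection, 2);
  filter.onDecodeStart();    // first request arrives
  filter.onEncodeComplete(); // first response finishes; fake the backup
  const bool saw_backup = connection.above_high_watermark; // state at second stream creation
  filter.onDecodeStart();    // second request arrives and triggers the unblock
  return {saw_backup, !connection.above_high_watermark};
}
```

Because the unblock is keyed to the decode count, the second stream is always created while the connection is backed up, which is the condition the regression test needs to reproduce.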

Signed-off-by: Alyssa Wilk <alyssar@chromium.org>
Signed-off-by: Alyssa Wilk <alyssar@chromium.org>

mattklein123 (Member) left a comment

LGTM. Very nice!

alyssawilk merged commit fc7dfc6 into envoyproxy:master Oct 2, 2018
aa-stripe pushed a commit to aa-stripe/envoy that referenced this pull request Oct 11, 2018
As documented in envoyproxy#4541 it appears that in the H2 case both the codec, in [Client|Server]ConnectionImpl::newStream, and the http connection manager in ConnectionManagerImpl::newStream call high watermark callbacks when a new stream is created. This results in double counting for tcp connection level blocks in the H2 path and connection stalls.

This PR removes the watermark callbacks from the HTTP connection manager, adds them to the HTTP/1.x codec for consistency, then adds an assert in the HTTP connection manager to theoretically regression test other codecs.

Risk Level: High
Testing: new integration test
Fixes: envoyproxy#4541

Signed-off-by: Alyssa Wilk <alyssar@chromium.org>
Signed-off-by: Aaltan Ahmad <aa@stripe.com>
alyssawilk deleted the watermark branch November 28, 2018 16:03