
Delay connection reuse for a poll cycle to catch closed connections.#7159

Closed
jplevyak wants to merge 3 commits into envoyproxy:master from jplevyak:delay-connection-reuse

Conversation

@jplevyak
Contributor

@jplevyak jplevyak commented Jun 4, 2019

Signed-off-by: John Plevyak jplevyak@gmail.com


Description: Currently we attempt to delay connection reuse until after a poll() cycle, but the connection can be reused more quickly if another request comes in before the poll(), and because libevent does not guarantee the order of events. This change uses a separate delay_ list and delays reuse for 2 cycles to ensure that a poll() occurs in between. This dramatically reduces the number of reused-but-closed connections in production tests.
Risk Level: Low
Testing: 24 hours with 5 apps in production.
Docs Changes:
Release Notes:
Fixes: istio/istio#13848
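The two-cycle mechanism described above can be sketched roughly as follows (a minimal sketch with illustrative names — `Pool`, `onClientIdle`, `onPollCycle` are not Envoy's actual conn-pool code): idle clients sit on a delayed list with a countdown of 2, decremented once per event-loop cycle, so at least one full poll() is guaranteed to happen before a client returns to the ready list.

```cpp
#include <list>

struct Client {
  int id;
  int delay = 2;  // cycles remaining before the client may be reused
};

struct Pool {
  std::list<Client> delayed_clients_;
  std::list<Client> ready_clients_;

  // An idle client goes onto the delayed list with a 2-cycle countdown.
  void onClientIdle(int id) { delayed_clients_.push_back({id, 2}); }

  // Called once per event-loop (poll) cycle: decrement each countdown and
  // promote a client only when it reaches zero, guaranteeing at least one
  // full poll() between going idle and becoming reusable.
  void onPollCycle() {
    for (auto it = delayed_clients_.begin(); it != delayed_clients_.end();) {
      if (--(it->delay) == 0) {
        ready_clients_.push_back(*it);
        it = delayed_clients_.erase(it);
      } else {
        ++it;
      }
    }
  }
};
```

A client idled during cycle X thus becomes reusable only during cycle X+2, after the poll() in between has had a chance to deliver any pending close event.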

Signed-off-by: John Plevyak <jplevyak@gmail.com>
@jplevyak
Contributor Author

jplevyak commented Jun 4, 2019

Note: this includes changes that will be merged out when #7124 is merged.

@mattklein123 mattklein123 self-assigned this Jun 4, 2019
@mattklein123
Member

Please don't mix multiple PRs that are open. At first glance I have concerns about this PR but I don't want to actually review with the other one open. Does this PR supersede the other one?

/wait

@jplevyak
Contributor Author

jplevyak commented Jun 4, 2019

Sorry, the test changes here depend on the HTTP/1.0 fix, which is why that is included here. I will ping this thread when I have merged the HTTP/1.0 fix.


Signed-off-by: John Plevyak <jplevyak@gmail.com>
@jplevyak
Contributor Author

jplevyak commented Jun 5, 2019

I have updated the PR and it is now ready for review.

Signed-off-by: John Plevyak <jplevyak@gmail.com>
Member

@mattklein123 mattklein123 left a comment


I will be honest that I'm not a fan of this change. IMO it's just kicking the race condition can down the road. I'm not sure how 2 event loop cycles is any different than 3, etc. since we don't know when the remote close will actually happen.

If we are going to do such a thing, I think the correct solution here is to add a config option to allow connections to be delayed for X amount of time before being moved from the delayed list back to the active list (or just some fixed amount of time).

/wait
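A time-based variant along the lines suggested above could look roughly like this (a hypothetical sketch; `DelayedPool` and its members are illustrative, not a proposed Envoy API): each idle connection records a deadline, and a periodic pass promotes only the entries whose deadline has passed.

```cpp
#include <chrono>
#include <deque>

using Clock = std::chrono::steady_clock;

struct DelayedConn {
  int id;
  Clock::time_point ready_at;  // earliest time this connection may be reused
};

class DelayedPool {
public:
  explicit DelayedPool(std::chrono::milliseconds delay) : delay_(delay) {}

  // An idle connection is stamped with "now + configured delay".
  void onConnIdle(int id, Clock::time_point now) {
    delayed_.push_back({id, now + delay_});
  }

  // Called from the event loop (e.g. off a timer); promotes expired entries
  // and returns how many were promoted. Entries are in FIFO order, so we can
  // stop at the first unexpired one.
  int promoteReady(Clock::time_point now) {
    int promoted = 0;
    while (!delayed_.empty() && delayed_.front().ready_at <= now) {
      delayed_.pop_front();  // in a real pool, move to the ready list instead
      ++promoted;
    }
    return promoted;
  }

private:
  std::chrono::milliseconds delay_;
  std::deque<DelayedConn> delayed_;
};
```

With `delay` set to 0 this degenerates to the current behavior, which is one way such a feature could default to off.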

@jplevyak
Contributor Author

jplevyak commented Jun 6, 2019

So, while I agree that this is not the ultimate solution — and in fact, given that the upstream can close at any time, there is no ultimate solution — consider that:

  1. we already have code which attempts to do this, but which fails because libevent provides no ordering guarantees and because another connection request can come in before the poll() loop.
  2. this was created directly in response to a customer issue ("Almost every app gets UC errors, 0.012% of all requests in 24h period", istio/istio#13848); it was tested by the customer and addressed a non-trivial part of the problem.

@mattklein123
Member

I'm glad that it addressed the customer issue, but I think this is still ultimately a hack and I'm not in favor of adding this type of complexity to the code base.

If we want to discuss adding a configurable delay timer, that sounds reasonable to me. cc @lizan @alyssawilk for further comment.

@mattklein123
Member

Actually, can you also refresh my memory of why we need this delay at all? Why doesn't the upstream server indicate it is going to close the connection?

@jplevyak
Contributor Author

jplevyak commented Jun 6, 2019

If you don't like this change, then I would suggest that we remove the existing hack which attempts to prevent immediate reuse: it adds nearly the same level of complexity, it was ineffective in practice for the reasons I outlined above, and it provides a false sense of security.

@mattklein123
Member

+1 to remove it, but I'm trying to remember why we added it in the first place. @lizan?

@jplevyak
Contributor Author

jplevyak commented Jun 6, 2019

Sadly, neither the HTTP/1.0 nor the HTTP/1.1 specification requires the upstream server to signal in any way that it is going to close the connection. The upshot is that it is perfectly acceptable for a server to complete an operation and then close the connection without sending a Connection: close header. Before the HTTP/1.0 fix, envoy would routinely attempt to reuse a connection to a 1.0 server which it had immediately closed after sending the response. Such a request results in an error, but not necessarily before some data has been written into the write buffer, particularly since the order of actions is WRITE then READ. When this happens for a POST, it must not be retried, so the error propagates back to the client.

@mattklein123
Member

mattklein123 commented Jun 6, 2019

Before the HTTP/1.0 fix, envoy would routinely attempt to reuse a connection to a 1.0 server which it had immediately closed after sending the response.

Understood, that was clearly a bug and I'm glad we fixed it.

What I'm wondering is whether the original defer fix was actually "fixing" HTTP/1.0 upstreams, or we are actually dealing with servers not indicating close. Given that you also did the HTTP/1.0 fix, presumably your customer is using HTTP/1.0 and their issues will be fixed? Or do they also have issues with HTTP/1.1? Can they modify their code to send the close header if they are going to close the connection?

My feeling on the delay "fix," is that if we want to do it, it should actually be an optional config where if a timer is configured the connection is put on the delayed list for X period of time. In the default case we wouldn't do anything and would effectively revert the existing hack/fix. WDYT?

@lizan
Member

lizan commented Jun 6, 2019

re: the HTTP/1.0 fix, IIRC we've never supported HTTP/1.0 upstreams; there was only limited, not fully standards-compliant, downstream HTTP/1.0 support in Envoy.

The original fix to delay the pending-request attach process is due to Envoy's HTTP codecs not being half-close aware. That change only saves the case where Envoy can read the FIN before processing the next request. #2715 and #2871 have more context on that.

@jplevyak Can you elaborate on why you need the delay_ counter set to 2? I don't think that is needed if you move clients to the delay_clients_ list in one poll cycle and back to the ready_clients_ list in another; this change seems to delay for 3 cycles. Also, @jplevyak, do we know the upstream stack and version which causes this?

@mattklein123
Member

@lizan I don't see what half-close has to do with this. That should not be relevant for an upstream connection. We are about to rehash the giant argument in the issue you linked to.

The upstream server is not indicating it is immediately closing the connection and in the real world that is a bug. I'm strongly against continued "fixes" in this area, especially one like this with counting poll loops, etc.

If we want to add an opt in feature to delay connection reuse to deal with buggy upstream servers that's the way I would approach this.

@lizan
Member

lizan commented Jun 6, 2019

@mattklein123 HTTP doesn't support half-close, so when the codec tries to see whether the upstream connection is half-closed it always gets false. #2715 (comment)

@mattklein123
Member

Ok, yes, agreed on that part. I don't remember why we fixed it this way with a single poll cycle vs some other more direct way but I would have to go back through. I think we did it the way we did due to regression risk.

In any case, that's a specific case which is not this. I'm not ok with this unless we want to do an opt in generic delay feature.

@lambdai
Contributor

lambdai commented Jun 6, 2019

envoy has the idle_timeout to actively close the upstream connection, tackling upstreams that close after being idle for a certain time (5s by default for a node.js http server).

If we know nothing about the upstream, envoy can theoretically never prevent a concurrent close by the upstream. However, there is a good chance that the upstream http server closes immediately after the http response.

Combining this PR with an upstream idle_timeout setting, we can eliminate 99.99% of sane-upstream errors at almost zero cost while keeping the connection pool enabled.

@lizan
Member

lizan commented Jun 6, 2019

I think we did it the way we did due to regression risk.

Right because having whole HTTP stack half-close aware would be a large change.

The upstream server is not indicating it is immediately closing the connection and in the real world that is a bug. I'm strongly against continued "fixes" in this area, especially one like this with counting poll loops, etc.

+1, I don't think the counting is needed; the counting part needs more explanation. If this just complements some cases that I missed in #2871 then it is probably fine (which is not very clear right now).

@lambdai
Contributor

lambdai commented Jun 6, 2019

Can you elaborate on why you need the delay_ counter set to 2?

@lizan I can explain. It could happen that

  1. poll batch X: client A registers clean-up for the next poll batch.
  2. poll batch X+1: before the clean-up runs, client B registers its clean-up, intending it to run in X+2.
  3. poll batch X+1: at the clean-up, both A and B are cleaned, because both are in the same list, which is unexpected behavior for client B.

There might be alternative approaches to achieving the goal that a client should only be reused after at least one poll. IMHO "die at the second clean-up" might be the easy one, if we don't want to rely on the subtle implementation details of libevent.
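The race in the steps above can be illustrated with a small simulation (all names hypothetical, not Envoy code): with a single shared list and one pending clean-up event, a client that goes idle in batch X+1 before that batch's clean-up fires is flushed together with batch X's clients — i.e. reused with zero intervening poll() calls, which is exactly what the delay_ counter of 2 is meant to prevent.

```cpp
#include <utility>
#include <vector>

// Simulate poll batches under a naive single-list scheme; returns
// {polls A saw before reuse, polls B saw before reuse}.
std::pair<int, int> simulateSingleListRace() {
  std::vector<int> delayed;   // batch in which each client went idle
  int cleanup_batch = -1;     // batch the single pending clean-up fires in
  int polls_a = -1, polls_b = -1;
  for (int batch = 0; batch <= 2; ++batch) {
    // Registrations run first here; libevent gives no ordering guarantee,
    // so the clean-up callback can fire after a same-batch registration.
    if (batch == 0) {
      delayed.push_back(0);   // A idles in batch X, schedules clean-up for X+1
      cleanup_batch = 1;
    }
    if (batch == 1) {
      delayed.push_back(1);   // B idles in batch X+1, rides A's pending clean-up
    }
    if (batch == cleanup_batch) {
      for (int idle_batch : delayed) {
        int polls = batch - idle_batch;  // full poll() calls before reuse
        if (idle_batch == 0) polls_a = polls; else polls_b = polls;
      }
      delayed.clear();
      cleanup_batch = -1;
    }
  }
  return {polls_a, polls_b};
}
```

A sees one full poll() before reuse, but B sees none — the unexpected behavior for client B described in step 3.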

@lizan
Member

lizan commented Jun 6, 2019

  poll batch X+1: at the clean-up, both A and B are cleaned, because both are in the same list, which is unexpected behavior for client B.

@silentdai This is not unexpected, it is intended: the only case that we're saving in #2871 is until the end of the read event, regardless of which poll cycle. The use of the poll cycle is to make sure the clean-up happens after processing the specific read event (at the end of the read event it will read end of stream).

@lambdai
Contributor

lambdai commented Jun 6, 2019

at the end of read event it will read end of stream

How? Where is the attempt to read? Do you mean the one that returns EAGAIN to adapt to edge triggering?

What #2871 is missing is that processIdleClient moves the client to the ready list too early. See my comment in #2871. The solution is to delay insertion into the ready list by at least 1 poll cycle. Let's skip the counter = 2 part.

@lizan
Member

lizan commented Jun 6, 2019

@silentdai See my comment there. In https://github.com/envoyproxy/envoy/blob/v1.10.0/source/common/network/connection_impl.cc#L486, if the transport socket reads some bytes + EOS (this is the ONLY case that #2871 is trying to save), the upstream connection will be closed after the filters process the read bytes, and then the client will be removed.

@lambdai
Contributor

lambdai commented Jun 6, 2019

@lizan I see what you mean. Even though I didn't really calculate the numbers, I admit that #2871 should already have reduced a few of the 503 cases. However, there are more.

transport socket read some bytes + EOS

In the real world, the EOS (or FIN with the raw transport socket) and the last few bytes will often arrive in separate socket reads. We observe that in load tests; even in a medium-load test, we see cases where the EOS is not read until the next poll cycle.

The load test below will produce its first 503 within 20K requests.

  • upstream: node.js or python SimpleHTTPServer speaking HTTP/1.0 (half-close is needed)
  • envoy: concurrency = 2 or 1, talking to the upstream through lo
  • client: ab or siege, concurrency = 10

With this PR, the 503 rate is reduced to 1/100 or even lower.

@mattklein123
Member

Right because having whole HTTP stack half-close aware would be a large change.

@lizan in looking at this briefly again, I'm wondering why we didn't just add a halfClosedStatus() API or similar to the connection and just call that instead of keeping track of remote_closed_ in codec client. That would have been simple. In any case, totally understood why we did the single poll loop, because we will get the closed event immediately once the stack unwinds. Can we potentially add a comment to the conn pool implementation to remind us of this if we don't already have one?

@silentdai @jplevyak the case you are trying to "fix" here is different: Envoy has not yet read the FIN from the connection; it's coming some time later. The only change I'm amenable to in this area is an opt-in feature to allow connection reuse to be delayed by some number of milliseconds. This can be off by default and then Istio can opt in. If you want to build the feature that way, let's go for it. Otherwise I would recommend that you tell the customer to indicate that they are going to close the connection in their response.

@lambdai
Contributor

lambdai commented Jun 6, 2019

delayed by some number of milliseconds

We could go this way, although delaying one poll cycle is more elegant and practical. All known http servers/applications either close immediately or after some idle seconds. In the former case, the period between the last byte and the EOS is unpredictable: we know it's short, but no one can bound it. A larger number is always safer, but it reduces the connection reuse rate.

I think the one-poll-cycle delay in this PR is a worthwhile approach, since:

  • it's adaptive under both heavy and light traffic
  • it doesn't introduce cost
  • it works well in the real world
  • it saves the user the effort of tuning the best timeout value, which we know is not a fixed number

BTW, it's not only istio that is affected; it's that a microservice architecture is more 503-sensitive than a monolith.

@lizan
Member

lizan commented Jun 6, 2019

I agree this is not an Istio-specific issue, though as you said:

the period between last byte and EOS is unpredictable.

So poll cycles are not an elegant way to solve this: when you delay N cycles, the EOS may arrive in the (N+1)-th cycle. The number is determined by the deployment, so if we do this it should be configurable.

As noted in #2715, a retry (at the Envoy level or the application level) is the preferred approach, #2715 (comment). Regardless of POST or GET, the status code 503 has retryable semantics defined in RFC 7231 (cc @jplevyak).

@mattklein123
Member

So poll cycles are not an elegant way to solve this: when you delay N cycles, the EOS may arrive in the (N+1)-th cycle. The number is determined by the deployment, so if we do this it should be configurable.

Exactly. Ultimately, you are trying to introduce a delay. If you want a delay, let's build it right which means making it configurable, off by default, and with a variable millisecond delay. This is the only solution that I'm amenable to as I said above.

All known http server/application either close immediately or after idle seconds. At former case, the period between last byte and EOS is unpredictable. we know it's short but no one can tell.

All well-behaving HTTP/1.1 servers indicate they are going to close the connection if they are going to immediately close it (Envoy does this). As I have said over and over again, here and in the linked issues, this is a well-known timing issue with HTTP/1.1.

So to summarize, the options here are to:

  1. Drop this change
  2. Implement it correctly with an optional re-use delay timer.

@lizan
Member

lizan commented Jun 6, 2019

A side question @silentdai @jplevyak did you try #7124 and this change individually with the customer for istio/istio#13848 or together? I'm curious how much of each change contributes to the fix.

@jplevyak
Contributor Author

jplevyak commented Jun 6, 2019

While "Regardless of POST or GET, the status code 503 has retryable semantics defined in RFC 7231", the problem is that the upstream server isn't returning a 503; it is just closing the connection. A broken connection is not retryable for a POST, so envoy can't retry without overriding the spec.

This problem manifests in production with some frequency because (it seems) both node (with some frameworks) and Apache HTTPD use a 5 second keep-alive timeout, so a potential solution is to use a 4 second inactivity timeout in envoy.

@alyssawilk
Contributor

For clarification, is this trying to institute a delay, or trying to make sure that a FIN which was already machine local is read into user-space?

If it's the latter, I do support making sure that we detect closed connections, even though, as Matt notes, there's no guarantee that the terminal FIN actually arrives with the terminal data. I also agree that waiting 2 loops feels kind of hacky. Could we potentially peek before returning the connection to the pool? I'm not sure peeking actually works for FINs, but if it does, it might be a compromise.
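For reference, a peek-style probe is straightforward to write, though — as discussed later in the thread — it only helps when the FIN has already reached the local machine; a FIN still in flight will race the probe. A sketch (illustrative, not Envoy code):

```cpp
#include <sys/socket.h>
#include <unistd.h>

// Non-destructive probe: peek one byte without consuming it or blocking.
// recv() returning 0 means the peer has closed its write side (the FIN is
// already local); -1 with EAGAIN means the connection is still open with
// no data pending, so it is safe-looking (but not guaranteed safe) to reuse.
bool finAlreadyLocal(int fd) {
  char c;
  ssize_t n = recv(fd, &c, 1, MSG_PEEK | MSG_DONTWAIT);
  return n == 0;
}
```

Peeking does work for FINs (a zero-length read signals EOF), but only for FINs that have already arrived — the race window between the probe and the subsequent write remains.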

@jplevyak
Contributor Author

jplevyak commented Jun 6, 2019

Regarding the customer issue: I don't know the relative contribution of each change. I think the biggest contribution to "fixing" the issue was dramatically increasing the keep-alive timeout on the upstream server.

I would be for dropping this change and removing the existing partial delay code.

We could add a reuse delay, but I don't think that is the right solution. Instead, I suggest that we instrument the code to determine under what conditions we see unexpected errors on connection reuse. If we can determine (for example) that "Server: Apache.*" connections are dropped after a particular time, we can set a default timeout a bit shorter than that.

@lizan
Member

lizan commented Jun 6, 2019

For clarification, is this trying to institute a delay, or trying to make sure that a FIN which was already machine local is read into user-space?

I would be for dropping this change and removing the existing partial delay code.

Again, the existing partial delay code is NOT trying to institute a delay, but to make sure that a FIN which is already read into user-space takes effect while the stack unwinds. I'm fine if you revert it and go with #7159 (comment) — halfClosedStatus() on the connection — or make the HTTP codec fully half-close aware.

@jplevyak
Contributor Author

jplevyak commented Jun 6, 2019

Regarding the FIN local/peek question: silentdai tried that, and I also instrumented the code to see what was happening when the server closed immediately after sending the response but without Connection: close. poll() returns POLLRDHUP, which libevent converts into EV_CLOSED, which file_event_impl.cc converts into FileReadyType::Closed, which envoy then ignores if we are reading and are not using half-close semantics (which is the case for HTTP during connection reuse). Delaying by a poll() loop lets us detect the close as a Network::ConnectionEvent::RemoteClose.

Probing also detected this case.

However, when the upstream server closed after some delay (e.g. 5 seconds), there was a race: we found that both the delay and the probe were at times unable to prevent an error, because the FIN could arrive after the probe but before the read/write.

So, to answer your question, peeking/probing the connection still doesn't solve the problem, but it does decrease the probability, just as delaying a poll() cycle does. I'll leave it to silentdai to tell us if he knows how much each change decreases the probability.

@jplevyak
Contributor Author

jplevyak commented Jun 6, 2019

lizan, the code in onReadReady calls doRead on the socket, which loops, so that if there was already a FIN along with the data on the socket (e.g. a write + close by the server), the code would detect it even without the existing code which attempts to delay reuse. That code isn't necessary in this case. The only case in which it might be useful is if there is a delay between the doRead and when the FIN arrives.

That is also the case that this PR is addressing. That is why I agree with Matt. We should remove the existing code and drop this PR and if we want to add a delay before reuse we can do so with an explicit time limit. However, I still think that we should just instrument the code and try to determine the circumstances of unexpected closes as a delay might not be the best solution. The two cases I know about are the HTTP/1.0 missing keep-alive (now fixed) and the 5 second timeout on Apache HTTPD and some node servers. The latter is not addressed by this PR or the existing delaying reuse code, so I don't have a real world example which requires the delay.

I am curious if anyone else does.

@lizan
Member

lizan commented Jun 6, 2019

lizan, the code in onReadReady calls doRead on the socket, which loops, so that if there was already a FIN along with the data on the socket (e.g. a write + close by the server), the code would detect it even without the existing code which attempts to delay reuse. That code isn't necessary in this case. The only case in which it might be useful is if there is a delay between the doRead and when the FIN arrives.

No, see https://github.com/envoyproxy/envoy/blob/v1.10.0/source/common/network/connection_impl.cc#L486: HTTP connections have enable_half_close_ = false, so the EOS is not propagated to the filters until the stack unwinds.

A server immediately closing the connection without Connection: close or an HTTP/1.0 indicator (thanks for the #7124 fix) is a real-world bug which should be fixed on the server side.

@lambdai
Contributor

lambdai commented Jun 6, 2019

A server immediately closing the connection without Connection: close or an HTTP/1.0 indicator (thanks for the #7124 fix) is a real-world bug which should be fixed on the server side.

I don't know how much work would be needed to fix this bug, as it's not a bug in envoy's code base.

@jplevyak
Contributor Author

jplevyak commented Jun 6, 2019

lizan, I am looking not at the Event::FileReadyType::Closed event — which indeed, as you point out, is not handled because of the lack of half-close handling — but at RawBufferSocket::doRead, which detects the EOF as read() returning 0 (result.rc_ == 0) and sets end_stream = true in the IoResult. That is then handled in ConnectionImpl::onReadReady by converting it into PostIoAction::Close, which results in closeSocket(ConnectionEvent::RemoteClose).

This is how envoy detects an immediate close that has left a FIN pending on the socket.

@alyssawilk
Contributor

"I don't know much work need to be done to fix this bug, as it's not bug in envoy's code base."

I agree it's not a bug in Envoy's code base, but it's a pretty common problem (it has come up with a number of users and a number of deployments), and as we work on supporting Envoy as a forward proxy speaking to the open internet (where we can't fix all buggy upstreams), I do think it's one we're going to want to handle gracefully.

@jplevyak
Contributor Author

This code catches an immediate close after the completion of a transaction. Before the 1.0 fix this would happen on every request for 1.0 servers which did not send Connection: close (e.g. flask).

As per our discussion, I don't think that this delay is needed anymore, since the other cases which we have seen involve closes some time later (i.e. when the upstream server's inactivity timeout fires).

I am going to close this and create a PR to remove the existing delay code (which doesn't work as intended in any case).

I will also open an issue to add instrumentation for unexpected closes.

@jplevyak jplevyak closed this Jun 12, 2019
@jplevyak
Contributor Author

#7248

jplevyak added a commit to jplevyak/envoy that referenced this pull request Jun 13, 2019
incomplete in that libevent doesn't guarantee ordering of events,
so delaying for a single poll cycle does not address the issue.
Also, because the connection is left in the ready queue, it is
available to another request coming in during the same poll cycle,
or in the next poll cycle before the ready connection is processed.
Finally, this would only catch connections closed immediately,
which should be very infrequent since the HTTP/1.0 fix, which
eliminated the cases where a missing Connection: keep-alive
was not interpreted as Connection: close, resulting in reuse
of a connection which was immediately closed.

See envoyproxy#7159 for details.

Signed-off-by: John Plevyak <jplevyak@gmail.com>