http: mitigate delayed close timeout race with connection write buffer flush #6437
Conversation
This commit changes the behavior of the delayed close timeout so that the timer is only enabled after a connection's write buffer is flushed. Previously, this timeout was serving double duty as an upper bound on the time that a connection would spend waiting for a write flush as well. However, this behavior created a race where a small delayed close timeout or slow clients could cause the timer to trigger and the connection to be closed prior to the write buffer being fully flushed. Therefore, this commit simplifies the logic and eliminates the idle timer behavior. Signed-off-by: Andres Guedez <aguedez@google.com>
EXPECT_EQ(test_server_->counter("tcp.tcp_stats.upstream_flush_total")->value(), 1);
EXPECT_EQ(test_server_->gauge("tcp.tcp_stats.upstream_flush_active")->value(), 0);
test_server_->waitForGaugeEq("tcp.tcp_stats.upstream_flush_active", 0);
This change fixes flakiness I experienced while testing this PR.
@alyssawilk PTAL since you also recently addressed flakiness in this test via #5919.
@AndresGuedez @alyssawilk high level question: Do we need two timers here? It seems like we still need an outer bound timeout within which to wait for the data to flush? I'm not sure this path is covered by any other timer at this point in the flow? (I might not be recalling this correctly.)
My thinking on this right now is that idle timeouts at a higher layer should cover this. Are there edge cases in the connection life cycle that wouldn't be covered by these higher layer idle timeouts?
I'm not sure, I don't remember. I can't remember when the HTTP idle timer gets re-armed. Let me review the code a bit over the next few days. Thanks a ton for jumping on this.
I just reviewed the code in conn_manager_impl, and I agree that the idle timer should still cover this case: we would re-arm the timer and then eventually time out and close the connection if we are not able to fully flush it while waiting for the other side to respond. @alyssawilk can you check me on this? Assuming we agree ^ is true this seems like a good thing to do, although I wonder if there are any other callers of this that might now hang without any guarding timeout? @alyssawilk are you interested in taking the initial pass on this review?
Yeah, I think in theory at the point we've set up a drain-close the socket should be read-disabled (we do not get to actively use it again) and either it flushes and we close after the timeout interval, or it does not flush and the connection_idle_timer_ goes off. Looks like the idle timer does FlushWrite so as long as we have UTs for FlushWriteAndDelay followed by FlushWrite we should be OK. Happy to take a look but it'll have to be Wednesday as I'm booked the rest of today and out tomorrow :-(
Thanks for taking a look! I revisited the (connection) idle timeout logic, and I see a couple of viable options:
I am actively working on option 2 since I think it leads to simpler logic.
@AndresGuedez (2) sounds great to me. |
Revert the previous approach of enabling the delayed close timer only after the flush is completed. Instead, the timer is enabled when a close(FlushWrite) or close(FlushWriteAndDelay) is issued and the timer is reset when the onWriteReady event fires and data is written to the socket. This prevents the race between flushing and the timer triggering while preserving an upper bound on inactivity after the close() is issued. Signed-off-by: Andres Guedez <aguedez@google.com>
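The revised approach in this commit (arm the timer when close() is issued, re-arm it whenever a write makes progress) can be sketched with a tiny model. This is an illustrative simulation only; the names below are hypothetical and are not Envoy's actual ConnectionImpl API:

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical, simplified model of the revised behavior: the delayed close
// timer is armed when close(FlushWrite*) is issued and re-armed whenever a
// write makes progress, so it only fires after a full timeout interval of
// write inactivity.
struct DelayedCloseModel {
  int64_t timeout_ms;
  int64_t deadline_ms = -1; // -1 => timer disarmed

  // close(FlushWrite) / close(FlushWriteAndDelay): arm the timer immediately.
  void onClose(int64_t now_ms) { deadline_ms = now_ms + timeout_ms; }

  // onWriteReady wrote some data: progress resets the inactivity window.
  void onWriteProgress(int64_t now_ms) {
    if (deadline_ms != -1) {
      deadline_ms = now_ms + timeout_ms;
    }
  }

  // Would the delayed close timer have fired by now_ms?
  bool expired(int64_t now_ms) const {
    return deadline_ms != -1 && now_ms >= deadline_ms;
  }
};
```

Under this model a slow client that keeps making write progress never trips the timer mid-flush, which is the race being mitigated, while a fully inactive connection is still closed one timeout interval after the last write event.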
alyssawilk left a comment
I think this looks right at a high level. I definitely want to take another pass Monday with a pencil and paper and caffeine, but here's a couple of nits in the interim.
// from the downstream connection) prior to Envoy closing the socket associated with that
// connection. Note that this timeout may trigger a socket close even when Envoy's write
// buffer has not been fully flushed after close processing is initiated; this happens when Envoy
// is unable to write data to the socket for an interval greater than or equal to this timeout
I think this comment could use clarification, as it reads ambiguously between "cannot write any data to the socket during T" and "cannot write all data to the socket during T", and it makes it look like the grace period is absolute from the close call rather than a progress check.
docs/root/intro/version_history.rst
Outdated
* http: added :ref:`max request headers size <envoy_api_field_config.filter.network.http_connection_manager.v2.HttpConnectionManager.max_request_headers_kb>`. The default behaviour is unchanged.
* http: added modifyDecodingBuffer/modifyEncodingBuffer to allow modifying the buffered request/response data.
* http: added encodeComplete/decodeComplete. These are invoked at the end of the stream, after all data has been encoded/decoded respectively. Default implementation is a no-op.
* http: fixed a bug with the :ref:`delayed_close_timeout<envoy_api_field_config.filter.network.http_connection_manager.v2.HttpConnectionManager.delayed_close_timeout>` where it could trigger prior to flushing the write buffer for the downstream connection.
Maybe update here as well? It can still trigger, but won't trigger if progress is being made?
if (delayed_close_) {
if (inDelayedClose()) {
  ASSERT(!delayed_close_timeout_set || delayed_close_timer_ != nullptr);
  delayed_close_state_ = (type == ConnectionCloseType::FlushWrite || !delayed_close_timeout_set)
Can you break this out for readability?
Signed-off-by: Andres Guedez <aguedez@google.com>
7c02143 to dd6d9ea (force-push)
I had to force push to fix missing DCO in the last commit. @alyssawilk apologies if this causes some review history issues.
mattklein123 left a comment
Thanks this is great. At a high level LGTM. Some small questions/comments.
/wait
docs/root/intro/version_history.rst
Outdated
* http: added :ref:`max request headers size <envoy_api_field_config.filter.network.http_connection_manager.v2.HttpConnectionManager.max_request_headers_kb>`. The default behaviour is unchanged.
* http: added modifyDecodingBuffer/modifyEncodingBuffer to allow modifying the buffered request/response data.
* http: added encodeComplete/decodeComplete. These are invoked at the end of the stream, after all data has been encoded/decoded respectively. Default implementation is a no-op.
* http: fixed a bug with the :ref:`delayed_close_timeout<envoy_api_field_config.filter.network.http_connection_manager.v2.HttpConnectionManager.delayed_close_timeout>` where it could trigger while actively flushing a pending write buffer for a downstream connection.
Can you merge master again and move this to 1.11.0?
// States associated with delayed closing of the connection (i.e., when the underlying socket is
// not immediately close()d as a result of a ConnectionImpl::close()).
enum class DelayedCloseState { None, CloseAfterFlush, CloseAfterFlushAndTimeout };
Can you add comments for each state? It's not immediately clear what the difference is between the latter 2.
// Disable the delayed close timer since data is still being flushed. The timer should only
// trigger after a delayedCloseTimeout() period of inactivity.
if (delayed_close_timer_ != nullptr) {
  delayed_close_timer_->disableTimer();
Is there any reason to disable the timer here? Won't we boost it below? (Mainly just wondering if we can simplify slightly but there might be a good reason to keep this disable here.)
I had missed that libevent's event_add() would reset an already active timer with the newly provided timeout value. As you point out, this is not necessary and I have removed it.
// Validate that a delayed close timer is already enabled unless it was disabled via
// configuration.
ASSERT(!delayed_close_timeout_set || delayed_close_timer_ != nullptr);
if (type == ConnectionCloseType::FlushWrite || !delayed_close_timeout_set) {
Out of curiosity, does this state change actually happen or is this preemptive? Just wondering if we can simplify until this becomes an actual issue? Can we just assert the type is the same?
This was just preemptive. I've changed the logic to a stronger ASSERT() and removed unnecessary tests.
My latest commit restores this logic. The reason is that after discussing with @alyssawilk offline, I now prefer maintaining Connection::close() API backwards compatibility with the existing behavior which allows ConnectionCloseType transitions between close() calls. This minimizes the (admittedly small) risk that an existing user of the API passing FlushWrite and FlushWriteAndDelay would break after this PR is merged. It also enforces consistent handling of type transitions by allowing all ConnectionCloseType transitions as opposed to special casing FlushWrite and FlushWriteAndDelay.
Another option, which would significantly simplify the close() logic, is to enforce that callers use the same type after the initial close() is issued on a Connection but this would break backwards compatibility and would have a much higher risk of breaking existing users.
Another option, which would significantly simplify the close() logic, is to enforce that callers use the same type after the initial close() is issued on a Connection but this would break backwards compatibility and would have a much higher risk of breaking existing users.
This would be my preference. Are there any existing users that actually do this? (There might be, and it might be a reasonable thing to try a graceful close followed by a forced close, I just don't remember).
This would be my preference. Are there any existing users that actually do this? (There might be, and it might be a reasonable thing to try a graceful close followed by a forced close, I just don't remember).
Based on a quick scan, I haven't found any filters that attempt a graceful close followed by a forced close (it tends to be one or the other, typically the latter is only used on error handling code paths prior to any other close()s being issued). However, this seems like a reasonable thing to support, and more importantly, the existing logic in the ConnectionImpl destructor forces support for X -> NoFlush transitions since a "just in case" close(NoFlush) is always attempted on destruction. This could be changed, but it seems completely reasonable to me given that leaking connections is much worse than not fully flushing the write buffer or avoiding the race that the delayed_close_timeout is mitigating.
So, the modified proposal would be to only allow state transitions ConnectionCloseType::* -> ConnectionCloseType::NoFlush after the first close() has been issued. This still achieves the goal of simplifying the logic while providing some defense against bugs.
@alyssawilk @mattklein123 I would prefer to create a second PR for changing the Connection::close() API to enforce the state transition constraints.
This PR should be safe to merge given the amount of review and test coverage added. What do you think about doing this in two steps and I can follow up with the API change after this PR is merged? I think we will end up with a cleaner history and more granular options for rolling back if the API change causes breakage.
…sh-race Signed-off-by: Andres Guedez <aguedez@google.com>
Signed-off-by: Andres Guedez <aguedez@google.com>
// is pending a flush of the write buffer. However, any progress made writing data to the socket
// will restart the timer associated with this timeout. This means that the total grace period for
// a socket in this state will be
// <delayed_close_timeout>+<total_time_waiting_for_write_buffer_flushes>.
Awesome, much more clear, thanks.
One last super nitty nit, I'd reverse the order since the flush happens first.
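A tiny arithmetic sketch (illustrative only, not Envoy code) of the grace period in the quoted comment: since each write that makes progress re-arms the timer, the socket closes one full timeout after the last write progress event, i.e. the total wait is the time spent flushing plus the delayed close timeout.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Given the (monotonically increasing) times at which writes made progress
// after close() was issued at t=0, return when the delayed close timer fires:
// timeout_ms after the last progress event (or after close() if there is none).
int64_t expectedCloseTimeMs(const std::vector<int64_t>& write_progress_times_ms,
                            int64_t timeout_ms) {
  int64_t last_activity_ms = 0; // the timer is first armed when close() is issued
  for (int64_t t : write_progress_times_ms) {
    last_activity_ms = t; // each progress event re-arms the timer
  }
  return last_activity_ms + timeout_ms;
}
```

For example, with a 1000 ms timeout and write progress at 400 ms and 900 ms, the socket is not closed until 1900 ms: 900 ms of flushing plus one full timeout interval.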
// been requested, start the delayed close timer if it hasn't been done already by a previous
// close().
if (!inDelayedClose()) {
  initializeDelayedCloseTimer();
I don't think we gracefully handle a caller doing
setDelayedCloseTimeout(timeout1)
close(ConnectionCloseType::FlushWriteAndDelay)
setDelayedCloseTimeout(timeout2)
close(ConnectionCloseType::FlushWriteAndDelay)
and I don't think we need to. However, do you think it's worth commenting that somewhere in the APIs and/or an assert in setDelayedCloseTimeout that you can't set-after-close or am I overthinking? Arguably doing anything after close is pretty sketchy but we are handling multiple close() calls for good reasons...
Added ASSERT() and comment.
// Create and activate a timer which will immediately close the connection if triggered.
// A config value of 0 disables the timeout.
if (delayed_close_timeout_set) {
optional (style thing): having sanity checked that we couldn't call initializeDelayedCloseTimer() with inDelayedClose() true, I wonder if it's worth putting this in an else {} block just to make it super clear which branch we are on.
I added a comment. I would prefer not to unnecessarily indent unless you think it makes a large readability difference.
// Validate that a delayed close timer is already enabled unless it was disabled via
// configuration.
ASSERT(!delayed_close_timeout_set || delayed_close_timer_ != nullptr);
// Validate that the same close type is used when multiple close()s are issued. An edge case
Do we make it clear in the class definition what transitions are allowed?
Done. Added a comment to the DelayedCloseState enum declaration.
ENVOY_CONN_LOG(debug, "setting delayed close timer with timeout {} ms", *this,
               delayedCloseTimeout().count());
delayed_close_timer_->enableTimer(delayedCloseTimeout());
initializeDelayedCloseTimer();
can't we still get here with data_to_write > 0? I thought if we did close(DelayedCloseState::CloseAfterFlushAndTimeout) we didn't want to arm the timer until the flush was complete?
The timer is always armed when a close(FlushWrite) or close(FlushWriteAndDelay) is issued. The only difference between the two close types is that the socket is immediately closed after the flush with the former, while the delayed_close_timeout_ is allowed to expire and trigger with the latter.
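That distinction can be restated as a tiny decision sketch; the helper below is hypothetical and only mirrors the two flush-based modes described in this thread, not Envoy's actual API:

```cpp
#include <cassert>

// Illustrative close types mirroring the flush-based ConnectionCloseType modes.
enum class CloseType { FlushWrite, FlushWriteAndDelay };

// Decide whether the socket should be closed now, given whether the write
// buffer has fully flushed and whether the delayed close timer has fired.
bool shouldCloseSocket(CloseType type, bool flushed, bool timer_fired) {
  if (timer_fired) {
    return true; // the inactivity timeout always closes the socket
  }
  // Only FlushWrite closes immediately once the flush completes;
  // FlushWriteAndDelay waits for the timer to expire even after the flush.
  return flushed && type == CloseType::FlushWrite;
}
```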
Recapping offline discussions for posterity, I missed the "I am redesigning this" email, Andres missed putting the new design in the description, and we are now untangled :-P
I will say that with the new plan I'm not convinced that this solves #6392 insofar as 20ms is pretty short. That said I think it solves an underlying problem worth solving and #6392 may simply need longer timeouts. Might be worth commenting somewhere in the APIs that to be useful this timeout needs to be O(1 max_rtt + libevent_loop_time) to avoid races. I'll look for places we can add more clarity.
Signed-off-by: Andres Guedez <aguedez@google.com>
alyssawilk left a comment
Thanks as always for your patience - I think we're almost there! Just a few nits from me, and I'll wait to hear back on Dan's libevent question
}

void ConnectionImpl::onDelayedCloseTimeout() {
  delayed_close_timer_.reset(nullptr);
I think we usually just use reset()
delayed_close_state_ = (type == ConnectionCloseType::FlushWrite)
                           ? DelayedCloseState::CloseAfterFlush
                           : DelayedCloseState::CloseAfterFlushAndTimeout;
} else {
I was all excited that I'd found a bug where we would let connections idle out forever but apparently that's WAI?
Can I ask that the API go from
// A value of 0 will completely disable delayed close processing, and the downstream connection's
// socket will be closed immediately after the write flush is completed.
to
// A value of 0 will completely disable delayed close processing, and the downstream connection's
// socket will be closed immediately after the write flush is completed and will never close if the write flush does not complete
maybe with some // .. attention:: flags or DANGER DANGER DANGER if you'd like to leak connections please set this to 0 :-P
Done. Added a .. WARNING::.
Oh also, please merge master for coverage - I would like to make sure we have all the branches covered
// Test that the delayed close timer is reset while write flushes are happening when a connection is
// in delayed close mode.
TEST_P(ConnectionImplTest, DelayedCloseTimerResetWithPendingWriteBufferFlushes) {
This test and the one above are pretty much the same other than EXPECT_CALL(*transport_socket, doWrite...). Can you extract the common code into a helper function or combine these 2 tests into one, i.e. EXPECT_CALL(... doWrite()) twice with different returns?
Collapsed into a single test. Simplifications made throughout this PR made the second test largely redundant as you point out.
// It is possible (though unlikely) for the connection to have already been closed during the
// write callback. This can happen if we manage to complete the SSL handshake in the write
// callback, raise a connected event, and close the connection.
closeSocket(ConnectionEvent::RemoteClose);
I just saw that closeSocket() disables the timer. Sorry for my ignorance...
…sh-race Signed-off-by: Andres Guedez <aguedez@google.com>
closeSocket(ConnectionEvent::LocalClose);
if (type == ConnectionCloseType::FlushWriteAndDelay && delayed_close_timeout_set) {
  // The socket is being closed and there is no more data to write. Since a delayed close has
After one more pass, is this comment correct? Arguably we can also get here where there is data to write, but we're in the midst of a tls/alts handshake and have determined that flushing is pointless (canFlushClose is false) because the write could not flush all data.
I think the canFlushClose() was added for the case where you have plaintext payload queued up behind an unfinished crypto handshake at which point the payload isn't going to get flushed and given old style options you should give up and close immediately.
That said, if there were some crypto protocol where there were large bidi frames, I can imagine the same race we have for HTTP where the FIN + RST lagged behind a large client side "no I am rejecting your handshake" write which is queued in the kernel and so we want a one interval delay for the client to get that response.
I think if we're in that case, the connection is going to observe the transport socket is "blocked", and the alarm will fire after one interval (handling any race) so the code is doing the right thing and the comment can just be tweaked a bit for clarity. Sound right?
Good catch, thanks for pointing this out. I have clarified the comment.
That said, if there were some crypto protocol where there were large bidi frames, I can imagine the same race we have for HTTP where the FIN + RST lagged behind a large client side "no I am rejecting your handshake" write which is queued in the kernel and so we want a one interval delay for the client to get that response.
I think if we're in that case, the connection is going to observe the transport socket is "blocked", and the alarm will fire after one interval (handling any race) so the code is doing the right thing and the comment can just be tweaked a bit for clarity. Sound right?
Yeah, I agree with this. I revisited the logic for the transport sockets that return canFlushClose() == false and at least for the TLS (SSL) socket, it doesn't seem right that canFlushClose() is conditional on the handshake completing. TLS alerts are transmitted during handshake failures and should be allowed to flush as well. I'll file an issue to follow up on this.
Signed-off-by: Andres Guedez <aguedez@google.com>
This reverts commit 11008bd and allows callers of the Connection::close() API to change the 'type' arg between 'FlushWrite' and 'FlushWriteAndDelay' when issuing multiple close() calls on the same connection. The new logic matches the existing behavior which allows any 'type' transition between 'close()' calls and eliminates the risk that changing the behavior of the API will break existing external uses (via third party filters) of the Connection API. Signed-off-by: Andres Guedez <aguedez@google.com>
mattklein123 left a comment
Thanks, looking great. 2 more comments for discussion.
/wait-any
// The socket will be closed after a grace period of delayed_close_timeout_ has elapsed after
// the socket is flushed _or_ if a period of inactivity after the last write event greater than
// or equal to delayed_close_timeout_ has elapsed.
CloseAfterFlushAndTimeout
nit: WDYT about calling this CloseAfterFlushAndDelay? Both of these states involve timeouts so IMO this is a little confusing.
I considered CloseAfterFlushAndDelay but ultimately decided against it because both CloseAfterFlush and CloseAfterFlushAndTimeout introduce a delay into close() processing by waiting for either the flush or the <flush+timer trigger> to happen.
I could rename to CloseAfterFlushAndWait if you think that's clearer than CloseAfterFlushAndTimeout.
Yeah I do think Wait is more clear than Timeout in this case if you don't mind.
Signed-off-by: Andres Guedez <aguedez@google.com>
…sh-race Signed-off-by: Andres Guedez <aguedez@google.com>
mattklein123 left a comment
Awesome, thanks. @danzh2010 any further comments?
Signed-off-by: Andres Guedez <aguedez@google.com>
…sh-race Signed-off-by: Andres Guedez <aguedez@google.com>
Signed-off-by: Andres Guedez <aguedez@google.com>
I merged master to resolve version history conflict. Thanks for the reviews everyone!
…r flush (envoyproxy#6437) Change the behavior of the delayed_close_timeout such that it won't trigger unless there has been at least a delayed_close_timeout period of inactivity after the last write event on the socket pending to be closed. This mitigates a race where a slow client and/or low timeout value would cause the socket to be closed while data was actively being written to the socket. Note that this change does not eliminate this race since a slow client could still be considered idle by the updated timeout logic, but this should be very rare when useful values (i.e., >1s to avoid the race condition on close that this timer addresses) are configured. Risk Level: Medium Testing: New unit tests added Docs Changes: Updated version history and HttpConnectionManager proto doc Fixes envoyproxy#6392 Signed-off-by: Andres Guedez <aguedez@google.com> Signed-off-by: Piotr Sikora <piotrsikora@google.com>
Change the behavior of the delayed_close_timeout such that it won't trigger unless there has been at least a delayed_close_timeout period of inactivity after the last write event on the socket pending to be closed.
This mitigates a race where a slow client and/or low timeout value would cause the socket to be closed while data was actively being written to the socket. Note that this change does not eliminate this race since a slow client could still be considered idle by the updated timeout logic, but this should be very rare when useful values (i.e., >1s to avoid the race condition on close that this timer addresses) are configured.
Risk Level: Medium
Testing: New unit tests added
Docs Changes: Updated version history and HttpConnectionManager proto doc
Fixes #6392