Fix Blackholed Connection Behavior in DisruptableMockTransport #61310
It is not realistic to drop messages without eventually failing. To retain the coverage of long pauses, this PR adjusts the blackholed behavior to fail a send after 24h (which is assumed to be longer than any timeout in the system) instead of never. Closes elastic#61034
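A minimal sketch of the idea, assuming a simulated clock that tests can advance at will. `VirtualClockQueue` and `BlackholeSketch` are hypothetical names for illustration and are not the classes touched by this PR; the only detail taken from the description is the 24h failure delay.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.PriorityQueue;
import java.util.concurrent.TimeUnit;

/** Hypothetical stand-in for a deterministic, virtual-time task queue. */
class VirtualClockQueue {
    private static final class Task {
        final long atMillis;
        final long seq;
        final Runnable action;

        Task(long atMillis, long seq, Runnable action) {
            this.atMillis = atMillis;
            this.seq = seq;
            this.action = action;
        }
    }

    // Order by due time, then by insertion order so runs are reproducible.
    private final PriorityQueue<Task> tasks = new PriorityQueue<>(
        Comparator.comparingLong((Task t) -> t.atMillis).thenComparingLong(t -> t.seq));
    private long nowMillis;
    private long nextSeq;

    long now() {
        return nowMillis;
    }

    void schedule(long delayMillis, Runnable action) {
        tasks.add(new Task(nowMillis + delayMillis, nextSeq++, action));
    }

    /** Runs every task due at or before the given virtual time, advancing the clock. */
    void runUntil(long untilMillis) {
        while (!tasks.isEmpty() && tasks.peek().atMillis <= untilMillis) {
            Task task = tasks.poll();
            nowMillis = task.atMillis;
            task.action.run();
        }
        nowMillis = Math.max(nowMillis, untilMillis);
    }
}

class BlackholeSketch {
    // Assumed to exceed every other timeout in the simulated system.
    static final long BLACKHOLE_FAILURE_DELAY_MILLIS = TimeUnit.HOURS.toMillis(24);

    /** Nothing is delivered over a blackholed link, but the sender is eventually told. */
    static void sendOverBlackhole(VirtualClockQueue queue, String action, List<String> failures) {
        queue.schedule(BLACKHOLE_FAILURE_DELAY_MILLIS,
            () -> failures.add(action + " failed at t=" + queue.now() + "ms"));
    }
}
```

Because the clock is virtual, the 24h wait costs no wall-clock time in a test: advancing the queue past the deadline runs the failure callback immediately.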
Pinging @elastic/es-distributed (:Distributed/Cluster Coordination)
DaveCTurner
left a comment
I think we also need this for responses sent in the BLACK_HOLE and DISCONNECTED states.
Does this appreciably change the running time of the tests?
I wasn't sure about adding that. It seemed unnecessary, and we currently don't do anything on disconnected either and just handle it as we handle
Not really in my testing. I bet there's some degenerate case where it does :P but over 1k+ iterations it looks irrelevant so far.
@DaveCTurner thanks for taking a look, see here https://github.com/elastic/elasticsearch/blob/master/test/framework/src/main/java/org/elasticsearch/test/disruption/DisruptableMockTransport.java#L200 for the response handling; currently we do the same for black hole and disconnect there.
I think the removal of the join timeout will expose the same bug there, given enough iterations on CI. I.e. the join request gets through but then the connection is blackholed/disconnected before the response comes back, so it's never delivered. In reality the requester would drop the connection eventually thanks to keepalives.
I think the reason this wasn't and isn't an issue already is that we never black-hole while we have a subset of all runnable tasks at a given timestamp when we blackhole/disconnect a connection. So the send and respond cycle will always happen in one go and it's impossible that we blackhole between the send and respond, right? Also, I'm not sure it would change any behavior for the join (or any other part of the code if we were to throw on the response sending) because it's always code like this for the response:

    private JoinCallback transportJoinCallback(TransportRequest request, TransportChannel channel) {
        return new JoinCallback() {
            @Override
            public void onSuccess() {
                try {
                    channel.sendResponse(Empty.INSTANCE);
                } catch (IOException e) {
                    onFailure(e);
                }
            }
            @Override
            public void onFailure(Exception e) {
                try {
                    channel.sendResponse(e);
                } catch (Exception inner) {
                    inner.addSuppressed(e);
                    logger.warn("failed to send back failure on join request", inner);
                }
            }

where it's just logging as a result of a failed response send.

    @Override
    public void sendResponse(final TransportResponse response) {
        execute(new Runnable() {
            @Override
            public void run() {

to simulate some differences in timing when sending responses so we can't really throw to whatever code invoked
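To illustrate the point about deferred sends, here is a small sketch, assuming the send is queued and run later. `DeferredSendSketch` and its members are hypothetical and only mimic the shape of the snippet above. An exception thrown inside the queued `Runnable` surfaces where the queue is drained, not in the code that called `sendResponse`.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.List;
import java.util.Queue;

class DeferredSendSketch {
    final Queue<Runnable> pending = new ArrayDeque<>();
    final List<String> log = new ArrayList<>();

    /** Queues the send instead of performing it inline, so this call itself never throws. */
    void sendResponse(String response, boolean linkBroken) {
        pending.add(() -> {
            if (linkBroken) {
                throw new IllegalStateException("link broken while sending " + response);
            }
            log.add("delivered " + response);
        });
    }

    /** Drains the queue later; any failure shows up here, far from the original caller. */
    void runPending() {
        Runnable task;
        while ((task = pending.poll()) != null) {
            try {
                task.run();
            } catch (RuntimeException e) {
                log.add("late failure: " + e.getMessage());
            }
        }
    }
}
```

A `try`/`catch` around `sendResponse` in the responder therefore sees nothing, which is why any failure has to be routed to the requester by some other path.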
Yeah delivering an exception to the responder is kinda pointless, there's nothing it can do about it, but we should still deliver an exception response to the requester in those cases.
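One way to picture delivering the failure to the requester rather than the responder is the sketch below. It assumes the requester keeps a handler per request id; `PendingResponsesSketch` and all of its methods are hypothetical names, not the transport's actual API.

```java
import java.io.IOException;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Consumer;

class PendingResponsesSketch {
    /** Requester-side failure handlers keyed by request id (hypothetical registry). */
    private final Map<Long, Consumer<Exception>> failureHandlers = new HashMap<>();
    /** Failures waiting for the simulated deadline to pass. */
    private final List<Runnable> delayedFailures = new ArrayList<>();

    void register(long requestId, Consumer<Exception> onFailure) {
        failureHandlers.put(requestId, onFailure);
    }

    /** Called when the response for requestId is lost in transit; the responder is not notified. */
    void responseLost(long requestId) {
        delayedFailures.add(() -> {
            Consumer<Exception> handler = failureHandlers.remove(requestId);
            if (handler != null) {
                handler.accept(new IOException("response to request " + requestId + " was lost"));
            }
        });
    }

    /** Stands in for the simulated clock reaching the failure deadline. */
    void deadlineReached() {
        List<Runnable> due = new ArrayList<>(delayedFailures);
        delayedFailures.clear();
        due.forEach(Runnable::run);
    }
}
```

Removing the handler when the failure fires means a request is failed at most once, even if the deadline is processed again.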
Well currently this situation isn't a thing to begin with since we always
Discussed this sync:
@DaveCTurner alright, I think d0b3d1f should do it here right? (ran ~20k iterations of the coordinator tests with it without issues or excessive slowness)
urgh never mind, this needs a test adjustment now :) on it
@DaveCTurner sorry for the noise, should be good to review now :)
DaveCTurner
left a comment
One further request about blackholed-response behaviour. I may be persuaded to keep things as they are now tho.
      case DISCONNECTED:
    -     logger.trace("dropping response to {}: channel is {}", requestDescription, connectionStatus);
    +     logger.trace("disconnected during response to {}: channel is {}", requestDescription, connectionStatus);
    +     onDisconnectedDuringSend(requestId, action, destinationTransport);
Hmm I think I'd prefer a long delay on the response here using onBlackholedDuringSend too. We're using DISCONNECTED to indicate that the connection actively rejects the message, e.g. sends a RST, but if it rejects the response then the original requester is none the wiser and may wait for a long time before discovering the disconnect.
In practice it's almost never going to be that bad but I'd rather err on the pathological side if possible.
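The request above can be read as a small decision table, sketched here under stated assumptions. `ResponseFaultSketch`, the enum, and the method are hypothetical; the immediate failure for a rejected request is an assumption drawn from the description of DISCONNECTED as an active rejection, and the long delay reuses the 24h figure from the PR description.

```java
import java.util.concurrent.TimeUnit;

class ResponseFaultSketch {
    enum ConnectionStatus { CONNECTED, DISCONNECTED, BLACK_HOLE }

    // Assumed to exceed every other timeout in the simulated system.
    static final long LONG_DELAY_MILLIS = TimeUnit.HOURS.toMillis(24);

    /**
     * Simulated time until the original requester learns the outcome.
     * 0 means right away (normal delivery, or an actively rejected request).
     */
    static long millisUntilRequesterNotified(ConnectionStatus status, boolean isResponse) {
        switch (status) {
            case CONNECTED:
                return 0L;
            case BLACK_HOLE:
                // Silently dropped in either direction: only a long timeout reveals it.
                return LONG_DELAY_MILLIS;
            case DISCONNECTED:
                // A rejected request is visible to its sender at once; a rejected
                // response is visible only to the responder, so the requester waits.
                return isResponse ? LONG_DELAY_MILLIS : 0L;
            default:
                throw new IllegalArgumentException("unknown status: " + status);
        }
    }
}
```

Treating a disconnected response like a blackholed one errs toward the pathological case: the requester discovers the problem as late as possible.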
++ adjusted accordingly in b181920 for both spots
    -     logger.trace("dropping exception response to {}: channel is {}", requestDescription, connectionStatus);
    +     logger.trace("disconnected during exception response to {}: channel is {}",
    +         requestDescription, connectionStatus);
    +     onDisconnectedDuringSend(requestId, action, destinationTransport);
Similarly, we should delay notifying the sender here too.
DaveCTurner
left a comment
LGTM
Thanks David!
…ic#61310) It is not realistic to drop messages without eventually failing. To retain the coverage of long pauses this PR adjusts the blackholed behavior to fail a send after 24h (which is assumed to be longer than any timeout in the system) instead of never. Closes elastic#61034