200ms delay receiving Consume batches #48
It really looks like the corking problem. Stracing both the node and the client showed the following:
So, we can see that we got the Continue message at .703935us and that we sent and flushed the data at .704266us, a difference of well under 1ms. So the node side is fine. However, from the client's side this looks like the following (adjust for the different timing):
Several things happen here:
Here's the interleaved output of the operations done on client and node:
Looking at this output alone may not be enough. I have a theory that could be confirmed or rejected by capturing the TCP packets.
So, this is a tricky situation, if I'm right. One way to solve it would be to turn on TCP_NODELAY and control the buffering at the application level. The second way, which would be non-portable (heh), is to keep the socket corked and flush it explicitly when needed.
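For reference, the two knobs look like this in C (a minimal sketch; the helper names are mine and error handling is elided):

```c
#include <netinet/in.h>
#include <netinet/tcp.h>
#include <sys/socket.h>

/* Option 1 (portable): disable Nagle's algorithm. Every write goes out
   immediately, so any batching has to happen at the application level. */
static int enable_nodelay(int fd)
{
    int on = 1;
    return setsockopt(fd, IPPROTO_TCP, TCP_NODELAY, &on, sizeof(on));
}

/* Option 2 (Linux-only, hence non-portable): cork the socket so only
   full frames leave, and uncork whenever pending data must go out now. */
static int set_cork(int fd, int corked)
{
    return setsockopt(fd, IPPROTO_TCP, TCP_CORK, &corked, sizeof(corked));
}
```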
I described the Nagle/delayed-ACK dilemma in sociomantic-tsunami/swarm#83 (comment): in short, Nagle's algorithm holds back the last partial segment until the data in flight is ACKed, while the receiver's delayed-ACK timer holds that ACK back for up to 200ms, so both sides end up waiting on each other. Do you think this might be happening here?
It was my hypothesis, but it proved wrong once I looked at the packet dump. OK, so capturing the traffic showed the following on the node side:
The hypothesis is the following:
Now, the entire question of the TCP cork is debatable. We used TCP_CORK before because we didn't want to do application-level buffering, and we wanted to prevent a packet from going out for every single record, which would lead to congestion on the network and to increased CPU usage in the clients (the last point was observable).

However, during the development of the library we started using application-level buffering anyway: all batch-based requests send nothing except full batches (and the final, partial batch). And on the other side, all the data we need to send immediately (like controller signals or updates from the DHT Mirror request) we flush out right away.

So it looks like we don't want the cork at all: for the requests where latency matters, we send the data immediately, and for batches of records, we do the buffering at the application level. One quick way to test this is to restart the node with TCP_NODELAY instead of TCP_CORK.
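A minimal sketch of that application-level batching, assuming a fixed batch size and eliding short-write handling (all names here are hypothetical):

```c
#include <string.h>
#include <sys/socket.h>

#define BATCH_SIZE 4096   /* hypothetical batch size */

struct batcher {
    int    fd;
    size_t used;
    char   buf[BATCH_SIZE];
};

/* Append one record; a full batch goes out in a single send(), so even
   with TCP_NODELAY on we don't emit one packet per record. Records
   larger than the batch are rejected to keep the sketch short. */
static int batch_put(struct batcher *b, const void *rec, size_t len)
{
    if (len > sizeof(b->buf))
        return -1;
    if (b->used + len > sizeof(b->buf)) {
        if (send(b->fd, b->buf, b->used, 0) < 0)
            return -1;
        b->used = 0;
    }
    memcpy(b->buf + b->used, rec, len);
    b->used += len;
    return 0;
}

/* The last, partial batch is flushed explicitly at end of stream. */
static int batch_flush(struct batcher *b)
{
    if (b->used > 0 && send(b->fd, b->buf, b->used, 0) < 0)
        return -1;
    b->used = 0;
    return 0;
}
```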
Now, I'm not sure we should simply go with NODELAY. The advantage, and at the same time the disadvantage, of the cork is that the kernel does the buffering for us. In contrast, NODELAY guarantees there's no delay while sending the data, but that also means we need to be very careful: writing the key and then the value to the socket will always result in two IP packets, unless both are sent within a single write call.
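One way around that, assuming both fields are available at the same time, is to gather them into a single syscall with writev(); a sketch (the helper name is mine):

```c
#include <sys/types.h>
#include <sys/uio.h>

/* With TCP_NODELAY set, write(key) followed by write(value) produces
   two segments. Gathering both into one writev() keeps them in a
   single segment, as long as the total fits into the MSS. */
static ssize_t write_record(int fd, const void *key, size_t klen,
                            const void *val, size_t vlen)
{
    struct iovec iov[2] = {
        { .iov_base = (void *) key, .iov_len = klen },
        { .iov_base = (void *) val, .iov_len = vlen },
    };
    return writev(fd, iov, 2);
}
```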
Interesting point. See the documentation on that. I do wonder if that's what we want in our case (this is said under the presumption that we still find our kernel-layer buffering useful).
Most requests don't write in that way. They'll write a single message containing all required fields.
Ok, the hypothesis that the cork is causing the problems looks promising.
We can see more than 200ms of pause between line 35 (the last read that reads a full packet) and line 36.
We can see two things here: first, the number of read calls didn't increase, so from the application's side this means the same number of context switches as the corked variant; second, there's no delay between receiving the last full packet and the last incomplete packet.
The experiments show that flushing by keeping the cork on and just setting TCP_NODELAY works too: setting TCP_NODELAY forces an explicit flush of the pending output even while TCP_CORK is set.
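A sketch of that trick, relying on the behaviour documented in tcp(7) (the helper name is mine):

```c
#include <netinet/in.h>
#include <netinet/tcp.h>
#include <sys/socket.h>

/* Per tcp(7), setting TCP_NODELAY forces an explicit flush of pending
   output, even if TCP_CORK is currently set. So the cork can stay in,
   and a one-shot TCP_NODELAY pushes out whatever is buffered. */
static int flush_corked(int fd)
{
    int on = 1;
    return setsockopt(fd, IPPROTO_TCP, TCP_NODELAY, &on, sizeof(on));
}
```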
The existing `flush` method relied on TCP_CORK being set: it would pull the cork out and immediately put it back in. However, this doesn't work, because putting the cork back in has to happen after all the packets have actually been sent; otherwise the last incomplete packet is delayed for the 200ms. Since we moved to explicit application-level buffering for the large data and to explicit flushing for the control messages, this flush was deprecated. See sociomantic-tsunami/dmqproto#48
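To make the failure mode concrete, here is the pattern as I understand it (a reconstruction for illustration, not the actual swarm code):

```c
#include <netinet/in.h>
#include <netinet/tcp.h>
#include <sys/socket.h>

/* The broken pattern: pull the cork out and put it straight back in.
   If the last partial frame hasn't left the host by the time the cork
   goes back in, it is held again and only escapes when the kernel's
   200ms cork ceiling expires. */
static void broken_flush(int fd)
{
    int off = 0, on = 1;
    setsockopt(fd, IPPROTO_TCP, TCP_CORK, &off, sizeof(off)); /* pull out */
    setsockopt(fd, IPPROTO_TCP, TCP_CORK, &on, sizeof(on));   /* put back */
}
```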
As per dmqproto#48, TCP_CORK-based flushing doesn't work in general; however, it's also no longer necessary, since we do explicit buffering via record batching in Consume, and we flush all control messages as soon as we send them. This makes TCP_NODELAY fit for this purpose: as soon as we write a control message it will be sent, and a batch will be sent as soon as sendmsg returns, so there's no need for explicit flushing. Fixes sociomantic-tsunami#48
This will be solved when the dmqnode moves to the new swarm, but perhaps we could change the node to always pass the nodelay option.
Until the dmqnode uses the new swarm, I think fixing it in the node is the way to go.
Testing indicates that there is a ~200ms delay between the client sending the continue signal and it receiving the next batch of records. This is unexpectedly slow, and indicates that a flush call is missing somewhere.