200ms delay receiving Consume batches #48

Open
gavin-norman-sociomantic opened this issue Feb 22, 2018 · 13 comments

@gavin-norman-sociomantic

Testing indicates that there is a ~200ms delay between the client sending the Continue signal and it receiving the next batch of records. This is unexpectedly slow, and suggests that a flush call is missing somewhere.

@nemanja-boric-sociomantic
Contributor

It really looks like the corking problem. Stracing both the node and the client showed the following:

  • The node sends the ~100kB of data, followed by an immediate cork flush:
15:20:03.703935 read(30, "\1\t\0\0\0\0\0\0\0\10\1\0\0\0\0\0\0\0\1", 65536) = 19 <0.000005>
15:20:03.704195 sendmsg(30, {msg_name(0)=NULL, msg_iov(5)=[{"\1\33\207\1\0\0\0\0\0\234", 10}, {"\1\0\0\0\0\0\0\0", 8}, {"\3", 1}, {"\n\207\1\0\0\0\0\0", 8}, {"xxx (data)"..., 100106}], msg_controllen=0, msg_flags=0}, MSG_NOSIGNAL) = 100133 <0.000032>
15:20:03.704249 setsockopt(30, SOL_TCP, TCP_CORK, [0], 4) = 0 <0.000005>
15:20:03.704266 setsockopt(30, SOL_TCP, TCP_CORK, [1], 4) = 0 <0.000005>

So, we can see that the node got the Continue message at .703935 and that it sent and flushed the data at .704266, a difference of less than 1ms. That side is all good.
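For reference, the flush seen in those two setsockopt calls is the usual cork toggle; a minimal C sketch of the pattern (purely illustrative, error handling omitted):

#include <netinet/in.h>   /* IPPROTO_TCP (strace prints it as SOL_TCP) */
#include <netinet/tcp.h>  /* TCP_CORK */
#include <sys/socket.h>   /* setsockopt */

/* "Flush" a corked socket: pulling the cork out pushes any pending
 * partial segment to the wire, putting it back in resumes the batching. */
static void cork_flush(int fd)
{
    int off = 0, on = 1;
    setsockopt(fd, IPPROTO_TCP, TCP_CORK, &off, sizeof(off)); /* uncork */
    setsockopt(fd, IPPROTO_TCP, TCP_CORK, &on, sizeof(on));   /* re-cork */
}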

However, from the client's side it looks like this (note that the timestamps are from a separate run, so the absolute times differ):

15:03:07.550422 sendmsg(819, {msg_name(0)=NULL, msg_iov(3)=[{"\1\t\0\0\0\0\0\0\0\10", 10}, {"\1\0\0\0\0\0\0\0", 8}, {"\1", 1}], msg_controllen=0, msg_flags=0}, MSG_NOSIGNAL) = 19 <0.000006>
15:03:07.550449 setsockopt(819, SOL_TCP, TCP_CORK, [0], 4) = 0 <0.000007>
15:03:07.550468 setsockopt(819, SOL_TCP, TCP_CORK, [1], 4) = 0 <0.000005>
15:03:07.551082 read(819, "\0\0\0l\0\0\0\0"..., 100141) = 1448 <0.000006>
15:03:07.551127 read(819, "\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 98693) = 4344 <0.000006>
15:03:07.551196 read(819, "9\\"..., 94349) = 1448 <0.000006>
15:03:07.551245 read(819, "\0\25\0"..., 92901) = 4344 <0.000006>
15:03:07.551305 read(819, "\0\252\\0\0"..., 88557) = 1448 <0.000005>
15:03:07.551435 read(819, "\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 87109) = 1448 <0.000005>
15:03:07.551561 read(819, "\1\0\0\0\0\0\0"..., 85661) = 1448 <0.000005>
15:03:07.551707 read(819, "0\0\0\0\0\0"..., 84213) = 1448 <0.000005>
15:03:07.551829 read(819, "\\0\0"..., 82765) = 1448 <0.000005>
15:03:07.551885 read(819, "\\0"..., 81317) = 7240 <0.000007>
15:03:07.551947 read(819, "\0\0\0"..., 74077) = 4344 <0.000006>
15:03:07.552009 read(819, "\316\0\0\0\0\0"..., 69733) = 1448 <0.000005>
15:03:07.552134 read(819, "\0E"..., 68285) = 4344 <0.000006>
15:03:07.552190 read(819, "\0\0\0\0009\t"..., 63941) = 1448 <0.000006>
15:03:07.552250 read(819, "7\30B\7"..., 62493) = 1448 <0.000005>
15:03:07.552314 read(819, "\"..., 61045) = 1448 <0.000005>
15:03:07.552429 read(819, "\0\00"..., 59597) = 1448 <0.000005>
15:03:07.552485 read(819, "0"..., 58149) = 4344 <0.000005>
15:03:07.552541 read(819, "\00\0\0"..., 53805) = 7240 <0.000006>
15:03:07.552589 read(819, "\0\0\\0\0\0\0\0\0"..., 46565) = 5792 <0.000006>
15:03:07.552644 read(819, "\0\l"..., 40773) = 5792 <0.000006>
15:03:07.552699 read(819, "\1\"..., 34981) = 7240 <0.000006>
15:03:07.552760 read(819, "\0\0\\0"..., 27741) = 4344 <0.000006>
15:03:07.552820 read(819, "\0\6\1\0\0\0"..., 23397) = 2896 <0.000006>
15:03:07.552883 read(819, "\0\0\0\0\0"..., 20501) = 1448 <0.000006>
15:03:07.552940 read(819, "\0\0\0\0\0\0\0"..., 19053) = 2896 <0.000005>
15:03:07.552996 read(819, "6"..., 16157) = 2896 <0.000006>
15:03:07.553052 read(819, "u16\1\0\0\0\0\0\0\0"..., 13261) = 2896 <0.000006>
15:03:07.553114 read(819, "\0\0\0\\0\0\0\0\0\0\0\1\0\0\0"..., 10365) = 1448 <0.000005>
15:03:07.553202 read(819, "\0\0\\0\0\0 "..., 8917) = 1448 <0.000006>
15:03:07.553258 read(819, "\2\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 7469) = 7240 <0.000006>
15:03:07.761955 read(819, "\0\0\0\0\0\0\1\0\0\0\0\0\0\0"..., 229) = 156 <0.000006>

Several things happen here:

  • The client sends the Continue message at .550422 and flushes it immediately at .550468.
  • The client receives the node's message fragmented. All fragments except the last arrive within 3ms (the last of these at .553258).
  • However, the last packet, the final piece of the batch, arrives ~200ms later (at .761955)!

@nemanja-boric-sociomantic
Contributor

Here's the interleaved output of the operations on the client (*C*) and the node (*S*):

*C* 15:43:11.514986 sendmsg(819, {msg_name(0)=NULL, msg_iov(3)=[{"\1\t\0\0\0\0\0\0\0\10", 10}, {"\1\0\0\0\0\0\0\0", 8}, {"\1", 1}], msg_controllen=0, msg_flags=0}, MSG_NOSIGNAL) = 19 <0.000007>
*C* 15:43:11.515018 setsockopt(819, SOL_TCP, TCP_CORK, [0], 4) = 0 <0.000009>
*S* 15:43:11.515039 read(30, "\1\t\0\0\0\0\0\0\0\10\1\0\0\0\0\0\0\0\1", 65536) = 19 <0.000009>
*C* 15:43:11.515044 setsockopt(819, SOL_TCP, TCP_CORK, [1], 4) = 0 <0.000005>
*S* 15:43:11.515320 sendmsg(30, {msg_name(0)=NULL, msg_iov(5)=[{"\1\371\206\1\0\0\0\0\0\177", 10}, {"\1\0\0\0\0\0\0\0", 8}, {"\3", 1}, {"\350\206\1\0\0\0\0\0", 8}, {"i\0xxxxxxxx\\|xxx"..., 100072}], msg_controllen=0, msg_flags=0}, MSG_NOSIGNAL) = 100099 <0.000065>
*S* 15:43:11.515408 setsockopt(30, SOL_TCP, TCP_CORK, [0], 4) = 0 <0.000019>
*S* 15:43:11.515441 setsockopt(30, SOL_TCP, TCP_CORK, [1], 4) = 0 <0.000004>
*C* 15:43:11.515967 read(819, "\1xxxxxi\0\0\0\0"..., 100141) = 43440 <0.000014>
*C* 15:43:11.516045 read(819, "xxxx0\0\0\0&H\315"..., 56701) = 7240 <0.000008>
*C* 15:43:11.516095 read(819, "\0\0xxxx\0\0\0\0\2\0\0\0\0\0\0\0"..., 49461) = 5792 <0.000007>
*C* 15:43:11.516147 read(819, "2<\xxxx0\0\0\0\0\0\0\0\0\0\0\0"..., 43669) = 5792 <0.000006>
*C* 15:43:11.516200 read(819, "\0\0xxxx\0Z\36"..., 37877) = 5792 <0.000006>
*C* 15:43:11.516253 read(819, "\0\0\xxxx\0\0\0\0\0\0\0"..., 32085) = 7240 <0.000007>
*C* 15:43:11.516308 read(819, "xxxx0\0\0\0"..., 24845) = 5792 <0.000006>
*C* 15:43:11.516360 read(819, "\0\0xxxxx\0\00035\21\0\0\0"..., 19053) = 5792 <0.000007>
*C* 15:43:11.516411 read(819, "\0xxx0"..., 13261) = 5792 <0.000007>
*C* 15:43:11.516467 read(819, "xxx\0\0\0\0\0"..., 7469) = 7240 <0.000007>
*C* 15:43:11.721723 read(819, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 229) = 187 <0.000010>
*S* 15:43:11.850780 read(30, "\1\t\0\0\0\0\0\0\0\10\1\0\0\0\0\0\0\0\1", 65536) = 19 <0.000005>

@nemanja-boric-sociomantic
Contributor

Looking at this output alone may not be enough. I have a theory that could be confirmed or rejected by capturing the TCP packets:

  1. By default, TCP_NODELAY is off on the node, so Nagle's algorithm is active. This means the node may wait up to ~200ms before sending a partial segment, especially if the peer uses delayed TCP ACKs: https://en.wikipedia.org/wiki/TCP_delayed_acknowledgment
  2. Flushing via TCP_CORK may not control this; at least I can't find any reference saying it has an effect here. My speculation is based on the fact that the data has already been handed to the TCP stack by sendmsg, and it is now the TCP stack's job to pass the packets around. That also matches the timings: sendmsg clearly returns immediately, and the back-and-forth happens afterwards, presumably at the kernel level.
  3. Now, my theory is that the consumer delays its ACK while it waits for the next packet, but the sender is waiting for that ACK before sending the packet.
  4. Eventually the ~200ms timer expires, the sender sends the packet anyway, and the receiver gets it and sends the ACK.

So, this is a tricky situation, if I'm right.

One way to solve this would be to turn on TCP_NODELAY and control the buffering with TCP_CORK: even with TCP_NODELAY on, setting the cork still holds the packets back, but there is no waiting for ACKs. However, this is problematic because of the Push protocol, for example. We could, however, turn this on just for the Consume sockets.

The second way, which is non-portable (heh), is to use TCP_QUICKACK on both sides (I think this particular situation can be fixed by applying it just on the consumer; I'll try it tomorrow):

TCP_QUICKACK (since Linux 2.4.4)
Enable quickack mode if set or disable quickack mode if cleared. In quickack mode, acks are sent immediately, rather than delayed if needed in accordance to normal TCP operation. This flag is not permanent, it only enables a switch to or from quickack mode. Subsequent operation of the TCP protocol will once again enter/leave quickack mode depending on internal protocol processing and factors such as delayed ack timeouts occurring and data transfer. This option should not be used in code intended to be portable.

https://linux.die.net/man/7/tcp
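A minimal C sketch of what applying this on the consumer side could look like (the loop and names are illustrative, not the actual client code). Note the caveat in the man page quote above: the flag is not permanent, so it has to be re-armed after reads:

#include <netinet/in.h>
#include <netinet/tcp.h>  /* TCP_QUICKACK, Linux-specific */
#include <sys/socket.h>
#include <unistd.h>

/* Ask the kernel to ACK immediately instead of delaying the ACK. */
static void enable_quickack(int fd)
{
    int on = 1;
    setsockopt(fd, IPPROTO_TCP, TCP_QUICKACK, &on, sizeof(on));
}

static void consume_loop(int fd, char *buf, size_t len)
{
    for (;;)
    {
        ssize_t n = read(fd, buf, len);
        if (n <= 0)
            break;
        enable_quickack(fd);  /* re-arm: the flag does not stick */
        /* ... hand the n received bytes to the batch parser ... */
    }
}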

@gavin-norman-sociomantic
Author

Trying TCP_QUICKACK sounds like a good idea.

@david-eckardt-sociomantic
Contributor

david-eckardt-sociomantic commented Feb 23, 2018

I described the Nagle/delayed-ACK dilemma in sociomantic-tsunami/swarm#83 (comment):

"
TCP delayed ACK can escalate Nagle's algorithm if only Alice sends data to Bob, and Bob never sends data:

  • Alice uses Nagle's algorithm which waits for an ACK from Bob before sending a partial segment.
  • Bob uses delayed ACK which will send that ACK only together with data.
  • Bob never sends data so he never sends an ACK so Alice never receives an ACK so Alice never sends the partial segment.

"

Do you think this might be happening here?

@nemanja-boric-sociomantic
Contributor

nemanja-boric-sociomantic commented Feb 23, 2018

Do you think this might be happening here?

That was my hypothesis, but it proved wrong when I looked at the packet dump after applying TCP_QUICKACK.

OK, so capturing the traffic showed the following on the node side:

  • All TCP packets that filled the MTU were sent immediately.
  • The client was sending ACKs immediately (probably because of TCP_QUICKACK).
  • Then the last packet, the one not filling the MTU, remains. The ACK for the preceding data was received at t0, but this packet was not sent until t0+200ms.

My hypothesis is the following:

  • To flush, we uncork and then re-cork the socket. This all finishes before the first packet has even gone out on the network.
  • Now the packets are being transferred. However, by this point TCP_CORK has been set again on the socket, so all the packets that fill the maximum segment size are sent immediately, but the last (partial) one is held back by the cork and only goes out after the 200ms timeout.

Now, the whole TCP cork approach is debatable. We used it before because we didn't want application-level buffering, and we wanted to prevent a packet going out for every single record, which would lead to network congestion and increased CPU usage in the clients (the latter was actually observable). However, during the development of the library we started using application-level buffering: batch-based requests don't send anything except full batches (and the final, partial batch). And on the other side, all the data we need to send immediately (such as controller signals or updates from the DHT Mirror request) we flush out explicitly. So it looks like we don't want the cork anyway: for the requests where latency is important we send the data immediately, and for batches of records we do the batching at the application level.

One quick way to test this is to restart the node with the no_delay=true config option. This disables Nagle's algorithm, but also disables the cork.

@nemanja-boric-sociomantic
Contributor

Now, I'm not sure whether we should go with the NODELAY or the QUICKACK option.

The advantage and the disadvantage of TCP_QUICKACK are both that Nagle's algorithm still applies. What this means is that if we do two writes, one for the key and one for the value, they will be held in the buffer and will probably be sent together in one TCP packet, as long as that happens quickly enough that the previous ACK hasn't arrived yet (so we have one round-trip time of implicit buffering). The disadvantage is that there's up to one round-trip of latency for any incomplete segment (which may not be that bad).

In contrast, NODELAY guarantees there's no delay when sending the data, but it also means we need to be very careful. Writing the key and then the value to the socket will always result in two IP packets, unless both are sent within a single writev call. This puts a burden on the requests to either properly schedule their real writes (the write is hidden behind the EventDispatcher abstraction) or to assemble everything they want to send and push it over the network in one go.
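To illustrate that last point: with TCP_NODELAY, the safe pattern is to hand the key and the value to the kernel in one gathered write so they can share a segment. A minimal C sketch (the function and parameter names are illustrative, not our actual protocol code):

#include <sys/types.h>
#include <sys/uio.h>  /* writev, struct iovec */

/* With TCP_NODELAY set, two separate write() calls for key and value
 * typically leave as two packets; a single writev() lets the kernel put
 * both into one segment (up to the MSS). */
static ssize_t send_record(int fd, const void *key, size_t key_len,
                           const void *val, size_t val_len)
{
    struct iovec iov[2];
    iov[0].iov_base = (void *)key;
    iov[0].iov_len  = key_len;
    iov[1].iov_base = (void *)val;
    iov[1].iov_len  = val_len;
    return writev(fd, iov, 2);
}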

@nemanja-boric-sociomantic
Contributor

nemanja-boric-sociomantic commented Feb 23, 2018

Interesting point. See the documentation about TCP_NODELAY in man 7 tcp:

If set, disable the Nagle algorithm. This means that segments are always sent as soon as possible even if there is only a small amount of data. When not set, data is buffered until there is a sufficient amount to send out, thereby avoiding the frequent sending of small packets, which results in poor utilization of the network. This option is overridden by TCP_CORK; however, setting this option forces an explicit flush of pending output, even if TCP_CORK is currently set.

I do wonder if that's what we want in our flush call: set the TCP_NODELAY option and leave the cork on:

void flush()
{
    this.socket.setsockOptVal(IPPROTO_TCP, TCP_NODELAY, true);
}

(This is under the presumption that we still find the kernel-level buffering useful.)
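At the syscall level that flush would amount to the sketch below (a C sketch of the behaviour quoted from the man page, assuming we keep the cork in place for kernel-side buffering):

#include <netinet/in.h>
#include <netinet/tcp.h>
#include <sys/socket.h>

/* Flush without touching the cork: per man 7 tcp, setting TCP_NODELAY
 * forces an explicit flush of pending output even while TCP_CORK is set,
 * so the partial segment goes out now and the socket stays corked for
 * subsequent writes. */
static void flush_corked(int fd)
{
    int on = 1;
    setsockopt(fd, IPPROTO_TCP, TCP_NODELAY, &on, sizeof(on));
}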

@gavin-norman-sociomantic
Author

What this means is that if we do two writes, one for the key and one for the value

Most requests don't write in that way. They'll write a single message containing all required fields.

@nemanja-boric-sociomantic
Contributor

OK, the "cork causing problems" hypothesis looks promising.

  1. Results with the TCP_CORK, flushing with cork on/off, as seen on the client side:
  1 14:39:36.141915 setsockopt(30, SOL_TCP, TCP_CORK, [0], 4) = 0 <0.000015>                                            
  2 14:39:36.141945 setsockopt(30, SOL_TCP, TCP_CORK, [1], 4) = 0 <0.000005>                                            
  3 14:39:36.146616 read(30, "xxx"..., 100035) = 2896 <0.000009>                                                        
  4 14:39:36.146768 read(30, "xxx"..., 97139) = 2896 <0.000006>                                                         
  5 14:39:36.147038 read(30, "xxx"..., 94243) = 2896 <0.000006>                                                         
  6 14:39:36.147250 read(30, "xxx"..., 91347) = 2896 <0.000006>                                                         
  7 14:39:36.147514 read(30, "xxx"..., 88451) = 2896 <0.000006>                                                         
  8 14:39:36.147740 read(30, "xxx"..., 85555) = 2896 <0.000006>                                                         
  9 14:39:36.148006 read(30, "xxx"..., 82659) = 2896 <0.000006>                                                         
 10 14:39:36.148231 read(30, "xxx"..., 79763) = 2896 <0.000006>                                                         
 11 14:39:36.148501 read(30, "xxx"..., 76867) = 2896 <0.000006>                                                         
 12 14:39:36.148724 read(30, "xxx"..., 73971) = 2896 <0.000005>                                                         
 13 14:39:36.148984 read(30, "xxx"..., 71075) = 2896 <0.000005>                                                         
 14 14:39:36.149218 read(30, "xxx"..., 68179) = 2896 <0.000005>                                                         
 15 14:39:36.149478 read(30, "xxx"..., 65283) = 2896 <0.000006>                                                         
 16 14:39:36.149707 read(30, "xxx"..., 62387) = 2896 <0.000005>                                                         
 17 14:39:36.149970 read(30, "xxx"..., 59491) = 2896 <0.000006>                                                         
 18 14:39:36.150197 read(30, "xxx"..., 56595) = 2896 <0.000006>                                                         
 19 14:39:36.150472 read(30, "xxx"..., 53699) = 2896 <0.000005>                                                         
 20 14:39:36.150694 read(30, "xxx"..., 50803) = 2896 <0.000006>                                                         
 21 14:39:36.150950 read(30, "xxx"..., 47907) = 2896 <0.000005>                                                         
 22 14:39:36.151182 read(30, "xxx"..., 45011) = 2896 <0.000006>                                                         
 23 14:39:36.151477 read(30, "xxx"..., 42115) = 2896 <0.000005>                                                         
 24 14:39:36.151683 read(30, "xxx"..., 39219) = 2896 <0.000006>                                                         
 25 14:39:36.151936 read(30, "xxx"..., 36323) = 2896 <0.000006>                                                         
 26 14:39:36.152162 read(30, "xxx"..., 33427) = 2896 <0.000005>                                                         
 27 14:39:36.152455 read(30, "xxx"..., 30531) = 2896 <0.000007>                                                         
 28 14:39:36.152677 read(30, "xxx"..., 27635) = 2896 <0.000007>                                                         
 29 14:39:36.152928 read(30, "xxx"..., 24739) = 2896 <0.000006>                                                         
 30 14:39:36.153176 read(30, "xxx"..., 21843) = 2896 <0.000005>                                                         
 31 14:39:36.153423 read(30, "xxx"..., 18947) = 2896 <0.000007>                                                         
 32 14:39:36.153670 read(30, "xxx"..., 16051) = 2896 <0.000010>                                                         
 33 14:39:36.154345 read(30, "xxx"..., 13155) = 8688 <0.000012>                                                         
 34 14:39:36.154692 read(30, "xxx"..., 4467) = 2896 <0.000009>                                                          
 35 14:39:36.154900 read(30, "xxx"..., 1571) = 1448 <0.000007>                                                          
 36 14:39:36.408643 read(30, "xxx"..., 123) = 123 <0.000029>                                                            
 37 14:39:36.566519 sendmsg(30, {msg_name(0)=NULL, msg_iov(3)=[{"\1\t\0\0\0\0\0\0\0\10", 10}, {"\1\0\0\0\0\0\0\0", 8}, {
    "\1", 1}], msg_controllen=0, msg_flags=0}, MSG_NOSIGNAL) = 19 <0.000010>                                            

We can see more than 200ms of pause between line 35 (the last read that returns a full packet) and line 36.

  2. Results without TCP_CORK, and with TCP_NODELAY set:
  1 14:51:11.513399 setsockopt(30, SOL_TCP, TCP_CORK, [0], 4) = 0 <0.000032>-                                           
  2 14:51:11.513446 setsockopt(30, SOL_TCP, TCP_CORK, [1], 4) = 0 <0.000006>-                                           
  3 14:51:11.513880 read(30, "xxx", 65536) = 19 <0.000008>-                                                             
  4 14:51:11.515058 read(30, "xxx"..., 65536) = 2896 <0.000008>-                                                        
  5 14:51:11.515321 read(30, "xxx"..., 97139) = 2896 <0.000007>-                                                        
  6 14:51:11.515571 read(30, "xxx"..., 94243) = 2896 <0.000007>-                                                        
  7 14:51:11.515799 read(30, "xxx"..., 91347) = 2896 <0.000006>-                                                        
  8 14:51:11.516059 read(30, "xxx"..., 88451) = 2896 <0.000006>-                                                        
  9 14:51:11.516322 read(30, "xxx"..., 85555) = 2896 <0.000007>-                                                        
 10 14:51:11.516553 read(30, "xxx"..., 82659) = 2896 <0.000006>-                                                        
 11 14:51:11.516780 read(30, "xxx"..., 79763) = 2896 <0.000006>-                                                        
 12 14:51:11.517042 read(30, "xxx"..., 76867) = 2896 <0.000006>-                                                        
 13 14:51:11.517272 read(30, "xxx"..., 73971) = 2896 <0.000007>-                                                        
 14 14:51:11.517799 read(30, "xxx"..., 71075) = 2896 <0.000006>-                                                        
 15 14:51:11.518024 read(30, "xxx"..., 68179) = 2896 <0.000006>-                                                        
 16 14:51:11.518295 read(30, "xxx"..., 65283) = 2896 <0.000006>-                                                        
 17 14:51:11.518517 read(30, "xxx"..., 62387) = 2896 <0.000006>-                                                        
 18 14:51:11.518778 read(30, "xxx"..., 59491) = 2896 <0.000006>-                                                        
 19 14:51:11.519007 read(30, "xxx"..., 56595) = 2896 <0.000006>-                                                        
 20 14:51:11.519269 read(30, "xxx"..., 53699) = 2896 <0.000006>-                                                        
 21 14:51:11.519500 read(30, "xxx"..., 50803) = 2896 <0.000006>-                                                        
 22 14:51:11.519768 read(30, "xxx"..., 47907) = 2896 <0.000006>-                                                        
 23 14:51:11.520002 read(30, "xxx"..., 45011) = 2896 <0.000006>-                                                        
 24 14:51:11.520297 read(30, "xxx"..., 42115) = 2896 <0.000011>-                                                        
 25 14:51:11.520508 read(30, "xxx"..., 39219) = 2896 <0.000009>-                                                        
 26 14:51:11.520801 read(30, "xxx"..., 36323) = 2896 <0.000016>-                                                        
 27 14:51:11.521018 read(30, "xxx"..., 33427) = 2896 <0.000009>-                                                        
 28 14:51:11.521258 read(30, "xxx"..., 30531) = 2896 <0.000008>-                                                        
 29 14:51:11.521480 read(30, "xxx"..., 27635) = 2896 <0.000007>-                                                        
 30 14:51:11.521730 read(30, "xxx"..., 24739) = 2896 <0.000006>-                                                        
 31 14:51:11.521956 read(30, "xxx"..., 21843) = 2896 <0.000006>-                                                        
 32 14:51:11.522243 read(30, "xxx"..., 18947) = 2896 <0.000009>-                                                        
 33 14:51:11.522460 read(30, "xxx"..., 16051) = 2896 <0.000007>-                                                        
 34 14:51:11.522744 read(30, "xxx"..., 13155) = 2896 <0.000009>-                                                        
 35 14:51:11.522985 read(30, "xxx"..., 10259) = 2896 <0.000013>-                                                        
 36 14:51:11.523229 read(30, "xxx"..., 7363) = 2896 <0.000013>-                                                         
 37 14:51:11.523428 read(30, "xxx"..., 4467) = 2896 <0.000005>-                                                         
 38 14:51:11.523701 read(30, "xxx"..., 1571) = 1571 <0.000006>-                                                         
 39 14:51:11.667995 sendmsg(30, {msg_name(0)=NULL, msg_iov(3)=[{"\1\t\0\0\0\0\0\0\0\10", 10}, {"\1\0\0\0\0\0\0\0", 8}, {
    "\1", 1}], msg_controllen=0, msg_flags=0}, MSG_NOSIGNAL) = 19 <0.000007>-                                           

We can see two things here: first, the number of read calls didn't increase, so from the application's side this is the same number of context switches as the corked variant; and second, there is no delay between receiving the last full packet and the final, incomplete one.

@nemanja-boric-sociomantic
Contributor

The experiment shows that flushing by keeping the cork on and just setting TCP_NODELAY works as described in the man page:

  1 15:04:42.957307 sendmsg(30, {msg_name(0)=NULL, msg_iov(3)=[{"\1\t\0\0\0\0\0\0\0\10", 10}, {"\1\0\0\0\0\0\0\0", 8}, {
    "\1", 1}], msg_controllen=0, msg_flags=0}, MSG_NOSIGNAL) = 19 <0.000009>                                            
  2 15:04:42.957342 setsockopt(30, SOL_TCP, TCP_CORK, [0], 4) = 0 <0.000021>                                            
  3 15:04:42.957376 setsockopt(30, SOL_TCP, TCP_CORK, [1], 4) = 0 <0.000005>                                            
  4 15:04:42.960363 read(30, "xxx"..., 100035) = 2896 <0.000025>                                                        
  5 15:04:42.960488 read(30, "xxx"..., 97139) = 2896 <0.000006>                                                         
  6 15:04:42.960724 read(30, "xxx"..., 94243) = 2896 <0.000006>                                                         
  7 15:04:42.960959 read(30, "xxx"..., 91347) = 2896 <0.000006>                                                         
  8 15:04:42.961212 read(30, "xxx"..., 88451) = 2896 <0.000006>                                                         
  9 15:04:42.961442 read(30, "xxx"..., 85555) = 2896 <0.000005>                                                         
 10 15:04:42.961704 read(30, "xxx"..., 82659) = 2896 <0.000006>                                                         
 11 15:04:42.961941 read(30, "xxx"..., 79763) = 2896 <0.000005>                                                         
 12 15:04:42.962205 read(30, "xxx"..., 76867) = 2896 <0.000007>                                                         
 13 15:04:42.962431 read(30, "xxx"..., 73971) = 2896 <0.000006>                                                         
 14 15:04:42.962696 read(30, "xxx"..., 71075) = 2896 <0.000007>                                                         
 15 15:04:42.962923 read(30, "xxx"..., 68179) = 2896 <0.000006>                                                         
 16 15:04:42.963185 read(30, "xxx"..., 65283) = 2896 <0.000006>                                                         
 17 15:04:42.963410 read(30, "xxx"..., 62387) = 2896 <0.000006>                                                         
 18 15:04:42.963686 read(30, "xxx"..., 59491) = 2896 <0.000008>                                                         
 19 15:04:42.963931 read(30, "xxx"..., 56595) = 8688 <0.000686>                                                         
 20 15:04:42.964704 read(30, "xxx"..., 47907) = 2896 <0.000008>                                                         
 21 15:04:42.964943 read(30, "xxx"..., 45011) = 2896 <0.000007>                                                         
 22 15:04:42.965146 read(30, "xxx"..., 42115) = 2896 <0.000005>                                                         
 23 15:04:42.965374 read(30, "xxx"..., 39219) = 2896 <0.000006>                                                         
 24 15:04:42.965636 read(30, "xxx"..., 36323) = 2896 <0.000006>                                                         
 25 15:04:42.965865 read(30, "xxx"..., 33427) = 2896 <0.000006>                                                         
 26 15:04:42.966134 read(30, "xxx"..., 30531) = 2896 <0.000006>                                                         
 27 15:04:42.966356 read(30, "xxx"..., 27635) = 2896 <0.000005>                                                         
 28 15:04:42.966618 read(30, "xxx"..., 24739) = 2896 <0.000005>                                                         
 29 15:04:42.966848 read(30, "xxx"..., 21843) = 2896 <0.000005>                                                         
 30 15:04:42.967122 read(30, "xxx"..., 18947) = 2896 <0.000006>                                                         
 31 15:04:42.967339 read(30, "xxx"..., 16051) = 2896 <0.000005>                                                         
 32 15:04:42.967604 read(30, "xxx"..., 13155) = 2896 <0.000005>                                                         
 33 15:04:42.967840 read(30, "xxx"..., 10259) = 2896 <0.000006>                                                         
 34 15:04:42.968112 read(30, "xxx"..., 7363) = 2896 <0.000007>                                                          
 35 15:04:42.968331 read(30, "xxx"..., 4467) = 2896 <0.000008>                                                          
 36 15:04:42.968611 read(30, "xxx"..., 1571) = 1571 <0.000008>                                                          
 37 15:04:43.145921 sendmsg(30, {msg_name(0)=NULL, msg_iov(3)=[{"\1\t\0\0\0\0\0\0\0\10", 10}, {"\1\0\0\0\0\0\0\0", 8}, {
    "\1", 1}], msg_controllen=0, msg_flags=0}, MSG_NOSIGNAL) = 19 <0.000008>                                            

nemanja-boric-sociomantic added a commit to nemanja-boric-sociomantic/swarm-1 that referenced this issue Feb 23, 2018
The existing `flush` method relied on the TCP_CORK being set and it would then
pull out and put the cork back in.  However this doesn't work, because putting
the cork back in had to be done after all the packets are actually sent,
otherwise the last incomplete packet will be delayed for the 200ms. Since we
moved to the explicit application buffering for the large data and to the
explicit flushing for the control messages this flush was deprecated.

See sociomantic-tsunami/dmqproto#48
nemanja-boric-sociomantic added a commit to nemanja-boric-sociomantic/dmqproto-1 that referenced this issue Feb 23, 2018
As per dmqproto#48, TCP_CORK based flushing doesn't work in general,
however it's also no longer necessary, since we're doing explicit
buffering via record batching in Consume, and for all the control
messages, we're anyway flushing as soon as we send it. This makes
TCP_NODELAY fit for this purpose: as soon as we write the control
message it will be sent, and the batch will be sent as soon as sendmsg
returns, so there's no need for explicit flushing.

Fixes sociomantic-tsunami#48
gavin-norman-sociomantic pushed a commit that referenced this issue Feb 26, 2018
As per dmqproto#48, TCP_CORK based flushing doesn't work in general,
however it's also no longer necessary, since we're doing explicit
buffering via record batching in Consume, and for all the control
messages, we're anyway flushing as soon as we send it. This makes
TCP_NODELAY fit for this purpose: as soon as we write the control
message it will be sent, and the batch will be sent as soon as sendmsg
returns, so there's no need for explicit flushing.

Fixes #48
nemanja-boric-sociomantic added a commit to nemanja-boric-sociomantic/swarm-1 that referenced this issue Feb 27, 2018
The existing `flush` method relied on the TCP_CORK being set and it would then
pull out and put the cork back in.  However this doesn't work, because putting
the cork back in had to be done after all the packets are actually sent,
otherwise the last incomplete packet will be delayed for the 200ms. Since we
moved to the explicit application buffering for the large data and to the
explicit flushing for the control messages this flush was deprecated.

See sociomantic-tsunami/dmqproto#48
gavin-norman-sociomantic pushed a commit to sociomantic-tsunami/swarm that referenced this issue Feb 27, 2018
The existing `flush` method relied on the TCP_CORK being set and it would then
pull out and put the cork back in.  However this doesn't work, because putting
the cork back in had to be done after all the packets are actually sent,
otherwise the last incomplete packet will be delayed for the 200ms. Since we
moved to the explicit application buffering for the large data and to the
explicit flushing for the control messages this flush was deprecated.

See sociomantic-tsunami/dmqproto#48
@nemanja-boric-sociomantic
Contributor

This will be solved when the dmqnode moves to the new swarm, but perhaps we could change the node to always pass true to the no_delay parameter of the NeoNode, and then we can close this one?

@david-eckardt-sociomantic
Contributor

Until the dmqnode uses the new swarm I think fixing it in the config.ini is sufficient, to avoid another deployment.
