In that case, it will be added to the list of blockedWatchers and will be given another chance to deliver an event after all nonblocking watchers have sent the event.
All watchers that have failed to deliver the event will be closed.
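
The nonblocking delivery described above is essentially a channel send inside a `select` with a `default` branch. A minimal sketch of the idea (simplified; `tryDeliver` and the `chan int` watchers are illustrative stand-ins, not the actual cacheWatcher code):

```go
package main

import "fmt"

// tryDeliver attempts a nonblocking send on a watcher's incoming channel.
// Watchers whose channel is full are collected as blocked and given another
// chance later; those that still cannot accept the event are then closed.
func tryDeliver(ch chan int, ev int) bool {
	select {
	case ch <- ev:
		return true
	default: // buffer full: the watcher is blocked
		return false
	}
}

func main() {
	fast := make(chan int, 1)
	slow := make(chan int) // unbuffered with no reader: always blocked

	var blocked []chan int
	for _, w := range []chan int{fast, slow} {
		if !tryDeliver(w, 42) {
			blocked = append(blocked, w)
		}
	}
	fmt.Println(len(blocked)) // prints "1": only the slow watcher is blocked
}
```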
Closing the watchers would force the clients to retry the requests and download the entire dataset again, even though they might have already received a complete list.
To mitigate the issue, we propose the following before sending the initial events:
Compare the bookmarkAfterResourceVersion (from Step 2) with the current RV the watchCache is on
and wait until the difference between the RVs is < 1000 (the buffer size).
If the difference is greater than that, there is no point in proceeding, since the buffer could fill up before we receive an event with the expected RV.
This assumes that all updates would be for the resource the watch request was opened for (which seems unlikely).
If the watchCache is unable to catch up to the bookmarkAfterResourceVersion within some timeout, hard-close the current connection (i.e. end it by tearing down the underlying TCP connection with the client) so that the client re-connects to a different API server with a more up-to-date cache.
Taking into account the baseline etcd performance numbers, waiting for 10 seconds will allow us to receive ~5K events, assuming ~500 QPS throughput (see https://etcd.io/docs/v3.4/op-guide/performance/).
Once we are past this step (i.e. we know the difference is smaller) and the buffer fills up, we:
- case-1: won’t close the connection immediately if the bookmark event with the expected RV exists in the buffer.
In that case, we will deliver the initial events, then any other events we have received whose RVs are <= bookmarkAfterResourceVersion, and finally the bookmark event; only then will we soft-close the current connection (i.e. simply end it without tearing down the TCP connection).
An informer will reconnect with the RV from the bookmark event.
Note that any new events received in the meantime are ignored, since the buffer is full.
- case-2: soft-close the connection if, for some reason, the bookmark event with the expected RV doesn't exist in the buffer.
An informer will reconnect, arriving first at the step that compares the RVs.
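
The two cases amount to scanning the full buffer for the expected bookmark and deciding what to deliver before soft-closing. A minimal sketch under those assumptions (the `event` type and `dispatchBufferedEvents` are hypothetical names, not the real cacheWatcher API):

```go
package main

import "fmt"

// event is a simplified stand-in for a watch event held in the buffer.
type event struct {
	rv       uint64
	bookmark bool
}

// dispatchBufferedEvents decides what to send once the buffer is full.
// It returns the events to deliver (bookmark last) and whether the expected
// bookmark was found (case-1) or not (case-2). In both cases the caller
// soft-closes the connection afterwards.
func dispatchBufferedEvents(buffer []event, bookmarkAfterRV uint64) (toDeliver []event, foundBookmark bool) {
	for _, e := range buffer {
		if e.bookmark && e.rv == bookmarkAfterRV {
			foundBookmark = true
		}
	}
	if !foundBookmark {
		// case-2: no expected bookmark in the buffer; soft-close and let
		// the informer reconnect and re-compare the RVs.
		return nil, false
	}
	// case-1: deliver the events whose RVs are <= bookmarkAfterRV,
	// then the bookmark event itself, then soft-close.
	for _, e := range buffer {
		if e.rv <= bookmarkAfterRV && !(e.bookmark && e.rv == bookmarkAfterRV) {
			toDeliver = append(toDeliver, e)
		}
	}
	toDeliver = append(toDeliver, event{rv: bookmarkAfterRV, bookmark: true})
	return toDeliver, true
}

func main() {
	buf := []event{{rv: 101}, {rv: 102}, {rv: 103, bookmark: true}, {rv: 104}}
	events, ok := dispatchBufferedEvents(buf, 103)
	fmt.Println(ok, len(events)) // prints "true 3": 101, 102, then the bookmark
}
```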
In the future we could improve the way the buffer is managed:
- cap the buffer's size (e.g. never allocate more than X MB of memory)
- make the buffer dynamic, especially when the difference between the RVs is > 1000
- inject new events directly into the initial list, i.e. have the initial-list loop consume the channel directly instead of waiting for the whole initial list to be processed first
- maybe even apply some compression techniques to the buffer
Note: The RV is effectively a global counter that is incremented every time an object is updated.
This imposes a global order of events. It is equivalent to a LIST followed by a WATCH request.