-
Notifications
You must be signed in to change notification settings - Fork 269
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
feat(sdk): Skip Sliding Sync Response
if it's been received already
#2395
Conversation
A Sliding Sync `v4::Response` contains a `pos` value. It helps to identify progress during several requests/responses. If two requests with the same `pos` are sent, the server **must** reply with the same response, as defined in the specification. The corollary is: If a response contains a `pos` that has already been received, the client **must** ignore it, otherwise it can create duplications. For example, let a response containing sync operations, like `DELETE` or `INSERT`, with indexes: if this pair of operations are repeated twice with the same index, it creates a broken state. To avoid this behaviour, this patch stores the last 20 position markers received from the server. If a response contains a `pos` that has already been received, a (new) error is returned. As with all the other errors, the sync-loop is stopped, and must restarted. The past positions survive to this restart, as for the rest of the state. When the server replies with a `M_UNKNOWN_POS`, the `pos` and the `past_positions` are all cleared.
Codecov ReportPatch coverage:
Additional details and impacted files
☔ View full report in Codecov by Sentry. |
After a long discussion with @bnjbvr, we've decided to save the positions ( |
For the sake of clarity, this patch extracts some code into its own `SlidingSync::expire_session`.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM, thanks! (and thanks for adding the expire_session
method that I'll be using in #2362)
Only minor comments below, so feel free to address and merge 🚀
@@ -100,6 +101,9 @@ pub(super) struct SlidingSyncInner { | |||
/// Position markers | |||
position: StdRwLock<SlidingSyncPositionMarkers>, | |||
|
|||
/// Past position markers. | |||
past_positions: StdRwLock<RingBuffer<SlidingSyncPositionMarkers>>, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Any reason to store the SlidingSyncPositionMarkers
overall, rather than just the plain pos
as Option<String>
? delta_token
isn't returned by the server (or it might even be the same value, which should semantically cause an error too but won't), and we're only comparing against the pos
fields anyways in the iteration loop below.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I thought it was easier. We manipulate a single type. The API looks consistent. It also allows to rollback to some pos
and delta_token
if required, or to debug, if required. I don't have a strong opinion here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There's just potential to shoot ourselves in the feet later: comparing two pairs of SlidingSyncPositionMarkers
isn't the same as independently comparing its fields against two pairs of position markers. Again, delta_token might not be even provided (or it should be a constant), but since your code only reads the pos: Option<String>
field, it seems simpler to store only that, until there's need for more?
Co-authored-by: Benjamin Bouvier <[email protected]>
…nish. Imagine the following scenario: A request $R_1$ is sent. A response $S_1$ is received and is being handled. In the meantime, the sync-loop is instructed to skip over any remaining work in its iteration and to jump to the next iteration. As a consequence, $S_1$ is detached, but continues to run. In the meantime, a new request $R_2$ starts. Since $S_1$ has _not_ finished to be handled, the `pos` isn't updated yet, and $R_2$ starts with the _same_ `pos` as $R_1$. The impacts are the following: 1. Since the `pos` is the same, even if some parameters are different, the server will reply with the same response. It's a waste of time and resources (incl. network). 2. Receiving the same response could have corrupt the state. It has been fixed in matrix-org#2395 though. Point 2 has been addressed, but point 1 remains to be addresed. This patch fixes point 1. How? It changes the `RwLock` around `SlidingSyncInner::position` to a `Mutex`. An `OwnedMutexGuard` is fetched by locking the mutex when the request is generated (i.e. when `pos` is read to be put in the new request). This `OwnedMutexGuard` is kept during the entire lifetime of the request extend to the response handling. It is dropped/released when the response has been fully handled, or if any error happens along the process. It means that it's impossible for a new request to be generated and to be sent if a request and response is running. It solves point 1 in case of successful response, otherwise the `pos` isn't updated because of an error.
The patch matrix-org#2395 has introduced `SlidingSyncInner::past_positions` as a mechanism to filter duplicated responses. It was a problem because the sliding sync `ops` could easily create corrupted states if they were applied more than once. Since matrix-org#3664, `ops` are ignored. Now, `past_positions` create a problem with the sliding sync native implementation inside Synapse because `pos` can stay the same between multiple responses. While `past_positions` was helpful to fix bugs in the past, it's no longer necessary today. Moreover, it breaks an invariant about `pos`: we must consider it as a blackbox. It means we must ignore if a `pos` value has been received in the past or not. This invariant has been broken for good reasons, but it now creates new issues. This patch removes `past_positions`, along with the associated code (like `Error::ResponseAlreadyReceived` for example).
The patch matrix-org#2395 has introduced `SlidingSyncInner::past_positions` as a mechanism to filter duplicated responses. It was a problem because the sliding sync `ops` could easily create corrupted states if they were applied more than once. Since matrix-org#3664, `ops` are ignored. Now, `past_positions` create a problem with the sliding sync native implementation inside Synapse because `pos` can stay the same between multiple responses. While `past_positions` was helpful to fix bugs in the past, it's no longer necessary today. Moreover, it breaks an invariant about `pos`: we must consider it as a blackbox. It means we must ignore if a `pos` value has been received in the past or not. This invariant has been broken for good reasons, but it now creates new issues. This patch removes `past_positions`, along with the associated code (like `Error::ResponseAlreadyReceived` for example).
The patch #2395 has introduced `SlidingSyncInner::past_positions` as a mechanism to filter duplicated responses. It was a problem because the sliding sync `ops` could easily create corrupted states if they were applied more than once. Since #3664, `ops` are ignored. Now, `past_positions` create a problem with the sliding sync native implementation inside Synapse because `pos` can stay the same between multiple responses. While `past_positions` was helpful to fix bugs in the past, it's no longer necessary today. Moreover, it breaks an invariant about `pos`: we must consider it as a blackbox. It means we must ignore if a `pos` value has been received in the past or not. This invariant has been broken for good reasons, but it now creates new issues. This patch removes `past_positions`, along with the associated code (like `Error::ResponseAlreadyReceived` for example).
The patch #2395 has introduced `SlidingSyncInner::past_positions` as a mechanism to filter duplicated responses. It was a problem because the sliding sync `ops` could easily create corrupted states if they were applied more than once. Since #3664, `ops` are ignored. Now, `past_positions` create a problem with the sliding sync native implementation inside Synapse because `pos` can stay the same between multiple responses. While `past_positions` was helpful to fix bugs in the past, it's no longer necessary today. Moreover, it breaks an invariant about `pos`: we must consider it as a blackbox. It means we must ignore if a `pos` value has been received in the past or not. This invariant has been broken for good reasons, but it now creates new issues. This patch removes `past_positions`, along with the associated code (like `Error::ResponseAlreadyReceived` for example).
A Sliding Sync
v4::Response
contains apos
value. It helps to identify progress during several requests/responses. If two requests with the samepos
are sent, the server must reply with the same response, as defined in the specification.The corollary is: If a response contains a
pos
that has already been received, the client must ignore it, otherwise it can create duplications. For example, let a response containing sync operations, likeDELETE
orINSERT
, with indexes: if this pair of operations are repeated twice with the same index, it creates a broken state.To avoid this behaviour, this patch stores the last 20 position markers received from the server. If a response contains a
pos
that has already been received, a (new) error is returned. As with all the other errors, the sync-loop is stopped, and must restarted. The past positions survive to this restart, as for the rest of the state.When the server replies with a
M_UNKNOWN_POS
, thepos
and thepast_positions
are all cleared.pos
has not been received must be done at the beginning of thehandle_response
method.pos
andpast_positions
are saved whenhandle_response
starts to execute. The rest of the code in this method can fail. If an error is raised after thepos
has been saved, the next request will pretend everything was alright, and the server will reply with a different response. It can be a problem. For example, if a list has not been updated because an error happened before (e.g. with encryption handling, or with room response handling), the next list update will be broken because the previous one didn't happen.I was thinking of saving
pos
at the end of the method, so that we are sure that everything was done correctly, however it creates new problem: if the error was raised by the update of the list, the next request will use the samepos
, the server will reply with the same response, and list updates is likely to be broken if it has partly been updated.👉 Actually, the place where
pos
andpast_positions
must be placed is unclear to me yet:handle_response
(apart from thepos
check) should lead to closing the sliding sync session?list.update
infallible? Makinghandle_encryption
,handle_room_response
, andprocess_and_take_response
the only fallible places (withpos
andpast_positions
to be saved after those calls)?