feat(sdk): Skip Sliding Sync `Response` if it's been received already #2395

Hywan · 2023-08-09T15:47:15Z

A Sliding Sync v4::Response contains a pos value. It helps to identify progress during several requests/responses. If two requests with the same pos are sent, the server must reply with the same response, as defined in the specification.

The corollary is: If a response contains a pos that has already been received, the client must ignore it, otherwise it can create duplications. For example, let a response containing sync operations, like DELETE or INSERT, with indexes: if this pair of operations are repeated twice with the same index, it creates a broken state.

To avoid this behaviour, this patch stores the last 20 position markers received from the server. If a response contains a pos that has already been received, a (new) error is returned. As with all the other errors, the sync-loop is stopped, and must restarted. The past positions survive to this restart, as for the rest of the state.

When the server replies with a M_UNKNOWN_POS, the pos and the past_positions are all cleared.

⚠️ Note to reviewer:

What I am sure about. Checking that a pos has not been received must be done at the beginning of the handle_response method.
What I am not sure about. Right now, the pos and past_positions are saved when handle_response starts to execute. The rest of the code in this method can fail. If an error is raised after the pos has been saved, the next request will pretend everything was alright, and the server will reply with a different response. It can be a problem. For example, if a list has not been updated because an error happened before (e.g. with encryption handling, or with room response handling), the next list update will be broken because the previous one didn't happen.
I was thinking of saving pos at the end of the method, so that we are sure that everything was done correctly, however it creates new problem: if the error was raised by the update of the list, the next request will use the same pos, the server will reply with the same response, and list updates is likely to be broken if it has partly been updated.
👉 Actually, the place where pos and past_positions must be placed is unclear to me yet:
- Should we consider that errors emitted during handle_response (apart from the pos check) should lead to closing the sliding sync session?
- Should we make list.update infallible? Making handle_encryption, handle_room_response, and process_and_take_response the only fallible places (with pos and past_positions to be saved after those calls)?

A Sliding Sync `v4::Response` contains a `pos` value. It helps to identify progress during several requests/responses. If two requests with the same `pos` are sent, the server **must** reply with the same response, as defined in the specification. The corollary is: If a response contains a `pos` that has already been received, the client **must** ignore it, otherwise it can create duplications. For example, let a response containing sync operations, like `DELETE` or `INSERT`, with indexes: if this pair of operations are repeated twice with the same index, it creates a broken state. To avoid this behaviour, this patch stores the last 20 position markers received from the server. If a response contains a `pos` that has already been received, a (new) error is returned. As with all the other errors, the sync-loop is stopped, and must restarted. The past positions survive to this restart, as for the rest of the state. When the server replies with a `M_UNKNOWN_POS`, the `pos` and the `past_positions` are all cleared.

codecov · 2023-08-09T16:29:18Z

Codecov Report

Patch coverage: 100.00% and project coverage change: +0.04% 🎉

Comparison is base (80c9464) 77.78% compared to head (2b938e3) 77.82%.
Report is 23 commits behind head on main.

Additional details and impacted files

@@            Coverage Diff             @@
##             main    #2395      +/-   ##
==========================================
+ Coverage   77.78%   77.82%   +0.04%     
==========================================
  Files         175      175              
  Lines       18619    18631      +12     
==========================================
+ Hits        14482    14500      +18     
+ Misses       4137     4131       -6

Files Changed	Coverage Δ
crates/matrix-sdk/src/sliding_sync/builder.rs	`66.07% <100.00%> (+0.30%)`	⬆️
crates/matrix-sdk/src/sliding_sync/mod.rs	`90.94% <100.00%> (+0.40%)`	⬆️

... and 2 files with indirect coverage changes

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

Hywan · 2023-08-10T09:32:26Z

After a long discussion with @bnjbvr, we've decided to save the positions (position and past_positions) at the end of handle_response.

For the sake of clarity, this patch extracts some code into its own `SlidingSync::expire_session`.

bnjbvr

LGTM, thanks! (and thanks for adding the expire_session method that I'll be using in #2362)
Only minor comments below, so feel free to address and merge 🚀

crates/matrix-sdk/src/sliding_sync/builder.rs

bnjbvr · 2023-08-10T10:56:23Z

crates/matrix-sdk/src/sliding_sync/mod.rs

@@ -100,6 +101,9 @@ pub(super) struct SlidingSyncInner {
    /// Position markers
    position: StdRwLock<SlidingSyncPositionMarkers>,

+    /// Past position markers.
+    past_positions: StdRwLock<RingBuffer<SlidingSyncPositionMarkers>>,


Any reason to store the SlidingSyncPositionMarkers overall, rather than just the plain pos as Option<String>? delta_token isn't returned by the server (or it might even be the same value, which should semantically cause an error too but won't), and we're only comparing against the pos fields anyways in the iteration loop below.

I thought it was easier. We manipulate a single type. The API looks consistent. It also allows to rollback to some pos and delta_token if required, or to debug, if required. I don't have a strong opinion here.

There's just potential to shoot ourselves in the feet later: comparing two pairs of SlidingSyncPositionMarkers isn't the same as independently comparing its fields against two pairs of position markers. Again, delta_token might not be even provided (or it should be a constant), but since your code only reads the pos: Option<String> field, it seems simpler to store only that, until there's need for more?

crates/matrix-sdk/src/sliding_sync/mod.rs

Co-authored-by: Benjamin Bouvier <[email protected]>

…nish. Imagine the following scenario: A request $R_1$ is sent. A response $S_1$ is received and is being handled. In the meantime, the sync-loop is instructed to skip over any remaining work in its iteration and to jump to the next iteration. As a consequence, $S_1$ is detached, but continues to run. In the meantime, a new request $R_2$ starts. Since $S_1$ has _not_ finished to be handled, the `pos` isn't updated yet, and $R_2$ starts with the _same_ `pos` as $R_1$. The impacts are the following: 1. Since the `pos` is the same, even if some parameters are different, the server will reply with the same response. It's a waste of time and resources (incl. network). 2. Receiving the same response could have corrupt the state. It has been fixed in matrix-org#2395 though. Point 2 has been addressed, but point 1 remains to be addresed. This patch fixes point 1. How? It changes the `RwLock` around `SlidingSyncInner::position` to a `Mutex`. An `OwnedMutexGuard` is fetched by locking the mutex when the request is generated (i.e. when `pos` is read to be put in the new request). This `OwnedMutexGuard` is kept during the entire lifetime of the request extend to the response handling. It is dropped/released when the response has been fully handled, or if any error happens along the process. It means that it's impossible for a new request to be generated and to be sent if a request and response is running. It solves point 1 in case of successful response, otherwise the `pos` isn't updated because of an error.

The patch matrix-org#2395 has introduced `SlidingSyncInner::past_positions` as a mechanism to filter duplicated responses. It was a problem because the sliding sync `ops` could easily create corrupted states if they were applied more than once. Since matrix-org#3664, `ops` are ignored. Now, `past_positions` create a problem with the sliding sync native implementation inside Synapse because `pos` can stay the same between multiple responses. While `past_positions` was helpful to fix bugs in the past, it's no longer necessary today. Moreover, it breaks an invariant about `pos`: we must consider it as a blackbox. It means we must ignore if a `pos` value has been received in the past or not. This invariant has been broken for good reasons, but it now creates new issues. This patch removes `past_positions`, along with the associated code (like `Error::ResponseAlreadyReceived` for example).

The patch #2395 has introduced `SlidingSyncInner::past_positions` as a mechanism to filter duplicated responses. It was a problem because the sliding sync `ops` could easily create corrupted states if they were applied more than once. Since #3664, `ops` are ignored. Now, `past_positions` create a problem with the sliding sync native implementation inside Synapse because `pos` can stay the same between multiple responses. While `past_positions` was helpful to fix bugs in the past, it's no longer necessary today. Moreover, it breaks an invariant about `pos`: we must consider it as a blackbox. It means we must ignore if a `pos` value has been received in the past or not. This invariant has been broken for good reasons, but it now creates new issues. This patch removes `past_positions`, along with the associated code (like `Error::ResponseAlreadyReceived` for example).

Hywan requested a review from a team as a code owner August 9, 2023 15:47

Hywan requested review from jplatte and removed request for a team August 9, 2023 15:47

Hywan mentioned this pull request Aug 9, 2023

feat(sync service): don't intertwine the state of both sliding syncs #2362

Merged

test(ui): Adjust pos.

d20e823

bnjbvr self-requested a review August 9, 2023 16:43

feat(sdk): Save positions at the end of handle_response.

5d8b0c1

Hywan removed the request for review from jplatte August 10, 2023 10:04

feat(sdk): Create SlidingSync::expire_session.

8176902

For the sake of clarity, this patch extracts some code into its own `SlidingSync::expire_session`.

bnjbvr approved these changes Aug 10, 2023

View reviewed changes

doc(sdk): Fix a typo.

2b938e3

Co-authored-by: Benjamin Bouvier <[email protected]>

Hywan enabled auto-merge August 10, 2023 11:22

Hywan merged commit 0af918b into matrix-org:main Aug 10, 2023

Hywan mentioned this pull request Aug 10, 2023

Holes/gaps in room list due to duplicate room IDs from SS element-hq/element-x-ios#1157

Closed

Hywan mentioned this pull request Aug 17, 2023

fix(sdk): Preventing starting a new request if the previous didn't finish #2414

Merged

bmarty mentioned this pull request Aug 24, 2023

Room appears twice in room list element-hq/element-x-android#1005

Closed

Hywan mentioned this pull request Aug 12, 2024

fix(sdk): Remove SlidingSyncInner::past_positions #3833

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

feat(sdk): Skip Sliding Sync `Response` if it's been received already #2395

feat(sdk): Skip Sliding Sync `Response` if it's been received already #2395

Hywan commented Aug 9, 2023 •

edited

Loading

codecov bot commented Aug 9, 2023 •

edited

Loading

Hywan commented Aug 10, 2023

bnjbvr left a comment

bnjbvr Aug 10, 2023

Hywan Aug 10, 2023

bnjbvr Aug 10, 2023

feat(sdk): Skip Sliding Sync Response if it's been received already #2395

feat(sdk): Skip Sliding Sync Response if it's been received already #2395

Conversation

Hywan commented Aug 9, 2023 • edited Loading

codecov bot commented Aug 9, 2023 • edited Loading

Codecov Report

Hywan commented Aug 10, 2023

bnjbvr left a comment

Choose a reason for hiding this comment

bnjbvr Aug 10, 2023

Choose a reason for hiding this comment

Hywan Aug 10, 2023

Choose a reason for hiding this comment

bnjbvr Aug 10, 2023

Choose a reason for hiding this comment

feat(sdk): Skip Sliding Sync `Response` if it's been received already #2395

feat(sdk): Skip Sliding Sync `Response` if it's been received already #2395

Hywan commented Aug 9, 2023 •

edited

Loading

codecov bot commented Aug 9, 2023 •

edited

Loading