Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat(sdk): Skip Sliding Sync Response if it's been received already #2395

Merged
merged 5 commits into from
Aug 10, 2023

Conversation

Hywan
Copy link
Member

@Hywan Hywan commented Aug 9, 2023

A Sliding Sync v4::Response contains a pos value. It helps to identify progress during several requests/responses. If two requests with the same pos are sent, the server must reply with the same response, as defined in the specification.

The corollary is: If a response contains a pos that has already been received, the client must ignore it, otherwise it can create duplications. For example, let a response containing sync operations, like DELETE or INSERT, with indexes: if this pair of operations are repeated twice with the same index, it creates a broken state.

To avoid this behaviour, this patch stores the last 20 position markers received from the server. If a response contains a pos that has already been received, a (new) error is returned. As with all the other errors, the sync-loop is stopped, and must restarted. The past positions survive to this restart, as for the rest of the state.

When the server replies with a M_UNKNOWN_POS, the pos and the past_positions are all cleared.

⚠️ Note to reviewer:

  • What I am sure about. Checking that a pos has not been received must be done at the beginning of the handle_response method.
  • What I am not sure about. Right now, the pos and past_positions are saved when handle_response starts to execute. The rest of the code in this method can fail. If an error is raised after the pos has been saved, the next request will pretend everything was alright, and the server will reply with a different response. It can be a problem. For example, if a list has not been updated because an error happened before (e.g. with encryption handling, or with room response handling), the next list update will be broken because the previous one didn't happen.
    I was thinking of saving pos at the end of the method, so that we are sure that everything was done correctly, however it creates new problem: if the error was raised by the update of the list, the next request will use the same pos, the server will reply with the same response, and list updates is likely to be broken if it has partly been updated.
    👉 Actually, the place where pos and past_positions must be placed is unclear to me yet:
    • Should we consider that errors emitted during handle_response (apart from the pos check) should lead to closing the sliding sync session?
    • Should we make list.update infallible? Making handle_encryption, handle_room_response, and process_and_take_response the only fallible places (with pos and past_positions to be saved after those calls)?

A Sliding Sync `v4::Response` contains a `pos` value. It helps to identify
progress during several requests/responses. If two requests with the
same `pos` are sent, the server **must** reply with the same response,
as defined in the specification.

The corollary is: If a response contains a `pos` that has already
been received, the client **must** ignore it, otherwise it can create
duplications. For example, let a response containing sync operations,
like `DELETE` or `INSERT`, with indexes: if this pair of operations are
repeated twice with the same index, it creates a broken state.

To avoid this behaviour, this patch stores the last 20 position markers
received from the server. If a response contains a `pos` that has
already been received, a (new) error is returned. As with all the other
errors, the sync-loop is stopped, and must restarted. The past positions
survive to this restart, as for the rest of the state.

When the server replies with a `M_UNKNOWN_POS`, the `pos` and the
`past_positions` are all cleared.
@Hywan Hywan requested a review from a team as a code owner August 9, 2023 15:47
@Hywan Hywan requested review from jplatte and removed request for a team August 9, 2023 15:47
@codecov
Copy link

codecov bot commented Aug 9, 2023

Codecov Report

Patch coverage: 100.00% and project coverage change: +0.04% 🎉

Comparison is base (80c9464) 77.78% compared to head (2b938e3) 77.82%.
Report is 23 commits behind head on main.

Additional details and impacted files
@@            Coverage Diff             @@
##             main    #2395      +/-   ##
==========================================
+ Coverage   77.78%   77.82%   +0.04%     
==========================================
  Files         175      175              
  Lines       18619    18631      +12     
==========================================
+ Hits        14482    14500      +18     
+ Misses       4137     4131       -6     
Files Changed Coverage Δ
crates/matrix-sdk/src/sliding_sync/builder.rs 66.07% <100.00%> (+0.30%) ⬆️
crates/matrix-sdk/src/sliding_sync/mod.rs 90.94% <100.00%> (+0.40%) ⬆️

... and 2 files with indirect coverage changes

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

@bnjbvr bnjbvr self-requested a review August 9, 2023 16:43
@Hywan
Copy link
Member Author

Hywan commented Aug 10, 2023

After a long discussion with @bnjbvr, we've decided to save the positions (position and past_positions) at the end of handle_response.

@Hywan Hywan removed the request for review from jplatte August 10, 2023 10:04
For the sake of clarity, this patch extracts some code into its own
`SlidingSync::expire_session`.
Copy link
Member

@bnjbvr bnjbvr left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, thanks! (and thanks for adding the expire_session method that I'll be using in #2362)
Only minor comments below, so feel free to address and merge 🚀

crates/matrix-sdk/src/sliding_sync/builder.rs Show resolved Hide resolved
@@ -100,6 +101,9 @@ pub(super) struct SlidingSyncInner {
/// Position markers
position: StdRwLock<SlidingSyncPositionMarkers>,

/// Past position markers.
past_positions: StdRwLock<RingBuffer<SlidingSyncPositionMarkers>>,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Any reason to store the SlidingSyncPositionMarkers overall, rather than just the plain pos as Option<String>? delta_token isn't returned by the server (or it might even be the same value, which should semantically cause an error too but won't), and we're only comparing against the pos fields anyways in the iteration loop below.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I thought it was easier. We manipulate a single type. The API looks consistent. It also allows to rollback to some pos and delta_token if required, or to debug, if required. I don't have a strong opinion here.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There's just potential to shoot ourselves in the feet later: comparing two pairs of SlidingSyncPositionMarkers isn't the same as independently comparing its fields against two pairs of position markers. Again, delta_token might not be even provided (or it should be a constant), but since your code only reads the pos: Option<String> field, it seems simpler to store only that, until there's need for more?

crates/matrix-sdk/src/sliding_sync/mod.rs Outdated Show resolved Hide resolved
Co-authored-by: Benjamin Bouvier <[email protected]>
@Hywan Hywan enabled auto-merge August 10, 2023 11:22
@Hywan Hywan merged commit 0af918b into matrix-org:main Aug 10, 2023
Hywan added a commit to Hywan/matrix-rust-sdk that referenced this pull request Aug 17, 2023
…nish.

Imagine the following scenario:

A request $R_1$ is sent. A response $S_1$ is received and is being
handled. In the meantime, the sync-loop is instructed to skip over any
remaining work in its iteration and to jump to the next iteration. As a
consequence, $S_1$ is detached, but continues to run. In the meantime, a
new request $R_2$ starts. Since $S_1$  has _not_ finished to be handled,
the `pos` isn't updated yet, and $R_2$ starts with the _same_ `pos`
as $R_1$.

The impacts are the following:

1. Since the `pos` is the same, even if some parameters are different,
   the server will reply with the same response. It's a waste of time
   and resources (incl. network).
2. Receiving the same response could have corrupt the state. It has been
   fixed in matrix-org#2395
   though.

Point 2 has been addressed, but point 1 remains to be addresed. This
patch fixes point 1.

How? It changes the `RwLock` around `SlidingSyncInner::position` to
a `Mutex`. An `OwnedMutexGuard` is fetched by locking the mutex when
the request is generated (i.e. when `pos` is read to be put in the new
request). This `OwnedMutexGuard` is kept during the entire lifetime
of the request extend to the response handling. It is dropped/released
when the response has been fully handled, or if any error happens along
the process.

It means that it's impossible for a new request to be generated and to
be sent if a request and response is running. It solves point 1 in case
of successful response, otherwise the `pos` isn't updated because of
an error.
Hywan added a commit to Hywan/matrix-rust-sdk that referenced this pull request Aug 12, 2024
The patch matrix-org#2395 has
introduced `SlidingSyncInner::past_positions` as a mechanism to filter
duplicated responses. It was a problem because the sliding sync `ops`
could easily create corrupted states if they were applied more than
once.

Since matrix-org#3664, `ops`
are ignored.

Now, `past_positions` create a problem with the sliding sync native
implementation inside Synapse because `pos` can stay the same between
multiple responses.

While `past_positions` was helpful to fix bugs in the past, it's no
longer necessary today. Moreover, it breaks an invariant about `pos`: we
must consider it as a blackbox. It means we must ignore if a `pos` value
has been received in the past or not. This invariant has been broken for
good reasons, but it now creates new issues.

This patch removes `past_positions`, along with the associated code
(like `Error::ResponseAlreadyReceived` for example).
Hywan added a commit to Hywan/matrix-rust-sdk that referenced this pull request Aug 12, 2024
The patch matrix-org#2395 has
introduced `SlidingSyncInner::past_positions` as a mechanism to filter
duplicated responses. It was a problem because the sliding sync `ops`
could easily create corrupted states if they were applied more than
once.

Since matrix-org#3664, `ops`
are ignored.

Now, `past_positions` create a problem with the sliding sync native
implementation inside Synapse because `pos` can stay the same between
multiple responses.

While `past_positions` was helpful to fix bugs in the past, it's no
longer necessary today. Moreover, it breaks an invariant about `pos`: we
must consider it as a blackbox. It means we must ignore if a `pos` value
has been received in the past or not. This invariant has been broken for
good reasons, but it now creates new issues.

This patch removes `past_positions`, along with the associated code
(like `Error::ResponseAlreadyReceived` for example).
Hywan added a commit that referenced this pull request Aug 12, 2024
The patch #2395 has
introduced `SlidingSyncInner::past_positions` as a mechanism to filter
duplicated responses. It was a problem because the sliding sync `ops`
could easily create corrupted states if they were applied more than
once.

Since #3664, `ops`
are ignored.

Now, `past_positions` create a problem with the sliding sync native
implementation inside Synapse because `pos` can stay the same between
multiple responses.

While `past_positions` was helpful to fix bugs in the past, it's no
longer necessary today. Moreover, it breaks an invariant about `pos`: we
must consider it as a blackbox. It means we must ignore if a `pos` value
has been received in the past or not. This invariant has been broken for
good reasons, but it now creates new issues.

This patch removes `past_positions`, along with the associated code
(like `Error::ResponseAlreadyReceived` for example).
bnjbvr pushed a commit that referenced this pull request Aug 12, 2024
The patch #2395 has
introduced `SlidingSyncInner::past_positions` as a mechanism to filter
duplicated responses. It was a problem because the sliding sync `ops`
could easily create corrupted states if they were applied more than
once.

Since #3664, `ops`
are ignored.

Now, `past_positions` create a problem with the sliding sync native
implementation inside Synapse because `pos` can stay the same between
multiple responses.

While `past_positions` was helpful to fix bugs in the past, it's no
longer necessary today. Moreover, it breaks an invariant about `pos`: we
must consider it as a blackbox. It means we must ignore if a `pos` value
has been received in the past or not. This invariant has been broken for
good reasons, but it now creates new issues.

This patch removes `past_positions`, along with the associated code
(like `Error::ResponseAlreadyReceived` for example).
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants