Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Sliding Sync: Handle room_subscriptions that increase the timeline_limit #17503

Closed
wants to merge 7 commits into from

Conversation

erikjohnston
Copy link
Member

EX will have a low timeline limit in the standard room list, but will then use room subscriptions to request more history for certain rooms (i.e. the rooms the client things are at the top of the room list).

So if we see a timeline limit being increased for a room, then we need to send down some historical events. This is done by ignoring any from token for the timeline section and setting limited=true

@erikjohnston erikjohnston marked this pull request as ready for review July 30, 2024 17:41
@erikjohnston erikjohnston requested a review from a team as a code owner July 30, 2024 17:41
Comment on lines +1599 to +1602
# If we're ignoring the timeline bound we *must* set limited to
# true, as otherwise the client will append the received events
# to the timeline, rather than replacing it.
limited = True
Copy link
Contributor

@MadLittleMods MadLittleMods Jul 30, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This seems dubious. If some client replaces everything on limited = True, I don't think that's the case for all clients.

For example, for an incremental sync if there are more events than the timeline_limit, it will be limited = True. But that just means there is a gap they should paginate and fill in with /messages.

limited doesn't mean the timeline should be replaced.

Even initial = True doesn't mean the timeline should be replaced.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, this is sort of an abuse of what limited means in sync v2, though generally the way clients deal with limited is to remove all existing messages from the timeline and only display the the new events (as clients don't generally render a "gap" between two timeline chunks).

What the SS proxy does is to set initial true, but not include unchanged required_state, which feels weird.

As I see it the options to deal with timeline trickling (in the short term) are:

  1. Set initial: true and include the full room data (including state)
  2. Do what the proxy does and set initial: true, but don't send down unchanged state
  3. Do what this PR does and set limited: true and don't send down unchanged state

I think the last one is the least worst. It's also worth noting that the client (mostly) opts into this behaviour by doing explicit room subscriptions with a differing timeline_limit

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(Side note: if we do do this PR we should update the docs for limited)

Copy link
Contributor

@MadLittleMods MadLittleMods Jul 31, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The pattern laid out is to use /messages. With the current Sliding Sync API, adjusting/repurposing fields/behaviors to mean other things is just trying to skirt the problem and adds tenuous intricacy to complexity.

If they're trying to get more timeline, one solution could be to do another initial sync with the timeline_limit that they want.

Clients should handle gaps though. Or at-least we shouldn't punish clients who do handle gaps.

For example, Hydrogen handles this with fragments and gap filling since it has offline support and doesn't throw away its work.

If we still think Sliding Sync should somehow address this, then I think we need to apply more thought to the API design.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just to be clear: this is implementing the behaviour that the sliding sync proxy already supports.

The way that the rust sdk currently starts is to:

  1. Do an initial request for top 20 rooms and timeline limit of 1. This gets a page of results fast.
  2. Do a second request for top 20 rooms (via room subs) with timeline of 20. This preloads the room lists for the top rooms
  3. Expand the range request-by-request, with timeline limit of 1.

I agree that there is something a bit confusing with allowing the client to change timeline limit, but TBF they are opting into this behaviour (to an extent).

The pattern laid out is to use /messages. With the current Sliding Sync API, adjusting/repurposing fields/behaviors to mean other things is just trying to skirt the problem and adds tenuous intricacy to complexity.

We would be tweaking limited to mean "these are the most recent N events and there may be a gap, if you've changed the timeline limit you may have already seen some of these events".

The alternative is to just use initial, which really does mean "throw away everything you have".

Clients should handle gaps though. Or at-least we shouldn't punish clients who do handle gaps.

For example, Hydrogen handles this with fragments and gap filling since it has offline support and doesn't throw away its work.

I don't think this really punishes clients that handle gaps that much, it means that if they change the timeline limit they need to handle the fact that they may have already seen a subset of the events that have been returned.

If we do just use initial, then it's up to the clients whether they want to just throw everything they have away or just use /messages

If we still think Sliding Sync should somehow address this, then I think we need to apply more thought to the API design.

We're not in a great position to change thing around right now, though I agree it should go on the list of things we should look into once we've reimplemented everything.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Doing a second initial sync requires changing the app, which we can probably do but really right now I want us to focus on getting parity with the sliding sync proxy.

Given an initial sync would cover the use case in an idiomatic way, I don't think we should introduce this convoluted behavior.

This is an active regression in behaviour and UX from the SS proxy. It's clear we need to do something here to allow the apps to continue to operate well.

To be clear, using limited: true is the wrong option here. It throws a wrench in what an initial vs incremental sync mean.

limited could be true in a legitimate scenario of too many messages but now we're also saying it can be true (along with a batch of timeline events) in a new random scenario of expanding timeline_limit which makes the downstream processing need to worry about how the request asked for things now.

Yes, we're subtly changing the meaning of limited here, though I disagree that it is that confusing. I did try and go through the experiment about how this works for offline-first clients, and the semantics seem to fit perfectly? The limited flag essentially becomes a flag as to whether the new chunk of events can be appended to the last timeline chunk, or whether it needs to be considered/handled as a new chunk of events (using the same logic they already have).

Additionally, the "opt-in" behavior you're talking about is also just another normal Sliding Sync request. Adjusting the ranges and timeline_limit is a normal thing to do that we don't need to introduce a foot-gun into.

It's really unclear to me what the semantics of increasing the timeline limit should be. Explicitly increasing the timeline limit very much feels like the app wants to get a bigger chunk of timeline down 🤷

I guess there is a time where this sort of happens implicitly, where we have two lists with different timeline limits, e.g. an "all-room" list with a limit of 1 and a "top-20" list with a limit of 20. We'd want to be careful to handle that sanely, though if an old room gets an update (which bumps it to the top-20 list) I don't think its insane for it to include the last 20 timeline events.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So to summarise my understanding here:

  • the proxy has a bug whereby timeline trickling (increasing the timeline limit for a room the client already knows about) sets initial: true when returning the larger timeline, but does NOT send required_state, which breaks the description of initial.
  • Erik is proposing to set limited: true instead as a better descriptor.
  • Eric thinks this is a footgun as it convolutes handling of gappy syncs, and counter-proposes sending required_state like the proxy should have technically been doing.
  • Erik is worried this is going to impact performance due to the increase in bytes-over-the-wire.

My opinion is that timeline trickling is awful and should be replaced with a bulk /messages endpoint. /sync is confusing enough to get right without having to think about historical state/messages. Far better to treat this endpoint as purely for real-time bare essential information, and defer to other APIs for more data. Some early sketches of other APIs (as well as use cases for the bulk endpoint) can be found in this document which is over 1 year old.

However, we do not live in this future. We live in the here and now, where we want to get SSS landed in Synapse ASAP and then iterate on any warts such as this. As a result, I don't really care which proposal we go with.

If neither of you can agree on the semantics, then might I suggest a tiebreak and literally just implement the behaviour of the proxy (that is initial without sending required_state iff the timeline_limit is changed for the room). That will definitely work with real existing clients (no client changes needed, no additional bandwidth increase) since the proxy is the only complete impl of SS. That feels no less warty than the other 2 suggested proposals here.

Failing that, I would lean towards Erik's limited: true on the basis that it is the most pragmatic solution here which won't negatively affect performance: which is literally the entire point of this API.

Copy link
Contributor

@MadLittleMods MadLittleMods Aug 15, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's really nice to see the candid agreement on how bizarre this behavior is and plans for a better future!

⭐ My actual proposal is that the client can use an initial sync request and ask for timeline_limit: 20 and required_state: [] which means barely any extra bytes over the wire (just the small meta data like room name). We can abuse the sliding sync endpoint to bulk fetch messages in a room and we don't need to introduce any of this convoluted behavior even in the interim.

Copy link
Contributor

@MadLittleMods MadLittleMods Aug 15, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

From the weekly meeting, it seems like the only downside to my suggestion is that it requires client-side changes. The ElementX/Rust client team is out today so we aren't able to see whether this is a big deal or not.

Besides my suggestion, the better of the proposed options if we really want to push this through: use initial: true without sending required_state (match the proxy). This option optimizes for no client changes necessary which is the only reason to not use my suggestion, and is at-least some sort of approximation of what initial: true actually means which seems better than messing with the semantics of limited. While I disagree with moving forward with this, it is amenable to more forward so long as this doesn't suffer from the bump once problem.


Besides the existing things already discussed in this discussion thread, we also went over the possibility of adding a completely new flag to accurately describe and trigger this behavior or even a new field like initial_timeline that shows up whenever the timeline_limit is bumped. It's just more ugly workarounds that require client-side changes anyway but it's something that doesn't mess with what we have.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Conversation continued in #17579 (comment)

# If we're ignoring the timeline bound we *must* set limited to
# true, as otherwise the client will append the received events
# to the timeline, rather than replacing it.
limited = True
Copy link
Contributor

@MadLittleMods MadLittleMods Jul 30, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why isn't the client expanding their room timeline by paginating /messages using the prev_batch token given in the response?

That seems just as good if they're adding room subscriptions just to get more timeline events.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, so this is really a (bit) of an abuse of the SS API so the client can avoid doing e.g. 20 pagination requests simultaneously.

Comment on lines +1598 to +1602
if ignore_timeline_bound:
# If we're ignoring the timeline bound we *must* set limited to
# true, as otherwise the client will append the received events
# to the timeline, rather than replacing it.
limited = True
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

One edge case this doesn't handle is when there are only a few events in the room total. This will claim limited = True even if the timeline_limit fetches all events in the room.

Comment on lines +1513 to +1520
if room_status.timeline_limit is not None and (
room_status.timeline_limit < room_sync_config.timeline_limit
):
# If the timeline limit has been increased since previous
# requests then we treat it as request for more events. We do
# this by sending down a `limited` sync, ignoring the from
# bound.
ignore_timeline_bound = True
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

One edge case we thought of during our weekly meeting, is that it seems like it's only possible to bump the timeline_limit once to get a batch of timeline messages. So if you initially request everything with timeline_limit: 1 and then timeline_limit: 20 to get a bunch of timeline messages, requesting with timeline_limit: 20 again won't give you a batch of timeline again. You would have to request timeline_limit: 21, etc into infinity every time you want to fetch the initial batch of timeline.

This comes into play even with ElementX because it grows and shrinks it's range over time and will want to fetch timeline for rooms that comes into view again.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this is actually covered in the PR, and I need to add tests, but: we when the timeline limit is reduced we update the timeline limit we stored, so we do correctly handle increase -> decrease -> increase.

@erikjohnston
Copy link
Member Author

I'm going to close this PR in favour of: #17579, there's been a lot of changes to the underlying code, so I thought it'd make sense to start from fresh

erikjohnston added a commit that referenced this pull request Aug 20, 2024
This supersedes #17503, given the per-connection state is being heavily
rewritten it felt easier to recreate the PR on top of that work.

This correctly handles the case of timeline limits going up and down.

This does not handle changes in `required_state`, but that can be done
as a separate PR.

Based on #17575.

---------

Co-authored-by: Eric Eastwood <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants