-
Notifications
You must be signed in to change notification settings - Fork 187
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Sliding Sync: Handle room_subscriptions that increase the timeline_limit
#17503
Conversation
# If we're ignoring the timeline bound we *must* set limited to | ||
# true, as otherwise the client will append the received events | ||
# to the timeline, rather than replacing it. | ||
limited = True |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This seems dubious. If some client replaces everything on limited = True
, I don't think that's the case for all clients.
For example, for an incremental sync if there are more events than the timeline_limit
, it will be limited = True
. But that just means there is a gap they should paginate and fill in with /messages
.
limited
doesn't mean the timeline
should be replaced.
Even initial = True
doesn't mean the timeline
should be replaced.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, this is sort of an abuse of what limited
means in sync v2, though generally the way clients deal with limited
is to remove all existing messages from the timeline and only display the the new events (as clients don't generally render a "gap" between two timeline chunks).
What the SS proxy does is to set initial
true, but not include unchanged required_state
, which feels weird.
As I see it the options to deal with timeline trickling (in the short term) are:
- Set
initial: true
and include the full room data (including state) - Do what the proxy does and set
initial: true
, but don't send down unchanged state - Do what this PR does and set
limited: true
and don't send down unchanged state
I think the last one is the least worst. It's also worth noting that the client (mostly) opts into this behaviour by doing explicit room subscriptions with a differing timeline_limit
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
(Side note: if we do do this PR we should update the docs for limited
)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The pattern laid out is to use /messages
. With the current Sliding Sync API, adjusting/repurposing fields/behaviors to mean other things is just trying to skirt the problem and adds tenuous intricacy to complexity.
If they're trying to get more timeline
, one solution could be to do another initial sync with the timeline_limit
that they want.
Clients should handle gaps though. Or at-least we shouldn't punish clients who do handle gaps.
For example, Hydrogen handles this with fragments and gap filling since it has offline support and doesn't throw away its work.
If we still think Sliding Sync should somehow address this, then I think we need to apply more thought to the API design.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just to be clear: this is implementing the behaviour that the sliding sync proxy already supports.
The way that the rust sdk currently starts is to:
- Do an initial request for top 20 rooms and timeline limit of 1. This gets a page of results fast.
- Do a second request for top 20 rooms (via room subs) with timeline of 20. This preloads the room lists for the top rooms
- Expand the range request-by-request, with timeline limit of 1.
I agree that there is something a bit confusing with allowing the client to change timeline limit, but TBF they are opting into this behaviour (to an extent).
The pattern laid out is to use
/messages
. With the current Sliding Sync API, adjusting/repurposing fields/behaviors to mean other things is just trying to skirt the problem and adds tenuous intricacy to complexity.
We would be tweaking limited
to mean "these are the most recent N events and there may be a gap, if you've changed the timeline limit you may have already seen some of these events".
The alternative is to just use initial
, which really does mean "throw away everything you have".
Clients should handle gaps though. Or at-least we shouldn't punish clients who do handle gaps.
For example, Hydrogen handles this with fragments and gap filling since it has offline support and doesn't throw away its work.
I don't think this really punishes clients that handle gaps that much, it means that if they change the timeline limit they need to handle the fact that they may have already seen a subset of the events that have been returned.
If we do just use initial
, then it's up to the clients whether they want to just throw everything they have away or just use /messages
If we still think Sliding Sync should somehow address this, then I think we need to apply more thought to the API design.
We're not in a great position to change thing around right now, though I agree it should go on the list of things we should look into once we've reimplemented everything.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Doing a second initial sync requires changing the app, which we can probably do but really right now I want us to focus on getting parity with the sliding sync proxy.
Given an initial sync would cover the use case in an idiomatic way, I don't think we should introduce this convoluted behavior.
This is an active regression in behaviour and UX from the SS proxy. It's clear we need to do something here to allow the apps to continue to operate well.
To be clear, using
limited: true
is the wrong option here. It throws a wrench in what an initial vs incremental sync mean.
limited
could betrue
in a legitimate scenario of too many messages but now we're also saying it can betrue
(along with a batch oftimeline
events) in a new random scenario of expandingtimeline_limit
which makes the downstream processing need to worry about how the request asked for things now.
Yes, we're subtly changing the meaning of limited
here, though I disagree that it is that confusing. I did try and go through the experiment about how this works for offline-first clients, and the semantics seem to fit perfectly? The limited
flag essentially becomes a flag as to whether the new chunk of events can be appended to the last timeline chunk, or whether it needs to be considered/handled as a new chunk of events (using the same logic they already have).
Additionally, the "opt-in" behavior you're talking about is also just another normal Sliding Sync request. Adjusting the
ranges
andtimeline_limit
is a normal thing to do that we don't need to introduce a foot-gun into.
It's really unclear to me what the semantics of increasing the timeline limit should be. Explicitly increasing the timeline limit very much feels like the app wants to get a bigger chunk of timeline down 🤷
I guess there is a time where this sort of happens implicitly, where we have two lists with different timeline limits, e.g. an "all-room" list with a limit of 1 and a "top-20" list with a limit of 20. We'd want to be careful to handle that sanely, though if an old room gets an update (which bumps it to the top-20 list) I don't think its insane for it to include the last 20 timeline events.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So to summarise my understanding here:
- the proxy has a bug whereby timeline trickling (increasing the timeline limit for a room the client already knows about) sets
initial: true
when returning the larger timeline, but does NOT sendrequired_state
, which breaks the description ofinitial
. - Erik is proposing to set
limited: true
instead as a better descriptor. - Eric thinks this is a footgun as it convolutes handling of gappy syncs, and counter-proposes sending
required_state
like the proxy should have technically been doing. - Erik is worried this is going to impact performance due to the increase in bytes-over-the-wire.
My opinion is that timeline trickling is awful and should be replaced with a bulk /messages
endpoint. /sync
is confusing enough to get right without having to think about historical state/messages. Far better to treat this endpoint as purely for real-time bare essential information, and defer to other APIs for more data. Some early sketches of other APIs (as well as use cases for the bulk endpoint) can be found in this document which is over 1 year old.
However, we do not live in this future. We live in the here and now, where we want to get SSS landed in Synapse ASAP and then iterate on any warts such as this. As a result, I don't really care which proposal we go with.
If neither of you can agree on the semantics, then might I suggest a tiebreak and literally just implement the behaviour of the proxy (that is initial
without sending required_state
iff the timeline_limit
is changed for the room). That will definitely work with real existing clients (no client changes needed, no additional bandwidth increase) since the proxy is the only complete impl of SS. That feels no less warty than the other 2 suggested proposals here.
Failing that, I would lean towards Erik's limited: true
on the basis that it is the most pragmatic solution here which won't negatively affect performance: which is literally the entire point of this API.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's really nice to see the candid agreement on how bizarre this behavior is and plans for a better future!
⭐ My actual proposal is that the client can use an initial sync request and ask for timeline_limit: 20
and required_state: []
which means barely any extra bytes over the wire (just the small meta data like room name). We can abuse the sliding sync endpoint to bulk fetch messages in a room and we don't need to introduce any of this convoluted behavior even in the interim.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
From the weekly meeting, it seems like the only downside to my suggestion is that it requires client-side changes. The ElementX/Rust client team is out today so we aren't able to see whether this is a big deal or not.
Besides my suggestion, the better of the proposed options if we really want to push this through: use initial: true
without sending required_state
(match the proxy). This option optimizes for no client changes necessary which is the only reason to not use my suggestion, and is at-least some sort of approximation of what initial: true
actually means which seems better than messing with the semantics of limited
. While I disagree with moving forward with this, it is amenable to more forward so long as this doesn't suffer from the bump once problem.
Besides the existing things already discussed in this discussion thread, we also went over the possibility of adding a completely new flag to accurately describe and trigger this behavior or even a new field like initial_timeline
that shows up whenever the timeline_limit
is bumped. It's just more ugly workarounds that require client-side changes anyway but it's something that doesn't mess with what we have.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Conversation continued in #17579 (comment)
# If we're ignoring the timeline bound we *must* set limited to | ||
# true, as otherwise the client will append the received events | ||
# to the timeline, rather than replacing it. | ||
limited = True |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why isn't the client expanding their room timeline by paginating /messages
using the prev_batch
token given in the response?
That seems just as good if they're adding room subscriptions just to get more timeline
events.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, so this is really a (bit) of an abuse of the SS API so the client can avoid doing e.g. 20 pagination requests simultaneously.
if ignore_timeline_bound: | ||
# If we're ignoring the timeline bound we *must* set limited to | ||
# true, as otherwise the client will append the received events | ||
# to the timeline, rather than replacing it. | ||
limited = True |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
One edge case this doesn't handle is when there are only a few events in the room total. This will claim limited = True
even if the timeline_limit
fetches all events in the room.
if room_status.timeline_limit is not None and ( | ||
room_status.timeline_limit < room_sync_config.timeline_limit | ||
): | ||
# If the timeline limit has been increased since previous | ||
# requests then we treat it as request for more events. We do | ||
# this by sending down a `limited` sync, ignoring the from | ||
# bound. | ||
ignore_timeline_bound = True |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
One edge case we thought of during our weekly meeting, is that it seems like it's only possible to bump the timeline_limit
once to get a batch of timeline messages. So if you initially request everything with timeline_limit: 1
and then timeline_limit: 20
to get a bunch of timeline messages, requesting with timeline_limit: 20
again won't give you a batch of timeline again. You would have to request timeline_limit: 21
, etc into infinity every time you want to fetch the initial batch of timeline.
This comes into play even with ElementX because it grows and shrinks it's range over time and will want to fetch timeline for rooms that comes into view again.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this is actually covered in the PR, and I need to add tests, but: we when the timeline limit is reduced we update the timeline limit we stored, so we do correctly handle increase -> decrease -> increase
.
I'm going to close this PR in favour of: #17579, there's been a lot of changes to the underlying code, so I thought it'd make sense to start from fresh |
This supersedes #17503, given the per-connection state is being heavily rewritten it felt easier to recreate the PR on top of that work. This correctly handles the case of timeline limits going up and down. This does not handle changes in `required_state`, but that can be done as a separate PR. Based on #17575. --------- Co-authored-by: Eric Eastwood <[email protected]>
EX will have a low timeline limit in the standard room list, but will then use room subscriptions to request more history for certain rooms (i.e. the rooms the client things are at the top of the room list).
So if we see a timeline limit being increased for a room, then we need to send down some historical events. This is done by ignoring any from token for the timeline section and setting
limited=true