
/sync doesn't inform clients when there is a gap in the timeline #16463

Closed
erikjohnston opened this issue Oct 11, 2023 · 12 comments · Fixed by #16485
Labels
A-Federation, A-User-Experience, O-Uncommon (most users are unlikely to come across this or unexpected workflow), S-Major (major functionality / product severely impaired, no satisfactory workaround), T-Defect (bugs, crashes, hangs, security vulnerabilities, or other reported issues)

Comments

@erikjohnston
Member

If there has been e.g. a netsplit in a room during which the remote servers have sent more than ten events, then when the netsplit resolves the local server will fetch up to ten events. The remaining events will only be backfilled when a client makes an appropriate call to /messages.

The syncing clients will receive the new events down their timeline; however, the limited flag is not set. This means the client thinks there is no gap, and so won't try to backfill any missed messages.

The net result is that clients won't ever see the older events in the remote fork.
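
For reference, the timeline section of a /sync response for a joined room looks roughly like this (a minimal illustration based on the client-server spec, shown as a Python dict; the token value is made up):

```python
# Shape of the "timeline" object for a joined room in a /sync response,
# shown as a Python dict purely for illustration (token value is made up).
timeline = {
    "events": [
        # ...only the most recent events, up to the timeline limit...
    ],
    # When true, the client knows this is a partial view and should call
    # /messages with prev_batch to fill the gap. In the scenario above it
    # stays false, so clients never fetch the missing remote events.
    "limited": False,
    "prev_batch": "t123-456_789",
}
```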

@erikjohnston
Member Author

I think the fix here is to somehow detect that we have a backwards extremity between the prev_batch and current token, and if so set the limited flag (ensuring we don't send down any events from before the potential gap).
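
A very rough sketch of that idea (the store helper and return shape here are invented for illustration, not Synapse's actual API):

```python
async def timeline_for_sync(store, room_id, since_token, now_token, events):
    """Hypothetical sketch of forcing `limited` when a gap may exist.

    `store.first_backward_extremity_between` is an invented helper meaning
    "the earliest event in (since_token, now_token] whose prev_events we
    never received".
    """
    gap_event = await store.first_backward_extremity_between(
        room_id, since_token, now_token
    )
    if gap_event is not None:
        # Don't send anything from before the potential gap, and tell the
        # client the timeline is limited so it backfills via /messages.
        events = [e for e in events if e.stream_ordering >= gap_event.stream_ordering]
        return events, True  # limited
    return events, False
```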

@clokep
Member

clokep commented Oct 11, 2023

Not directly related, but MSC3871 talks a bit about detecting gaps in timelines. (Although from a /messages POV.)

@clokep
Member

clokep commented Oct 11, 2023

> The syncing clients will receive the new events down their timeline; however, the limited flag is not set. This means the client thinks there is no gap, and so won't try to backfill any missed messages.

Why is the limited flag not set? Is it because 10 events is less than the default 20 events that clients usually request?

@clokep
Member

clokep commented Oct 11, 2023

> The remaining events will only be backfilled when a client makes an appropriate call to /messages.

Would #13576 help at all? (Proactively backfilling more events.)

@erikjohnston
Member Author

> The syncing clients will receive the new events down their timeline; however, the limited flag is not set. This means the client thinks there is no gap, and so won't try to backfill any missed messages.

> Why is the limited flag not set? Is it because 10 events is less than the default 20 events that clients usually request?

Yup, I believe so.

> The remaining events will only be backfilled when a client makes an appropriate call to /messages.

> Would #13576 help at all? (Proactively backfilling more events.)

It'd mitigate this a bit, yes.

@clokep
Member

clokep commented Oct 11, 2023

I guess a band-aid might be to try to backfill 100 events or something. This won't work, though, when the netsplit is only 11 events.


I'm having some trouble tracking how the backfill would even work here. Though currently the homeserver wouldn't attempt to backfill those until the user backscrolled (via /messages) to where, topologically, the netsplit occurred, correct? (What I'm asking is: even with the limited flag set, how would a client easily get those messages, given the split in ordering between /messages and /sync -- see matrix-org/matrix-spec#852?)

@erikjohnston
Member Author

Oh, hmm.

Firstly: I care less right now if you have to backpaginate a bunch before you actually do a backfill, but agreed that is sucky.

Secondly, I think that if clients call /messages with from set to prev_batch and to set to the sync token returned by the previous sync, we'll never actually paginate to the topological ordering where we'd trigger backfill. I'm not sure if clients actually set a to?
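
For reference, the pagination pattern in question would look roughly like this from a client (placeholder homeserver, room ID, and tokens; whether real clients actually pass to is exactly the open question):

```python
import requests

# Hypothetical gap-fill request: paginate backwards from the prev_batch token
# returned by /sync, optionally stopping at the token from the previous sync.
resp = requests.get(
    "https://example.org/_matrix/client/v3/rooms/!room:example.org/messages",
    params={
        "from": "PREV_BATCH_FROM_SYNC",    # start of the gap
        "to": "TOKEN_FROM_PREVIOUS_SYNC",  # optional stop point
        "dir": "b",                        # paginate backwards
        "limit": 50,
    },
    headers={"Authorization": "Bearer ACCESS_TOKEN"},
)
messages = resp.json()
```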

@clokep
Member

clokep commented Oct 11, 2023

> I'm not sure if clients actually set a to?

I do see some hits for it, but they are all either a 401 error with a missing access token, or cases where the client disconnects before a response is sent?

$ grep "to=" synchrotron*.log | grep -v " 401 " | grep -v "already disconnected"

@erikjohnston
Member Author

Oh, that's exciting!

@clokep
Member

clokep commented Oct 11, 2023

> I think the fix here is to somehow detect that we have a backwards extremity between the prev_batch and current token, and if so set the limited flag (ensuring we don't send down any events from before the potential gap).

We talked a bit about this on the phone; it might work to check something about the topo ordering of the first (or last?) event sent down sync vs. the latest topo ordering of the room. I'm unsure whether this would work.

Maybe another check would be whether any of the events in the room since the since token have a negative stream ordering, or a prev event with a negative stream ordering?
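
Something like this, perhaps (the store helpers are invented for illustration; it leans on the assumption that backfilled events get a negative stream ordering):

```python
async def might_have_gap(store, room_id, since_token, now_token):
    """Hypothetical check for the idea above: look for events in the sync
    window whose prev events are missing or were only backfilled (negative
    stream ordering). Both store helpers are invented for illustration."""
    events = await store.get_room_events_between(room_id, since_token, now_token)
    for event in events:
        prev_orderings = await store.get_stream_orderings_for(event.prev_event_ids())
        if any(o is None or o < 0 for o in prev_orderings):
            return True  # a prev event is missing or backfilled: possible gap
    return False
```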

@erikjohnston
Member Author

erikjohnston commented Oct 11, 2023

It looks like, when the limited flag is artificially forced, Element Web at least doesn't use the to parameter, so that might still be enough for a short-term fix. It seems to throw away the entire timeline and paginate it back in?

@erikjohnston
Member Author

My plan is to add a table that records whenever we receive an event over federation for which we don't have (and haven't fetched) the prev events, i.e. where we have a "gap". Then we use that to force the limited flag.
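
Roughly like this (schema and query are illustrative only, and do not necessarily match what the eventual fix in #16485 does):

```python
# Illustrative only: a possible shape for the proposed "gap" table and the
# check /sync would make against it.

CREATE_TIMELINE_GAPS = """
CREATE TABLE timeline_gaps (
    room_id TEXT NOT NULL,
    event_id TEXT NOT NULL,          -- event received without its prev_events
    stream_ordering BIGINT NOT NULL  -- where the gap sits in the live stream
);
"""

async def room_has_gap_between(db, room_id, since_ordering, now_ordering):
    """Return True if a recorded gap falls inside the sync window, in which
    case /sync should set the limited flag. `db.fetchone` stands in for
    whatever database wrapper is in use."""
    row = await db.fetchone(
        "SELECT 1 FROM timeline_gaps"
        " WHERE room_id = ? AND stream_ordering > ? AND stream_ordering <= ?",
        (room_id, since_ordering, now_ordering),
    )
    return row is not None
```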

@reivilibre added the A-Federation, S-Major, T-Defect, O-Uncommon, and A-User-Experience labels on Oct 12, 2023
erikjohnston added a commit to matrix-org/complement that referenced this issue Oct 19, 2023