Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Device list tracking is completely broken, causing UTDs #2907

Closed
Tracked by #245
richvdh opened this issue Dec 16, 2024 · 8 comments
Closed
Tracked by #245

Device list tracking is completely broken, causing UTDs #2907

richvdh opened this issue Dec 16, 2024 · 8 comments

Comments

@richvdh
Copy link
Member

richvdh commented Dec 16, 2024

There seems to be a bit of a problem in the way we track other users' device lists.

matrix-sdk-crypto keeps a cache of other users' device lists, and relies on a notification in the device_lists section of the /sync response to let it know when a user has updated their devices (ie, that cache is stale). When we send an encrypted message to another user, we use that cache to know who to send it to.

That's ok as long as we keep doing incremental syncs. The problem is that Element Call, when it starts up, does a full /sync (ie, it doesn't pass a since parameter); in that case, the homeserver returns an empty device_lists response.

So, if another user has logged in on a new device while we were offline, we end up with a stale cache for that user's devices, and they will receive UTD messages from us.

This seems like something that needs to be fixed in matrix-js-sdk. Probably the easiest solution is to invalidate all device list caches on restart, if there is no saved sync token.

@Half-Shot
Copy link
Member

That's ok as long as we keep doing incremental syncs. The problem is that Element Call, when it starts up, does a full /sync (ie, it doesn't pass a since parameter); in that case, the homeserver returns an empty device_lists response.

This feels odd. I just had a look at a fresh start on https://call.element.io/ and I am doing a partial sync.

image

Which makes me feel like there is some missing reproduction steps we need here.

@richvdh
Copy link
Member Author

richvdh commented Dec 17, 2024

Hrm, you may well be right. I can't really comment on reproducing this from within EC -- seems like maybe something to do with being a guest user?

The rageshake contains this as the first /sync in logs-0000.log.gz:

2024-11-27T18:02:09.523Z 1 matrix FetchHttpApi: --> GET https://matrix.guest.ess-ecall-poc.ems-support.element.dev/_matrix/client/v3/sync?filter=xxx&timeout=xxx&org.matrix.msc4222.use_state_after=xxx&_cacheBuster=xxx

(ie, no since param)

@Half-Shot
Copy link
Member

Half-Shot commented Dec 17, 2024

2024-11-27T18:02:08.273Z 1 matrix Configuring rageshake persistence...
2024-11-27T18:02:08.273Z 2 matrix Element Call 49cedf1c3b45623251cc3b6f24377c9229cda795
2024-11-27T18:02:08.862Z 2 matrix (Re)starting OpenTelemetry debug reporting
2024-11-27T18:02:08.862Z 2 matrix OTLP collector disabled

Ah, so this Element Call session corresponds to 49cedf1. We were mucking about with memory stores at the time due to another bug so I wonder if this is related.

That commit predates our revert of #2811, so it's possible that's what caused this.

I'd like to see a reproduction of the bug on a more modern Element Call, because I suspect this might be ancient history now (hopefully!)

@Half-Shot
Copy link
Member

This seems like something that needs to be fixed in matrix-js-sdk. Probably the easiest solution is to invalidate all device list caches on restart, if there is no saved sync token.

I'm minded to close this issue until we have more data, as I believe ECall now syncs "correctly". There probably is an issue for the JS-SDK about issues that may come from using the Memory Store, but I don't think this is an ECall issue?

@richvdh do you concur?

@richvdh
Copy link
Member Author

richvdh commented Dec 17, 2024

oh, that's a bit frustrating. It took me quite a bit of digging to figure this out.

Still, if Element-Call no longer exhibits this behaviour, and the instance that @toger5 used to cause it has been updated, then yes, let's close it.

One question: #2811 is itself a revert. Do you mean that the revert was later reverted, or that #2811 fixed this? In other words, please can you be explicit about the PR that you think fixed this?

@Half-Shot
Copy link
Member

Right, yes let's clear that up.

@toger5
Copy link
Contributor

toger5 commented Dec 17, 2024

The issue you dug up is still an actual issue with the js-sdk? Even if EC is not using MemoryStore anymore any other app with memory store would run into the same problem right?
We should move it over to the js-sdk.

@richvdh
Copy link
Member Author

richvdh commented Dec 17, 2024

@toger5 opening a separate issue against the js-sdk sounds like a great idea.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants