toDevice messages on matrix.org can take 14.1s to propagate. #9564
Comments
Note that the linked Element Web issue has Synapse logs as well. At the moment it looks to have been tracked down to the toDevice message not waking up the synchrotron for ~14s. We do at least try to do so: synapse/synapse/handlers/devicemessage.py Lines 237 to 239 in fc8b3d8
We also talked about increasing Jaeger's monitoring on matrix.org to try and catch the full path of the request as it goes through the system (it's currently only at 10%). We also don't have the encryption worker in Jaeger's services at the moment; adding it would be a good first fix.
From the team sync today, we think getting our Jaeger deployment into a place where it's useful for solving this could turn into a pretty significant rabbit hole, and we're at capacity for rabbit holes at the moment. The hope is that we'll be able to come up for air in the next ~1-2 weeks and consider digging in more deeply, but in the meantime ara4n is going to do some manual digging and see if he turns up anything useful.
related: #9533
So i've been trying to repro this today with:
Most of the messages are <1s; about 1 in 30 takes ~5 seconds, but (after waiting until later in the day for things to get busier) I just had the one above take 8 seconds:
I'll try again at 9-10pm UK time tonight, which is when I was seeing the worst behaviour before. However, it may be worth chasing this particular request, given 8s feels surprisingly long, to see if it has the same symptoms as the ones I was tracking in element-hq/element-web#8376
On the encryption worker:
On synchrotron5:
We can see from the No idea, in short. It implies the synchrotron didn't know about the new to-device event, but it's unclear why. There's no particular evidence of GC, or delays in the events replication stream at least (which is a bit more verbose than the to_device stream, so easier to trace). So this is consistent with element-hq/element-web#8376. #9959 might help diagnose it a bit.
@ara4n reported another delayed message:
On the encryption worker:
On synchrotron8:
So we can tell that the replication notification is making it as far as the synchrotron, but the /sync request isn't waking up for some reason (or it took 6 seconds between wakeup and return). Better add some more debug...
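For anyone following along, here is a minimal sketch of the wake-up path being discussed: a replication notification arrives, and a waiting long-poll request should return promptly. This is an illustration only and is not Synapse's code; `ToyNotifier`, `on_new_position` and `wait_for_position` are hypothetical names, and it uses plain asyncio rather than Twisted.

```python
import asyncio


class ToyNotifier:
    """Hypothetical stand-in for a wake-up mechanism; NOT Synapse's Notifier."""

    def __init__(self) -> None:
        self._position = 0  # latest stream position this process has seen
        self._changed = asyncio.Condition()

    async def on_new_position(self, position: int) -> None:
        # Called when a replication notification reaches this process.
        async with self._changed:
            self._position = max(self._position, position)
            self._changed.notify_all()  # wake every waiting long-poll

    async def wait_for_position(self, since: int, timeout: float) -> int:
        # Long-poll: return as soon as the stream advances past `since`,
        # or return the unchanged position once `timeout` elapses.
        async with self._changed:
            try:
                await asyncio.wait_for(
                    self._changed.wait_for(lambda: self._position > since),
                    timeout,
                )
            except asyncio.TimeoutError:
                pass
            return self._position


async def demo() -> tuple[int, float]:
    notifier = ToyNotifier()
    loop = asyncio.get_running_loop()
    start = loop.time()
    # A /sync-like request starts waiting before the message arrives.
    waiter = asyncio.create_task(notifier.wait_for_position(since=0, timeout=5.0))
    await asyncio.sleep(0.05)
    await notifier.on_new_position(1)  # the notification arrives
    position = await waiter
    return position, loop.time() - start


if __name__ == "__main__":
    position, elapsed = asyncio.run(demo())
    print(f"woke at position {position} after {elapsed:.2f}s")
```

In this toy model the waiter returns shortly after the notification rather than at the 5s timeout. The symptom above looks like the opposite case: the notification is logged on the synchrotron, but the waiting request does not return until several seconds later.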
I think this is basically fixed by #10124, though we'll leave some of the debugging in place to help trace it if it turns up again.
element-hq/element-web#8376
This has been around for years, but bites me surprisingly frequently (even on test accounts!), which makes me assume it might be quite a widespread problem (and perhaps contributes to UISIs and encryption delays).