Clients repeatedly losing connection for up to a minute #12547
Comments
Thanks for reporting! A little more information might be helpful: can you tell me what client(s) you are using? Approximately how long do you need to be connected before this error occurs? How often is this occurring? Can you add some logs showing the lines before and after the lines you have provided, so that there is a little context on what is happening? Do you happen to know what version of Python the homeserver is running on?
Hi, thanks for the speedy response! Everyone I've confirmed is experiencing this (including myself) is using Element on Android or desktop (macOS, Windows and Linux). The outages occur periodically throughout the day, sometimes multiple times per hour, sometimes once every few hours. One person noticed that if I haven't restarted the server for a few days, the issues occur more frequently and around the same times each day (I haven't been tracking them closely enough to confirm this). I noticed the issue at 2:48pm today and found the gc lines in the log, so I'm fairly certain these are the ones, but let's hold off on assuming they're definitely associated with the issue and wait until I catch the logging output while it's in the act (I'm just sharing in case you notice something unexpected in there).
The version of Python is 3.10.4. Thanks again! EDIT: So far today all the drops have been short enough that messages don't even seem delayed -- I'm waiting for downtime like last night's to happen again, where it's unresponsive for half a minute or more.
OK, finally caught a minute-long disconnect in progress!
EDIT: and I'm currently in the middle of another one
Here's another one -- this time there was only gc 1, and there was log output afterwards while still disconnected (unless the client just took a second to retry), but it was still offline for about 30 seconds. Confirmed offline timestamp for this one is
Looks like at least one user on Element for iOS has seen the issue with that client too -- she gets a timeout after a lengthy period of syncing.
Both of these were taken during a connectivity drop (so it was down during the last line) -- interestingly the first one doesn't have the gc lines:
I was watching bandwidth usage on the server while connectivity was gone, and it was less than 10kb/s in either direction before jumping to 300kb/s down and 30kb/s up a second before connectivity was restored. This error appeared again during that outage:
Thanks for the logs, very helpful! I have a wild hunch that sync is implicated here but need to do more digging. One more thing: can you provide the exact error message the client is showing when connectivity is lost?
Specifically, I think that the processing of the /sync response is taking longer than the client timeout and the connection is getting dropped as a result.
I think you might be right-- I hadn't realized the client included error logs, but it seems like they're full of sync errors:
I have lots more, but considering it just repeats again and again I assume that's probably enough? (Let me know if not.) EDIT: I should add that I'm able to SSH into my server without trouble while this is going on, and the bandwidth and CPU usage are low (memory use is always fairly high, though I added 2 gigs to the swap today and it didn't use much more or change anything).
Alright, so it appears we are seeing some mysteriously long /sync requests in the logs, but we don't have enough information to determine why the processing of those requests is taking so long. Two things might be helpful here: first, if you would be willing to provide logs for a longer timeframe (4-24 hours) within which you're sure there were a few timeouts (you can upload them as a .txt file), that would be very useful. Second, do you have Grafana or any other sort of metrics/monitoring enabled on your homeserver? If not, it might be something you want to consider, and it would possibly help in debugging this situation.
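Roughly speaking, enabling the metrics endpoint is a matter of adding a metrics listener to homeserver.yaml and pointing Prometheus at it -- something like the sketch below (the port, bind address and job name are just examples, not values from your setup):

```yaml
# homeserver.yaml -- add a dedicated metrics listener to the existing listeners list
enable_metrics: true
listeners:
  - port: 9400                     # example port, pick any free local port
    type: metrics
    bind_addresses: ['127.0.0.1']  # keep it off the public interface

# prometheus.yml -- example scrape job pointing at that listener
scrape_configs:
  - job_name: synapse
    metrics_path: /_synapse/metrics
    static_configs:
      - targets: ['127.0.0.1:9400']
```

The metrics how-to in the Synapse docs walks through the rest, including importing the bundled Grafana dashboard.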
To add onto what Shay said, Synapse reports timings for many processes, including GC durations, as part of its Prometheus metrics. It's a little suspicious that you've seen an instance of connectivity drops without any GC log lines. Looking at the metrics would allow us to confirm or rule out GC as the source of the issue.
Thanks for the quick responses, I really appreciate the help in tracking down the source of this issue! I agree that it's suspicious that there are drops without GC log lines -- most of them have them, and all of the ones I'd checked before the ones I posted did have them, so I'd assumed that was to blame, but now I'm not so sure. I don't currently have any sort of metrics/monitoring running, but I can look into setting that up soon. For now, here are the client and server logs for the 27th (which had lots of outages). I redacted usernames, channel names, emails and some server addresses using regex for the privacy of the people on my server, so let me know if you run into anything I inadvertently clobbered. http://96.126.108.7:90/client-2022-04-27.log Thanks again!
Not sure if it's related, but the server was inaccessible for a chunk of the morning today, and it hung attempting to restart the service. Logs showed this after hanging:
Right, thanks for the logs. I am seeing lots of "No state group for unknown or outlier event" errors, which is related to some validation that was added recently. Can you run this query on your db and let me know the result? `select * from event_forward_extremities efe left join events e using (event_id) where e.event_id is null;`
Those issues do sound very similar! On that note, I ran that select query and nothing was returned.
I think this is essentially #12549. As outlined in that issue, seeing many "No state group for unknown or outlier event" errors is the result of database corruption that may have happened long ago but is now being highlighted by some recently added validation logic. There are some suggestions in that issue on how to mitigate the situation, and in light of that I am going to close this issue in favor of #12549, but feel free to ask more questions there if you can't resolve the problem.
Hmm, I only migrated to PostgreSQL two months ago, and it was good for the first month -- it'd be weird to run into corruption so quickly, wouldn't it? I'll take a look at the other issue though, thanks for pointing me there!
Still getting disconnects with 1.55.2, but so far they're shorter and less frequent.
What's your thinking here, Shay? I think(?) we'd expect borked data to result in an error response rather than a dropped connection. (Edit: there probably is DB corruption, but I'm not sure it explains the reported symptoms.)
Good question: I think sticking with 1.57 makes the most sense, since that is what the issue was opened with and what you provided logs for.
It seems like the outages are slightly less frequent, though I've noticed that sending messages sometimes takes up to 30 seconds now, so maybe it's just that the client isn't noticing them as quickly? EDIT: When I send a message and it's taking 30+ seconds, I'll get the "connectivity to the server has been lost" message and then a loading spinner before the send goes through.
So it seems like Synapse works well for an hour or two after restarting before sending messages starts to lag like crazy, and then a bit after that the drops start happening again too. Reducing the load made the curve a bit more gradual and easier to chart, and it seems to imply that something in Synapse is making it less and less efficient over time.

In case it helps, a bit more background: a couple of months ago I migrated from SQLite to Postgres and things were suddenly lightning fast; previously messages would take 2-3 seconds to send, which makes sense considering it was interacting with a single file, but afterwards everything was immediate and snappy. Things remained lightning fast for about a month and a week, until a week before I created this issue and a few days after I'd upgraded Synapse to 1.55.

Since creating this issue I've done the recommended database maintenance (rebuilding the index, compressing state, etc.) and while it dropped the size down like crazy, things have slowly gotten worse. Also, in case it matters, I don't have an expiry set for old content (so I believe there are about 3 years of content in the database).

Would it be worth collecting more logs since switching presence off?
Hmm, so two things might be helpful here: collecting another set of logs, this time with the logging level set to debug; and, as was mentioned before, if at all possible it would be very helpful to add monitoring to your Synapse instance: https://matrix-org.github.io/synapse/latest/metrics-howto.html.
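For the debug logging, the change goes in the log config file that homeserver.yaml points at via log_config, not in homeserver.yaml itself -- roughly the snippet below (the handler name is whatever your generated config already uses, so treat it as a placeholder):

```yaml
# log config file referenced by log_config in homeserver.yaml
# only the levels shown need to change; keep the generated formatters/handlers as-is
loggers:
  synapse:
    level: DEBUG        # turn on debug logging for all synapse.* loggers
  synapse.storage.SQL:
    level: INFO         # keep per-query SQL logging quieter unless it's specifically needed
root:
  level: INFO
  handlers: [file]      # placeholder: use the handler name from your existing config
```

Debug logs grow quickly, so it's worth leaving this on only long enough to capture a couple of the outages.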
Hey, I migrated to a new server with better specs on the chance that slow I/O or something was to blame, and while I haven't set Prometheus up yet (that's next on my list!), it does have some basic monitoring included. I'm still getting timeouts, but I can see now that they're associated with higher CPU usage; they're also a bit less frequent and the duration is shorter. There were timeouts happening all through the higher CPU usage there, and they stopped (for the moment) once CPU usage dropped. I noticed that both Synapse and Postgres are using more CPU in these situations. Once I get Prometheus up I'll collect some logs alongside so you have a complete picture.
So I'm running a puppet bridge on the same server, and this is what the log shows when the connection drops:
So I have debug-level logging enabled and Prometheus + Grafana (with synapse.json) running -- I'll be working tomorrow and can keep Element running on a screen on the side to see what the connectivity is like. When things go bad, should I be watching for something in Grafana, or is there a particular thing that would be useful to export?
Interesting, I refreshed Grafana during an outage and got a broken pipe error in Synapse:
Got this in the logs during an outage:
Here's the log with debugging output of the latest outage -- seems like the log goes from outputting messages every so many milliseconds to nothing for over a minute:
That's curious. It's interesting to see that there's a gap in the dashboard data during the outage as well? I'm not sure what to make of it yet.
Completely random thought, but I've been noticing that restarting Synapse only helps a bit while restarting the server helps a lot, and it occurred to me that I've allocated swap space that appears to be getting used... That wouldn't impact performance enough to cause this type of drop, would it? Right after setting the new server up there were a bunch of outages without swap allocated, but then it was also catching up with the sync, and I allocated it right after that... Maybe I'll try without and see how it goes.
The dashboard dropping in and out is very suspicious...
Running out of memory might explain a lot. Could you provide the first couple of lines of output from |
Disabling swap didn't help -- usually RAM is at around 80% and ~500 megs of swap are used. I've re-enabled swap and will report back once things start slowing down again.
So after 8 hours and a number of outages, it seems like I have:
Also, sorry, I didn't mean the dashboard itself was dropping in and out; rather, the screenshot of the dashboard is showing what Synapse looked like while dropping in and out.
Caught some more debugging output during a shorter outage-- once again it seems like the log stops outputting for multiple seconds when it normally outputs every few milliseconds.
I've also increased the available swap space to 6 gigs (meaning there's a total of 8 gigs available to Synapse) -- currently there's 4.7 gigs free.
Considering there seems to be available CPU and Postgres doesn't seem to be getting stressed, would it make sense to add a worker to take some load off the main process? EDIT: I'm now running two workers: one for client requests and one for federation requests. I'll report back on how it goes!
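For reference, the client worker config I'm using looks roughly like this (the worker name, ports and log path are placeholders, and the reverse proxy also needs to route the client API paths to the worker's port):

```yaml
# client_worker.yaml -- a generic worker handling client requests
worker_app: synapse.app.generic_worker
worker_name: client_worker            # placeholder name

# where the main process exposes its replication listener (configured in homeserver.yaml)
worker_replication_host: 127.0.0.1
worker_replication_http_port: 9093

worker_listeners:
  - type: http
    port: 8083                        # placeholder port the reverse proxy forwards to
    resources:
      - names: [client]

worker_log_config: /etc/matrix-synapse/client_worker_log_config.yaml  # placeholder path
```

The federation worker is the same shape, with `federation` in the listener's resource names instead of `client`.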
Hmm, something seems to have caused the client worker to crash:
EDIT: Trying workers again today -- no crash so far, but definitely still getting drops and lag. EDIT 2: Hit the same crash again.
So here's a question... If I blow away the database and assets and start fresh, am I going to have any issues with federation moving forward, considering other servers would think people from mine were already connected to their channels and whatnot? If I'm the only one having these issues, the only common threads between the previous server and the new one are the database and the config.
Hey, so I have an update: one of the people on my server wanted to make some backups before I reset the database but needed the server a bit more stable first, so I was going to whitelist his IP and block everyone else. I initially started with just the federation block to test, though, and without federation enabled everything is running perfectly. No delays and no outages for the last 12 hours.
@H-Shay Hey, figured I'd give you a mention since this should narrow down the surface of the bug pretty considerably ^^ -- federation is still enabled, I've just blocked outgoing/incoming traffic in the reverse proxy, and I've had zero issues. I'm not sure if there's a bug in the federation logic, or possibly a malicious/buggy server? It's possible my server isn't beefy enough to handle a bunch of federation stuff, too, but the resources are far from strained on the actual server. I'm planning on wiping and replacing Synapse with Dendrite in a few days, so if you think my logs might be helpful in solving this (now that we see other people running into a similar situation?) I can help in the meantime. I have full monitoring set up at this point, so anything you need should be possible to get.
@lnicola Yeah, that seems almost identical -- does disabling federation resolve the issue for you? (Not that disabling federation is a resolution.)
One more thing that changed when federation got disabled: the database doesn't seem to be growing by nearly a gig every day anymore. I almost wonder whether a bot account connected to a malicious federated server and just dumped tons of data in? Though there was typically very little actual traffic happening.
So just to follow up: I reset with Dendrite, noticed my subdomain had actually been banned on at least one server (likely due to malicious activity), and saw tons of federation activity incoming from dozens of domains that didn't have valid certs, so I switched to my top-level domain. Dendrite had too many little bugs, so I switched back to Synapse, and on the new domain, with the same Synapse config and a reset database (with no more of what I assume were malicious bot accounts), everything has been working swimmingly.

Based on everything I've seen and the conversation @ #12730, I think we can conclude that this "bug" is actually the result of malicious bots and can be solved by not having open registration (which is a bit unfortunate). I'm going to close my issue since I've reached a resolution, but if the Synapse devs think the ability for malicious accounts to take out a server (or at least the inability to easily resolve that when it happens) is worth treating as a missing feature, it might be worth using #12730 to come up with a solution. Either way, thanks to everyone who offered help here, and for all the hard work the devs on this project have put into making such a great tool.
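For anyone landing here later, closing off open registration is roughly a one-line change in homeserver.yaml; the token-based variant shown is just one possible alternative for keeping signups available without leaving them wide open:

```yaml
# homeserver.yaml -- turn off open registration entirely
enable_registration: false

# alternative sketch: keep registration on, but gate it behind admin-issued tokens
# enable_registration: true
# registration_requires_token: true
```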
Apologies if this isn't actually a bug, but my server had been running well until updating a couple of weeks ago. Unfortunately I didn't catch the exact version, as I was travelling and didn't notice the issue until after a number of updates.
Description
Throughout the day, clients will show a message saying that they can't connect to the server for ~30 seconds to a minute. Every time I've checked the current log to see what's up it's shown the following:
Steps to reproduce

`sentinel - Collecting gc 2`

(connectivity resumes)

I've noticed that `synapse.metrics._gc - 120 - INFO - sentinel - Collecting gc 1` will appear in the logs on its own and I won't have any connectivity issues when that happens.

Version information
I could probably provide more information about behaviour if I knew what I was looking for.
Thanks for your time and for all the work on this project!