
Update frozendict 2.3.3 -> 2.3.4 #13955

Merged (2 commits) on Oct 7, 2022

Conversation

DMRobertson (Contributor) commented Sep 29, 2022

This claims to fix more memory leaks, though from testing (see below) it apparently doesn't help people who are suffering pain when using presence.

Could have automated this upgrade with #11828 if we wanted.

DMRobertson (Contributor, Author) commented Sep 29, 2022

I asked @tomsisk if he'd be willing to try a build with this frozendict bump. If you get a chance to do so, could you let me know what you find as a comment on this PR? (Even if there is no difference, it'd still be good to know.)

Let me know if you need a hand building the docker container. I think you want:

# from a Synapse checkout
git fetch origin && git checkout dmr/try-frozendict-bump
DOCKER_BUILDKIT=1 docker build -f docker/Dockerfile -t synapse:1.68.0-frozendict-2.3.4 .
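
If it's easier, here is a sketch of running the resulting image afterwards; the /data volume and port 8008 follow the stock Synapse Docker setup, and the paths and container name are placeholders, so adjust them to match your existing deployment:

# run the locally built image against your existing Synapse data directory
# (/path/to/synapse/data is a placeholder; it should contain your homeserver.yaml)
docker run -d --name synapse-frozendict-test \
    -v /path/to/synapse/data:/data \
    -p 8008:8008 \
    synapse:1.68.0-frozendict-2.3.4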

DMRobertson added the X-Needs-Info ("blocked awaiting information from the reporter") label on Sep 30, 2022
tomsisk commented Oct 1, 2022

I updated this last night and used it for about an hour with a couple of users. There was nothing of note: presence was working and the service was responsive. So I let the service run overnight with presence on and we never saw a huge memory spike, but not many people are online during off hours. By the morning the service was still running, albeit extremely slowly. There was nothing weird in CPU or memory usage, but it was taking up to 60 seconds to send a message, log in, etc. It was running, but unusable, so I had to turn presence off. The slowness is what we normally see after just a couple of minutes if we turn presence on during the day. In those cases I've never been able to see a huge memory spike like we have in the past, but it seems those may have been connected to specific users.

Question: would it hurt anything if I truncated the presence_stream table? I don't know whether it would help anything either, but I noticed there is a lot of data going back a few years that hasn't been updated.
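
For concreteness, here is a sketch of what that operation would look like on a PostgreSQL-backed homeserver; the service name, database name, and user below are placeholders, and whether this is safe to do at all is exactly the question above:

# stop the homeserver first so nothing is writing to the table
systemctl stop matrix-synapse
# remove all rows from presence_stream ("synapse"/"synapse_user" are placeholder names)
psql -U synapse_user -d synapse -c 'TRUNCATE TABLE presence_stream;'
systemctl start matrix-synapse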

Back to the topic at hand: even with presence off, there still appears to be a small leak somewhere; I'm hoping these graphs help.
Here's a spike in memory:
[screenshot: memory usage spike]

Which coincides with a spike in _handle_new_device_update_async calls:
[screenshot: spike in _handle_new_device_update_async calls]

Now if we look at the cache memory, we see an increase in the getEvent cache, but after 30 minutes the metrics say this memory was released. There is no corresponding drop in memory in the first image.
[screenshot: getEvent cache memory]
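
As a side note, the same numbers can be pulled straight from the metrics endpoint if Grafana isn't handy; the port and path below are the commonly documented defaults for a Synapse metrics listener and may differ from your setup, and process_resident_memory_bytes is the standard Prometheus process metric rather than anything Synapse-specific:

# scrape the metrics listener directly (port 9000 and /_synapse/metrics are assumptions)
curl -s http://localhost:9000/_synapse/metrics | grep process_resident_memory_bytes
# list whatever cache metrics mention getEvent, without guessing exact metric names
curl -s http://localhost:9000/_synapse/metrics | grep getEvent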

DMRobertson self-assigned this on Oct 6, 2022
DMRobertson (Contributor, Author) commented

@tomsisk Argh, sorry to hear this didn't help. I'll open a new issue based on your comment---it contains useful diagnostics, though it does seem like this is going to be very tricky to pin down. (I probably shouldn't have started a discussion on a PR anyway, sorry.)

DMRobertson removed the X-Needs-Info label on Oct 6, 2022
DMRobertson marked this pull request as ready for review on Oct 6, 2022 22:43
DMRobertson requested a review from a team as a code owner on Oct 6, 2022 22:43
DMRobertson (Contributor, Author) commented

Going to assume the Complement failure is a flake (partial joins, device lists...).
