-
Notifications
You must be signed in to change notification settings - Fork 188
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Prevent locking up while processing batched_auth_events #16968
Prevent locking up while processing batched_auth_events #16968
Conversation
@anoadragon453 @jaywink @erikjohnston Would you mind taking a look at this PR when you get a chance? Any feedback would be appreciated! I've been running the build with the latest changes for three days now without issues. |
Hi @ggogel. Hugely appreciate you debugging this issue, pushing up a fix and confirming that it works on your own deployment. I apologise that we've been so slow to respond over the past couple weeks. It's a direct symptom of a recently reduced team size and a separate (internal) project that's taking up most of the team's time. We'll be freed up again in a couple weeks, but until then this is where we're at. Regardless, this is a small change and we'll find time to review it before the next release cycle (v1.103.0rc1). Thanks again for your patience. I've approved CI to run on this PR. Could you fix up the lint/typechecking in the meantime? |
Hi @anoadragon453, no need to apologize at all. I completely understand that team priorities can shift and that everyone's plate can get pretty full at times. I appreciate your acknowledgment of the fix. I got the linting and type checking sorted out. Looks like the CI workflow needs approval again. Looking forward to collaborating further on this and getting it ready for the upcoming release cycle. |
Linting and type checking are good now but it timed out during sytest.
|
The typing change in dfac2ca caused the issue to reappear. - auth_events = batched_auth_events
+ auth_events = dict(batched_auth_events) If someone can shed light on why employing |
I've found a fix that passes My best guess so far is that |
I'm testing the latest build 3e642ed by repeatedly joining and leaving rooms. Successful:
Keeps failing:
I'm unsure if it's failing where the changes are or somewhere else. I'm working on yet another change at the moment... |
I've finally found a solution that has good performance and makes My initial working solution was the fastest because no copying was done at all. I simply assigned the dictionary to another variable. However, due to
Then I changed it to use
Since
After that, I tried the following approach of copying each item one by one into a new dictionary. This was much faster, but still caused issues with the humongous room #matrix:matrix.org. It was locked for 2 minutes and got killed by the liveness probe. Joining this room seems to be the most difficult benchmark for synapse. All other rooms I tried, worked fine with this solution.
I explored other options such as using the dictionary unpacking operator As a side note: When using I finally ended up with this solution, which matches my initial solution but also satisfies
|
44411a2
to
4a3e6c7
Compare
4a3e6c7
to
5bd7d93
Compare
Following a discussion with @erikjohnston, I've implemented a change to utilize I've successfully tested the change on:
|
Thanks @ggogel for digging into this and trying out various of the fixes! |
# Synapse 1.104.0 (2024-04-02) ### Bugfixes - Fix regression when using OIDC provider. Introduced in v1.104.0rc1. ([\#17031](element-hq/synapse#17031)) # Synapse 1.104.0rc1 (2024-03-26) ### Features - Add an OIDC config to specify extra parameters for the authorization grant URL. IT can be useful to pass an ACR value for example. ([\#16971](element-hq/synapse#16971)) - Add support for OIDC provider returning JWT. ([\#16972](element-hq/synapse#16972), [\#17031](element-hq/synapse#17031)) ### Bugfixes - Fix a bug which meant that, under certain circumstances, we might never retry sending events or to-device messages over federation after a failure. ([\#16925](element-hq/synapse#16925)) - Fix various long-standing bugs which could cause incorrect state to be returned from `/sync` in certain situations. ([\#16949](element-hq/synapse#16949)) - Fix case in which `m.fully_read` marker would not get updated. Contributed by @SpiritCroc. ([\#16990](element-hq/synapse#16990)) - Fix bug which did not retract a user's pending knocks at rooms when their account was deactivated. Contributed by @hanadi92. ([\#17010](element-hq/synapse#17010)) ### Updates to the Docker image - Updated `start.py` to generate config using the correct user ID when running as root (fixes [\#16824](element-hq/synapse#16824), [\#15202](element-hq/synapse#15202)). ([\#16978](element-hq/synapse#16978)) ### Improved Documentation - Add a query to force a refresh of a remote user's device list to the "Useful SQL for Admins" documentation page. ([\#16892](element-hq/synapse#16892)) - Minor grammatical corrections to the upgrade documentation. ([\#16965](element-hq/synapse#16965)) - Fix the sort order for the documentation version picker, so that newer releases appear above older ones. ([\#16966](element-hq/synapse#16966)) - Remove recommendation for a specific poetry version from contributing guide. ([\#17002](element-hq/synapse#17002)) ### Internal Changes - Improve lock performance when a lot of locks are all waiting for a single lock to be released. ([\#16840](element-hq/synapse#16840)) - Update power level default for public rooms. ([\#16907](element-hq/synapse#16907)) - Improve event validation. ([\#16908](element-hq/synapse#16908)) - Multi-worker-docker-container: disable log buffering. ([\#16919](element-hq/synapse#16919)) - Refactor state delta calculation in `/sync` handler. ([\#16929](element-hq/synapse#16929)) - Clarify docs for some room state functions. ([\#16950](element-hq/synapse#16950)) - Specify IP subnets in canonical form. ([\#16953](element-hq/synapse#16953)) - As done for SAML mapping provider, let's pass the module API to the OIDC one so the mapper can do more logic in its code. ([\#16974](element-hq/synapse#16974)) - Allow containers building on top of Synapse's Complement container is use the included PostgreSQL cluster. ([\#16985](element-hq/synapse#16985)) - Raise poetry-core version cap to 1.9.0. ([\#16986](element-hq/synapse#16986)) - Patch the db conn pool sooner in tests. ([\#17017](element-hq/synapse#17017)) ### Updates to locked dependencies * Bump anyhow from 1.0.80 to 1.0.81. ([\#17009](element-hq/synapse#17009)) * Bump black from 23.10.1 to 24.2.0. ([\#16936](element-hq/synapse#16936)) * Bump cryptography from 41.0.7 to 42.0.5. ([\#16958](element-hq/synapse#16958)) * Bump dawidd6/action-download-artifact from 3.1.1 to 3.1.2. ([\#16960](element-hq/synapse#16960)) * Bump dawidd6/action-download-artifact from 3.1.2 to 3.1.4. ([\#17008](element-hq/synapse#17008)) * Bump jinja2 from 3.1.2 to 3.1.3. ([\#17005](element-hq/synapse#17005)) * Bump log from 0.4.20 to 0.4.21. ([\#16977](element-hq/synapse#16977)) * Bump mypy from 1.5.1 to 1.8.0. ([\#16901](element-hq/synapse#16901)) * Bump netaddr from 0.9.0 to 1.2.1. ([\#17006](element-hq/synapse#17006)) * Bump pydantic from 2.6.0 to 2.6.4. ([\#17004](element-hq/synapse#17004)) * Bump pyo3 from 0.20.2 to 0.20.3. ([\#16962](element-hq/synapse#16962)) * Bump ruff from 0.1.14 to 0.3.2. ([\#16994](element-hq/synapse#16994)) * Bump serde from 1.0.196 to 1.0.197. ([\#16963](element-hq/synapse#16963)) * Bump serde_json from 1.0.113 to 1.0.114. ([\#16961](element-hq/synapse#16961)) * Bump types-jsonschema from 4.21.0.20240118 to 4.21.0.20240311. ([\#17007](element-hq/synapse#17007)) * Bump types-psycopg2 from 2.9.21.16 to 2.9.21.20240311. ([\#16995](element-hq/synapse#16995)) * Bump types-pyopenssl from 23.3.0.0 to 24.0.0.20240311. ([\#17003](element-hq/synapse#17003)) # Synapse 1.103.0 (2024-03-19) No significant changes since 1.103.0rc1. # Synapse 1.103.0rc1 (2024-03-12) ### Features - Add a new [List Accounts v3](https://element-hq.github.io/synapse/v1.103/admin_api/user_admin_api.html#list-accounts-v3) Admin API with improved deactivated user filtering capabilities. ([\#16874](element-hq/synapse#16874)) - Include `Retry-After` header by default per [MSC4041](matrix-org/matrix-spec-proposals#4041). Contributed by @clokep. ([\#16947](element-hq/synapse#16947)) ### Bugfixes - Fix joining remote rooms when a module uses the `on_new_event` callback. This callback may now pass partial state events instead of the full state for remote rooms. Introduced in v1.76.0. ([\#16973](element-hq/synapse#16973)) - Fix performance issue when joining very large rooms that can cause the server to lock up. Introduced in v1.100.0. Contributed by @ggogel. ([\#16968](element-hq/synapse#16968)) ### Improved Documentation - Add HAProxy example for single port operation to reverse proxy documentation. Contributed by Georg Pfuetzenreuter (@tacerus). ([\#16768](element-hq/synapse#16768)) - Improve the documentation around running Complement tests with new configuration parameters. ([\#16946](element-hq/synapse#16946)) - Add docs on upgrading from a very old version. ([\#16951](element-hq/synapse#16951)) ### Updates to locked dependencies * Bump JasonEtco/create-an-issue from 2.9.1 to 2.9.2. ([\#16934](element-hq/synapse#16934)) * Bump anyhow from 1.0.79 to 1.0.80. ([\#16935](element-hq/synapse#16935)) * Bump dawidd6/action-download-artifact from 3.0.0 to 3.1.1. ([\#16933](element-hq/synapse#16933)) * Bump furo from 2023.9.10 to 2024.1.29. ([\#16939](element-hq/synapse#16939)) * Bump pyopenssl from 23.3.0 to 24.0.0. ([\#16937](element-hq/synapse#16937)) * Bump types-netaddr from 0.10.0.20240106 to 1.2.0.20240219. ([\#16938](element-hq/synapse#16938))
This PR aims to fix #16895, caused by a regression in #7 and not fixed by #16903. The PR #16903 only fixes a starvation issue, where the CPU isn't released. There is a second issue, where the execution is blocked. This theory is supported by the flame graphs provided in #16895 and the fact that I see the CPU usage reducing and far below the limit.
Since the changes in #7, the method
check_state_independent_auth_rules
is called with the additional parameterbatched_auth_events
:synapse/synapse/handlers/federation_event.py
Lines 1741 to 1743 in 6fa13b4
It makes the execution enter this if clause, introduced with #15195
synapse/synapse/event_auth.py
Lines 178 to 189 in 6fa13b4
There are two issues in the above code snippet.
First, there is the blocking issue. I'm not entirely sure if this is a deadlock, starvation, or something different. In the beginning, I thought the copy operation was responsible. It wasn't. Then I investigated the nested
store.get_events
inside the functionupdate
. This was also not causing the blocking issue. Only when I replaced the set difference operation (-
) with a list comprehension, the blocking was resolved. Creating and comparing sets with a very large amount of events seems to be problematic.This is how the flamegraph looks now while persisting outliers. As you can see, the execution no longer locks up in the above function.
Second, the copying here doesn't serve any purpose, because only a shallow copy is created. This means the same objects from the original dict are referenced. This fails the intention of protecting these objects from mutation. The review of the original PR matrix-org/synapse#15195 had an extensive discussion about this matter.
Various approaches to copying the auth_events were attempted:
Concluding, there is no easy way to create an actual copy of the objects. Opting for a deepcopy can significantly strain memory and CPU resources, making it an inefficient choice. I don't see why the copy is necessary in the first place. Therefore I'm proposing to remove it altogether.
After these changes, I was able to successfully join these rooms, without the main worker locking up: