Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Prevent locking up while processing batched_auth_events #16968

Merged
merged 30 commits into from
Mar 12, 2024

Conversation

ggogel
Copy link
Contributor

@ggogel ggogel commented Feb 27, 2024

This PR aims to fix #16895, caused by a regression in #7 and not fixed by #16903. The PR #16903 only fixes a starvation issue, where the CPU isn't released. There is a second issue, where the execution is blocked. This theory is supported by the flame graphs provided in #16895 and the fact that I see the CPU usage reducing and far below the limit.

Since the changes in #7, the method check_state_independent_auth_rules is called with the additional parameter batched_auth_events:

await check_state_independent_auth_rules(
self._store, event, batched_auth_events=auth_map
)

It makes the execution enter this if clause, introduced with #15195

if batched_auth_events:
# Copy the batched auth events to avoid mutating them.
auth_events = dict(batched_auth_events)
needed_auth_event_ids = set(event.auth_event_ids()) - batched_auth_events.keys()
if needed_auth_event_ids:
auth_events.update(
await store.get_events(
needed_auth_event_ids,
redact_behaviour=EventRedactBehaviour.as_is,
allow_rejected=True,
)
)

There are two issues in the above code snippet.

First, there is the blocking issue. I'm not entirely sure if this is a deadlock, starvation, or something different. In the beginning, I thought the copy operation was responsible. It wasn't. Then I investigated the nested store.get_events inside the function update. This was also not causing the blocking issue. Only when I replaced the set difference operation (- ) with a list comprehension, the blocking was resolved. Creating and comparing sets with a very large amount of events seems to be problematic.

This is how the flamegraph looks now while persisting outliers. As you can see, the execution no longer locks up in the above function.
output_2024-02-28_13-59-40

Second, the copying here doesn't serve any purpose, because only a shallow copy is created. This means the same objects from the original dict are referenced. This fails the intention of protecting these objects from mutation. The review of the original PR matrix-org/synapse#15195 had an extensive discussion about this matter.

Various approaches to copying the auth_events were attempted:

  1. Implementing a deepcopy caused issues due to builtins.EventInternalMetadata not being pickleable.
  2. Creating a dict with new objects akin to a deepcopy.
  3. Creating a dict with new objects containing only necessary attributes.

Concluding, there is no easy way to create an actual copy of the objects. Opting for a deepcopy can significantly strain memory and CPU resources, making it an inefficient choice. I don't see why the copy is necessary in the first place. Therefore I'm proposing to remove it altogether.

After these changes, I was able to successfully join these rooms, without the main worker locking up:

  • #synapse:matrix.org
  • #element-android:matrix.org
  • #element-web:matrix.org
  • #ecips:matrix.org
  • #ipfs-chatter:ipfs.io
  • #python:matrix.org
  • #matrix:matrix.org

@ggogel ggogel requested a review from a team as a code owner February 27, 2024 10:59
@CLAassistant
Copy link

CLAassistant commented Feb 27, 2024

CLA assistant check
All committers have signed the CLA.

@ggogel ggogel marked this pull request as draft February 27, 2024 16:59
@ggogel ggogel changed the title deepcopy batched_auth_events Do not create a copy of batched_auth_events Feb 27, 2024
@ggogel ggogel marked this pull request as ready for review February 28, 2024 09:23
@ggogel ggogel changed the title Do not create a copy of batched_auth_events Prevent locking up while processing batched_auth_events Feb 28, 2024
@ggogel
Copy link
Contributor Author

ggogel commented Mar 2, 2024

@anoadragon453 @jaywink @erikjohnston

Would you mind taking a look at this PR when you get a chance? Any feedback would be appreciated!

I've been running the build with the latest changes for three days now without issues.

@anoadragon453
Copy link
Member

Hi @ggogel. Hugely appreciate you debugging this issue, pushing up a fix and confirming that it works on your own deployment.

I apologise that we've been so slow to respond over the past couple weeks. It's a direct symptom of a recently reduced team size and a separate (internal) project that's taking up most of the team's time. We'll be freed up again in a couple weeks, but until then this is where we're at.

Regardless, this is a small change and we'll find time to review it before the next release cycle (v1.103.0rc1). Thanks again for your patience.

I've approved CI to run on this PR. Could you fix up the lint/typechecking in the meantime?

@ggogel
Copy link
Contributor Author

ggogel commented Mar 4, 2024

Hi @anoadragon453, no need to apologize at all. I completely understand that team priorities can shift and that everyone's plate can get pretty full at times. I appreciate your acknowledgment of the fix.

I got the linting and type checking sorted out. Looks like the CI workflow needs approval again.

Looking forward to collaborating further on this and getting it ready for the upcoming release cycle.

@ggogel
Copy link
Contributor Author

ggogel commented Mar 7, 2024

Linting and type checking are good now but it timed out during sytest.

2024-03-07 10:28:23.781 UTC [2642] FATAL:  database "sytest_template" does not exist
    Test 152 Remote users can join room by alias... WARN: DBI connect('dbname=sytest_template;host=localhost','postgres',...) failed: FATAL:  database "sytest_template" does not exist at lib/SyTest/Homeserver.pm line 456.
2024-03-07 10:28:23.790 UTC [2644] FATAL:  database "sytest_template" does not exist

@ggogel
Copy link
Contributor Author

ggogel commented Mar 8, 2024

The typing change in dfac2ca caused the issue to reappear.

- auth_events = batched_auth_events
+ auth_events = dict(batched_auth_events)

If someone can shed light on why employing dict() in this context causes a deadlock again, I would appreciate the insight.

@ggogel
Copy link
Contributor Author

ggogel commented Mar 8, 2024

I've found a fix that passes mypy and prevents the locking issue by iterating over the mapping and adding each item individually.

My best guess so far is that dict() causes the whole mapping to get locked and I believe the same happens when using the set difference operation.

@ggogel
Copy link
Contributor Author

ggogel commented Mar 8, 2024

I'm testing the latest build 3e642ed by repeatedly joining and leaving rooms.

Successful:

  • #ipfs-chatter:ipfs.ioorg
  • #archlinux:archlinux.org
  • #ecips:matrix.org
  • #fdroid:f-droid.org

Keeps failing:

  • #matrix:matrix.org

I'm unsure if it's failing where the changes are or somewhere else.

I'm working on yet another change at the moment...

@ggogel
Copy link
Contributor Author

ggogel commented Mar 8, 2024

I've finally found a solution that has good performance and makes mypy happy.

My initial working solution was the fastest because no copying was done at all. I simply assigned the dictionary to another variable. However, due to batched_auth_events being defined as Mapping and not as MutableMapping, it doesn't have an update function.

auth_events = batched_auth_events

Then I changed it to use dict() because of a type error in mypy. This is not simply a type cast, it creates a new dictionary by copying the old one.

auth_events = dict(batched_auth_events)

Since batched_auth_events can get extremely large, this can be problematic and cause the process to lock up for a while. The largest I've seen was 165k items. Depending on the room complexity this function can be called several thousand times. If it only yields every 1000th loop as implemented in #16903 this accumulates to 165 million individual item copy operations without yielding the processor once. Some debug logs:

2024-03-08 23:05:14,189 - synapse.event_auth - 192 - WARNING - _process_incoming_pdus_in_room_inner-0-$eZHQIOFFcm5JjBrgYZBNW_-5j5Q6KMohRc6DycXw5gs-$Hli_I4rl2cSIn4GUe_gLnwXsOy4LIHbldqBFPKqGMKc-$igefq-4nOjAhDi2Kc4ZSJHfXzXqP53EAn2dboERIsnQ-$p4HEbuvaaz4ssS06nUht6gjTeBJAaNWJdspPZBcP5vo- batched_auth_events length: 165451
2024-03-08 23:05:14,189 - synapse.event_auth - 211 - WARNING - _process_incoming_pdus_in_room_inner-0-$eZHQIOFFcm5JjBrgYZBNW_-5j5Q6KMohRc6DycXw5gs-$Hli_I4rl2cSIn4GUe_gLnwXsOy4LIHbldqBFPKqGMKc-$igefq-4nOjAhDi2Kc4ZSJHfXzXqP53EAn2dboERIsnQ-$p4HEbuvaaz4ssS06nUht6gjTeBJAaNWJdspPZBcP5vo- needed_auth_events length: 0

After that, I tried the following approach of copying each item one by one into a new dictionary. This was much faster, but still caused issues with the humongous room #matrix:matrix.org. It was locked for 2 minutes and got killed by the liveness probe. Joining this room seems to be the most difficult benchmark for synapse. All other rooms I tried, worked fine with this solution.

for key, value in batched_auth_events.items():
            auth_events[key] = value

I explored other options such as using the dictionary unpacking operator ** and the operator | to merge batched_auth_events and needed_auth_events into a new dict. This approach is also slow because again it copies every item. It's also slower than using .update().

As a side note: When using .update() it is important to update on the larger of both dictionaries. Since the two dictionaries in this function can't have any equal keys, the order doesn't matter here. In my tests, I always found that needed_auth_events was empty, while batched_auth_events can be very large. If we would execute update on needed_auth_events it would essentially again result in a full copy of batched_auth_events.

I finally ended up with this solution, which matches my initial solution but also satisfies mypy by using cast.

auth_events = cast(MutableMapping[str, "EventBase"], batched_auth_events)

@ggogel ggogel force-pushed the deepcopy-batched_auth_events branch from 44411a2 to 4a3e6c7 Compare March 9, 2024 07:18
@ggogel ggogel force-pushed the deepcopy-batched_auth_events branch from 4a3e6c7 to 5bd7d93 Compare March 10, 2024 08:43
@ggogel
Copy link
Contributor Author

ggogel commented Mar 11, 2024

Following a discussion with @erikjohnston, I've implemented a change to utilize ChainMap.

I've successfully tested the change on:

  • #ipfs-chatter:ipfs.io
  • #archlinux:matrix.org
  • #matrix:matrix.org

@erikjohnston
Copy link
Member

Thanks @ggogel for digging into this and trying out various of the fixes!

@erikjohnston erikjohnston merged commit 1f88790 into element-hq:develop Mar 12, 2024
38 checks passed
netbsd-srcmastr pushed a commit to NetBSD/pkgsrc that referenced this pull request Apr 2, 2024
# Synapse 1.104.0 (2024-04-02)

### Bugfixes

- Fix regression when using OIDC provider. Introduced in v1.104.0rc1. ([\#17031](element-hq/synapse#17031))


# Synapse 1.104.0rc1 (2024-03-26)

### Features

- Add an OIDC config to specify extra parameters for the authorization grant URL. IT can be useful to pass an ACR value for example. ([\#16971](element-hq/synapse#16971))
- Add support for OIDC provider returning JWT. ([\#16972](element-hq/synapse#16972), [\#17031](element-hq/synapse#17031))

### Bugfixes

- Fix a bug which meant that, under certain circumstances, we might never retry sending events or to-device messages over federation after a failure. ([\#16925](element-hq/synapse#16925))
- Fix various long-standing bugs which could cause incorrect state to be returned from `/sync` in certain situations. ([\#16949](element-hq/synapse#16949))
- Fix case in which `m.fully_read` marker would not get updated. Contributed by @SpiritCroc. ([\#16990](element-hq/synapse#16990))
- Fix bug which did not retract a user's pending knocks at rooms when their account was deactivated. Contributed by @hanadi92. ([\#17010](element-hq/synapse#17010))

### Updates to the Docker image

- Updated `start.py` to generate config using the correct user ID when running as root (fixes [\#16824](element-hq/synapse#16824), [\#15202](element-hq/synapse#15202)). ([\#16978](element-hq/synapse#16978))

### Improved Documentation

- Add a query to force a refresh of a remote user's device list to the "Useful SQL for Admins" documentation page. ([\#16892](element-hq/synapse#16892))
- Minor grammatical corrections to the upgrade documentation. ([\#16965](element-hq/synapse#16965))
- Fix the sort order for the documentation version picker, so that newer releases appear above older ones. ([\#16966](element-hq/synapse#16966))
- Remove recommendation for a specific poetry version from contributing guide. ([\#17002](element-hq/synapse#17002))

### Internal Changes

- Improve lock performance when a lot of locks are all waiting for a single lock to be released. ([\#16840](element-hq/synapse#16840))
- Update power level default for public rooms. ([\#16907](element-hq/synapse#16907))
- Improve event validation. ([\#16908](element-hq/synapse#16908))
- Multi-worker-docker-container: disable log buffering. ([\#16919](element-hq/synapse#16919))
- Refactor state delta calculation in `/sync` handler. ([\#16929](element-hq/synapse#16929))
- Clarify docs for some room state functions. ([\#16950](element-hq/synapse#16950))
- Specify IP subnets in canonical form. ([\#16953](element-hq/synapse#16953))
- As done for SAML mapping provider, let's pass the module API to the OIDC one so the mapper can do more logic in its code. ([\#16974](element-hq/synapse#16974))
- Allow containers building on top of Synapse's Complement container is use the included PostgreSQL cluster. ([\#16985](element-hq/synapse#16985))
- Raise poetry-core version cap to 1.9.0. ([\#16986](element-hq/synapse#16986))
- Patch the db conn pool sooner in tests. ([\#17017](element-hq/synapse#17017))



### Updates to locked dependencies

* Bump anyhow from 1.0.80 to 1.0.81. ([\#17009](element-hq/synapse#17009))
* Bump black from 23.10.1 to 24.2.0. ([\#16936](element-hq/synapse#16936))
* Bump cryptography from 41.0.7 to 42.0.5. ([\#16958](element-hq/synapse#16958))
* Bump dawidd6/action-download-artifact from 3.1.1 to 3.1.2. ([\#16960](element-hq/synapse#16960))
* Bump dawidd6/action-download-artifact from 3.1.2 to 3.1.4. ([\#17008](element-hq/synapse#17008))
* Bump jinja2 from 3.1.2 to 3.1.3. ([\#17005](element-hq/synapse#17005))
* Bump log from 0.4.20 to 0.4.21. ([\#16977](element-hq/synapse#16977))
* Bump mypy from 1.5.1 to 1.8.0. ([\#16901](element-hq/synapse#16901))
* Bump netaddr from 0.9.0 to 1.2.1. ([\#17006](element-hq/synapse#17006))
* Bump pydantic from 2.6.0 to 2.6.4. ([\#17004](element-hq/synapse#17004))
* Bump pyo3 from 0.20.2 to 0.20.3. ([\#16962](element-hq/synapse#16962))
* Bump ruff from 0.1.14 to 0.3.2. ([\#16994](element-hq/synapse#16994))
* Bump serde from 1.0.196 to 1.0.197. ([\#16963](element-hq/synapse#16963))
* Bump serde_json from 1.0.113 to 1.0.114. ([\#16961](element-hq/synapse#16961))
* Bump types-jsonschema from 4.21.0.20240118 to 4.21.0.20240311. ([\#17007](element-hq/synapse#17007))
* Bump types-psycopg2 from 2.9.21.16 to 2.9.21.20240311. ([\#16995](element-hq/synapse#16995))
* Bump types-pyopenssl from 23.3.0.0 to 24.0.0.20240311. ([\#17003](element-hq/synapse#17003))

# Synapse 1.103.0 (2024-03-19)

No significant changes since 1.103.0rc1.




# Synapse 1.103.0rc1 (2024-03-12)

### Features

- Add a new [List Accounts v3](https://element-hq.github.io/synapse/v1.103/admin_api/user_admin_api.html#list-accounts-v3) Admin API with improved deactivated user filtering capabilities. ([\#16874](element-hq/synapse#16874))
- Include `Retry-After` header by default per [MSC4041](matrix-org/matrix-spec-proposals#4041). Contributed by @clokep. ([\#16947](element-hq/synapse#16947))

### Bugfixes

- Fix joining remote rooms when a module uses the `on_new_event` callback. This callback may now pass partial state events instead of the full state for remote rooms. Introduced in v1.76.0. ([\#16973](element-hq/synapse#16973))
- Fix performance issue when joining very large rooms that can cause the server to lock up. Introduced in v1.100.0. Contributed by @ggogel. ([\#16968](element-hq/synapse#16968))

### Improved Documentation

- Add HAProxy example for single port operation to reverse proxy documentation. Contributed by Georg Pfuetzenreuter (@tacerus). ([\#16768](element-hq/synapse#16768))
- Improve the documentation around running Complement tests with new configuration parameters. ([\#16946](element-hq/synapse#16946))
- Add docs on upgrading from a very old version. ([\#16951](element-hq/synapse#16951))


### Updates to locked dependencies

* Bump JasonEtco/create-an-issue from 2.9.1 to 2.9.2. ([\#16934](element-hq/synapse#16934))
* Bump anyhow from 1.0.79 to 1.0.80. ([\#16935](element-hq/synapse#16935))
* Bump dawidd6/action-download-artifact from 3.0.0 to 3.1.1. ([\#16933](element-hq/synapse#16933))
* Bump furo from 2023.9.10 to 2024.1.29. ([\#16939](element-hq/synapse#16939))
* Bump pyopenssl from 23.3.0 to 24.0.0. ([\#16937](element-hq/synapse#16937))
* Bump types-netaddr from 0.10.0.20240106 to 1.2.0.20240219. ([\#16938](element-hq/synapse#16938))
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Synapse main worker locks up in sync_partial_state_room while persisting outliers
4 participants