Re-introduce: Fix `LaterGauge` metrics to collect from all servers #18791

MadLittleMods · 2025-08-07T21:49:39Z

Re-introduce: #18751 that was reverted in #18789 (explains why the PR was reverted in the first place).

Adds a cleanup pattern that cleans up metrics from each homeserver in the tests. Previously, the list of hooks built up until our CI machines couldn't operate properly, see Revert "Fix LaterGauge metrics to collect from all servers (#18751)" #18789
Fix long-standing issue with synapse_background_update_status metrics only tracking the last database listed in the config (see Re-introduce: Fix LaterGauge metrics to collect from all servers #18791 (comment))

Testing strategy

Run a full test suite: SYNAPSE_TEST_LOG_LEVEL=INFO poetry run trial tests
Run a full test suite with Postgres: SYNAPSE_POSTGRES=1 SYNAPSE_POSTGRES_USER=postgres SYNAPSE_TEST_LOG_LEVEL=INFO poetry run trial tests
Ensure it finishes without ballooning memory or the metric tests like test_exposed_to_prometheus taking long.

Dev notes

SYNAPSE_POSTGRES=1 SYNAPSE_POSTGRES_USER=postgres SYNAPSE_TEST_LOG_LEVEL=INFO poetry run trial tests.util.test_task_scheduler.TestTaskSchedulerWithBackgroundWorker.test_schedule_task

SYNAPSE_POSTGRES=1 SYNAPSE_POSTGRES_USER=postgres SYNAPSE_TEST_LOG_LEVEL=INFO poetry run trial tests.replication.tcp.streams.test_federation.FederationStreamTestCase.test_catchup

Pull Request Checklist

Pull request is based on the develop branch
Pull request includes a changelog file. The entry should:
- Be a short description of your change which makes sense to users. "Fixed a bug that prevented receiving messages from other servers." instead of "Moved X method from EventStore to EventWorkerStore.".
- Use markdown where necessary, mostly for code blocks.
- End with either a period (.) or an exclamation mark (!).
- Start with a capital letter.
- Feel free to credit yourself, by adding a sentence "Contributed by @github_username." or "Contributed by [Your Name]." to the end of the entry.
Code style is correct (run the linters)

Fix `LaterGauge` metrics to collect from all servers Follow-up to #18714 Previously, our `LaterGauge` metrics did include the `server_name` label as expected but we were only seeing the last server being reported in some cases. Any `LaterGauge` that we were creating multiple times was only reporting the last instance. This PR updates all `LaterGauge` to be created once and then we use `LaterGauge.register_hook(...)` to add in the metric callback as before. This works now because we store a list of callbacks instead of just one. I noticed this problem thanks to some [tests in the Synapse Pro for Small Hosts](element-hq/synapse-small-hosts#173) repo that sanity check all metrics to ensure that we can see each metric includes data from multiple servers. ### Testing strategy 1. This is only noticeable when you run multiple Synapse instances in the same process. 1. TODO (see test that was added) ### Dev notes Previous non-global `LaterGauge`: ``` synapse_federation_send_queue_xxx synapse_federation_transaction_queue_pending_destinations synapse_federation_transaction_queue_pending_pdus synapse_federation_transaction_queue_pending_edus synapse_handlers_presence_user_to_current_state_size synapse_handlers_presence_wheel_timer_size synapse_notifier_listeners synapse_notifier_rooms synapse_notifier_users synapse_replication_tcp_resource_total_connections synapse_replication_tcp_command_queue synapse_background_update_status synapse_federation_known_servers synapse_scheduler_running_tasks ``` ### Pull Request Checklist  * [x] Pull request is based on the develop branch * [x] Pull request includes a [changelog file](https://element-hq.github.io/synapse/latest/development/contributing_guide.html#changelog). The entry should: - Be a short description of your change which makes sense to users. "Fixed a bug that prevented receiving messages from other servers." instead of "Moved X method from `EventStore` to `EventWorkerStore`.". - Use markdown where necessary, mostly for `code blocks`. - End with either a period (.) or an exclamation mark (!). - Start with a capital letter. - Feel free to credit yourself, by adding a sentence "Contributed by @github_username." or "Contributed by [Your Name]." to the end of the entry. * [x] [Code style](https://element-hq.github.io/synapse/latest/code_style.html) is correct (run the [linters](https://element-hq.github.io/synapse/latest/development/contributing_guide.html#run-the-linters))

tests/server.py

synapse/server.py

We use `instance_id` instead of `server_name` because it's possible to have multiple workers running in the same process with the same `server_name`.

Multiple `ReplicationCommandHandler` were getting created for the same HS registering the hook twice ``` 2025-08-07 18:38:14-0500 [-] synapse.replication.tcp.handler - 386 - ERROR - process-replication-data-11 - Failed to handle command <synapse.replication.tcp.commands.PositionCommand object at 0x7fb5fbe04320> Traceback (most recent call last): File "synapse/replication/tcp/handler.py", line 384, in _unsafe_process_queue await self._process_command(cmd, conn, stream_name) File "synapse/replication/tcp/handler.py", line 397, in _process_command await self._process_position(stream_name, conn, cmd) File "synapse/replication/tcp/handler.py", line 721, in _process_position (updates, current_token, missing_updates) = await stream.get_updates_since( ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ cmd.instance_name, current_token, cmd.new_token ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ ) ^ File "synapse/replication/tcp/streams/_base.py", line 213, in get_updates_since updates, upto_token, limited = await self.update_function( ^^^^^^^^^^^^^^^^^^^^^^^^^^^ ...<4 lines>... ) ^ File "synapse/replication/tcp/streams/_base.py", line 269, in update_function result = await client( ^^^^^^^^^^^^^ ...<4 lines>... ) ^ File "synapse/logging/opentracing.py", line 929, in _wrapper return await func(*args, **kwargs) ^^^^^^^^^^^^^^^^^^^^^^^^^^^ File "synapse/replication/http/_base.py", line 232, in send_request streams = hs.get_replication_command_handler().get_streams_to_replicate() ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~^^ File "synapse/server.py", line 232, in _get dep = builder(self) File "synapse/server.py", line 756, in get_replication_command_handler return ReplicationCommandHandler(self) File "synapse/replication/tcp/handler.py", line 258, in __init__ tcp_resource_total_connections_gauge.register_hook( ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~^ homeserver_instance_id=hs.get_instance_id(), ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ hook=lambda: {(self.server_name,): len(self._connections)}, ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ ) ^ File "synapse/metrics/__init__.py", line 225, in register_hook assert existing_hook is None, ( ^^^^^^^^^^^^^^^^^^^^^ AssertionError: LaterGauge(name=synapse_replication_tcp_resource_total_connections) hook already registered for homeserver_instance_id=gZBmA. This is likely a Synapse bug and you forgot to unregister the previous hooks for the server (especially in tests). ```

synapse/storage/database.py

See #18791 (comment) This also fixes a long-standing problem where we would only track metrics for the last database listed.

MadLittleMods · 2025-08-08T00:39:33Z

synapse/storage/databases/__init__.py

+        # Track the background update status for each database
+        background_update_status.register_hook(
+            homeserver_instance_id=hs.get_instance_id(),
+            hook=lambda: {
+                (database.name(), server_name): database.updates.get_status()
+                for database in self.databases
+            },
+        )


Moved this out of the DatabasePool class because we create multiple DatabasePool when multiple databases are specified in the config. Previously, this would cause us to call register_hook multiple times for the same server_name which we assert to avoid dumb mistakes of not using the hs.get_xxx singletons. See #18791 (comment) for more background and other possible alternatives.

This also fixes a long-standing problem where we would only track metrics for the last database listed because LaterGauge previously just clobbered any previous gauge (notice the database_name label that we've added now).

MadLittleMods · 2025-08-08T00:42:06Z

synapse/server.py

+    def cleanup(self) -> None:
+        """
+        WIP: Clean-up any references to the homeserver and stop any running related
+        processes, timers, loops, replication stream, etc.
+        """
+        logger.info("Received cleanup request for %s.", self.hostname)
+
+        # TODO: Stop background processes, timers, loops, replication stream, etc.
+
+        # Cleanup metrics associated with the homeserver
+        for later_gauge in all_later_gauges_to_clean_up_on_shutdown.values():
+            later_gauge.unregister_hooks_for_homeserver_instance_id(
+                self.get_instance_id()
+            )
+
+        logger.info("Cleanup complete for %s.", self.hostname)


The major new part is this piece that cleans up metrics from each homeserver in the tests.

Previously, the list of hooks built up until our CI machines couldn't operate properly, see #18789

devonh

These changes look like they address the memory blowup issue that the previous iteration of this PR had.
Do you think we should have another set of eyes go over the changes here?

synapse/_scripts/synapse_port_db.py

MadLittleMods · 2025-08-12T15:22:32Z

changelog.d/18791.misc

@@ -0,0 +1 @@
+Fix `LaterGauge` metrics to collect from all servers.


Do you think we should have another set of eyes go over the changes here?

-- @devonh, #18791 (review)

Sounds like a good idea given the new cleanup pattern and you were the same one that reviewed the original PR.

synapse/_scripts/synapse_port_db.py

synapse/server.py

synapse/metrics/__init__.py

See #18791 (comment)

Previously, we we're yielding `g` multiple times while repeatedly adding extra metrics to the same `g` instance. See #18791 (comment)

See #18791 (comment)

synapse/metrics/__init__.py

See #18791 (comment)

I think this was accidentally removed before. Testing strategy: ``` poetry run generate_workers_map --config-path homeserver.yaml ```

MadLittleMods · 2025-09-02T17:14:51Z

Thanks for the review @devonh and @erikjohnston 🐊

…18791) Re-introduce: element-hq/synapse#18751 that was reverted in element-hq/synapse#18789 (explains why the PR was reverted in the first place). - Adds a `cleanup` pattern that cleans up metrics from each homeserver in the tests. Previously, the list of hooks built up until our CI machines couldn't operate properly, see element-hq/synapse#18789 - Fix long-standing issue with `synapse_background_update_status` metrics only tracking the last database listed in the config (see element-hq/synapse#18791 (comment))

MadLittleMods added 5 commits August 6, 2025 17:33

First pass: keep track of hooks per server

4dda2b1

Some fix-ups

505263d

Add server_name to register_hook

46e4053

Move metric clean-up to catch all servers

2e5ac44

MadLittleMods added the A-Metrics label Aug 7, 2025

MadLittleMods added 2 commits August 7, 2025 16:50

Update changelog number

2f25722

Merge branch 'develop' into madlittlemods/re-introduce-18751

8b98536

MadLittleMods commented Aug 7, 2025

View reviewed changes

tests/server.py Outdated Show resolved Hide resolved

MadLittleMods added 2 commits August 7, 2025 16:59

Try fix synapse/_scripts/synapse_port_db.py script

a34122f

Cleanup homeservers when we synapse/_scripts/generate_workers_map.py

b226f63

MadLittleMods commented Aug 7, 2025

View reviewed changes

synapse/server.py Show resolved Hide resolved

MadLittleMods added 4 commits August 7, 2025 17:21

Fix lints

75e7463

Use instance_id instead of server_name to track metrics

68bb036

We use `instance_id` instead of `server_name` because it's possible to have multiple workers running in the same process with the same `server_name`.

Fix lints

c7d1a78

MadLittleMods commented Aug 8, 2025

View reviewed changes

synapse/storage/database.py Outdated Show resolved Hide resolved

Fix multiple databases registering metrics

14aee2f

See #18791 (comment) This also fixes a long-standing problem where we would only track metrics for the last database listed.

MadLittleMods commented Aug 8, 2025

View reviewed changes

MadLittleMods marked this pull request as ready for review August 8, 2025 06:04

MadLittleMods requested a review from a team as a code owner August 8, 2025 06:04

devonh reviewed Aug 11, 2025

View reviewed changes

synapse/_scripts/synapse_port_db.py Outdated Show resolved Hide resolved

MadLittleMods commented Aug 12, 2025

View reviewed changes

MadLittleMods requested a review from a team August 12, 2025 15:22

MadLittleMods added 2 commits August 12, 2025 16:44

Merge branch 'develop' into madlittlemods/re-introduce-18751

5baa576

Merge branch 'develop' into madlittlemods/re-introduce-18751

d7946f4

MadLittleMods mentioned this pull request Aug 22, 2025

Cleanly shutdown SynapseHomeServer object #18828

Merged

3 tasks

erikjohnston reviewed Aug 27, 2025

View reviewed changes

synapse/_scripts/synapse_port_db.py Outdated Show resolved Hide resolved

synapse/server.py Show resolved Hide resolved

synapse/server.py Show resolved Hide resolved

synapse/metrics/__init__.py Outdated Show resolved Hide resolved

synapse/metrics/__init__.py Outdated Show resolved Hide resolved

Merge branch 'develop' into madlittlemods/re-introduce-18751

635dcce

MadLittleMods added 6 commits August 27, 2025 13:34

Create consistent instance_id in MockHomeserver

59701ed

See #18791 (comment)

Note when cleanup should be called

309a72d

See #18791 (comment)

cleanup when homeserver is garbage collected

371aa51

See #18791 (comment)

Only yield the metric once when we collect

08755ae

Previously, we we're yielding `g` multiple times while repeatedly adding extra metrics to the same `g` instance. See #18791 (comment)

Continue to return metrics that aren't broken

31ad15a

See #18791 (comment)

Add tests to ensure we get all metrics even if one hook throws exception

b4f06b2

MadLittleMods requested a review from erikjohnston August 27, 2025 20:02

erikjohnston approved these changes Sep 2, 2025

View reviewed changes

synapse/metrics/__init__.py Outdated Show resolved Hide resolved

MadLittleMods added 3 commits September 2, 2025 09:58

Merge branch 'develop' into madlittlemods/re-introduce-18751

9e07d37

Don't double yield the same gauge

1c1d6c2

See #18791 (comment)

Fix generate_workers_map script erroring out

595c174

I think this was accidentally removed before. Testing strategy: ``` poetry run generate_workers_map --config-path homeserver.yaml ```

MadLittleMods merged commit bff4a11 into develop Sep 2, 2025
44 checks passed

MadLittleMods deleted the madlittlemods/re-introduce-18751 branch September 2, 2025 17:14

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Re-introduce: Fix `LaterGauge` metrics to collect from all servers #18791

Re-introduce: Fix `LaterGauge` metrics to collect from all servers #18791

Uh oh!

MadLittleMods commented Aug 7, 2025 •

edited

Loading

Uh oh!

Uh oh!

Uh oh!

Uh oh!

MadLittleMods Aug 8, 2025

Uh oh!

MadLittleMods Aug 8, 2025

Uh oh!

devonh left a comment

Uh oh!

Uh oh!

MadLittleMods Aug 12, 2025

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

MadLittleMods commented Sep 2, 2025

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants

		@@ -0,0 +1 @@
		Fix `LaterGauge` metrics to collect from all servers.

Re-introduce: Fix LaterGauge metrics to collect from all servers #18791

Re-introduce: Fix LaterGauge metrics to collect from all servers #18791

Uh oh!

Conversation

MadLittleMods commented Aug 7, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Testing strategy

Dev notes

Pull Request Checklist

Uh oh!

Uh oh!

Uh oh!

Uh oh!

MadLittleMods Aug 8, 2025

Choose a reason for hiding this comment

Uh oh!

MadLittleMods Aug 8, 2025

Choose a reason for hiding this comment

Uh oh!

devonh left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

MadLittleMods Aug 12, 2025

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

MadLittleMods commented Sep 2, 2025

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants

Re-introduce: Fix `LaterGauge` metrics to collect from all servers #18791

Re-introduce: Fix `LaterGauge` metrics to collect from all servers #18791

MadLittleMods commented Aug 7, 2025 •

edited

Loading