-
Notifications
You must be signed in to change notification settings - Fork 407
Re-introduce: Fix LaterGauge metrics to collect from all servers
#18791
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
Fix `LaterGauge` metrics to collect from all servers Follow-up to #18714 Previously, our `LaterGauge` metrics did include the `server_name` label as expected but we were only seeing the last server being reported in some cases. Any `LaterGauge` that we were creating multiple times was only reporting the last instance. This PR updates all `LaterGauge` to be created once and then we use `LaterGauge.register_hook(...)` to add in the metric callback as before. This works now because we store a list of callbacks instead of just one. I noticed this problem thanks to some [tests in the Synapse Pro for Small Hosts](element-hq/synapse-small-hosts#173) repo that sanity check all metrics to ensure that we can see each metric includes data from multiple servers. ### Testing strategy 1. This is only noticeable when you run multiple Synapse instances in the same process. 1. TODO (see test that was added) ### Dev notes Previous non-global `LaterGauge`: ``` synapse_federation_send_queue_xxx synapse_federation_transaction_queue_pending_destinations synapse_federation_transaction_queue_pending_pdus synapse_federation_transaction_queue_pending_edus synapse_handlers_presence_user_to_current_state_size synapse_handlers_presence_wheel_timer_size synapse_notifier_listeners synapse_notifier_rooms synapse_notifier_users synapse_replication_tcp_resource_total_connections synapse_replication_tcp_command_queue synapse_background_update_status synapse_federation_known_servers synapse_scheduler_running_tasks ``` ### Pull Request Checklist <!-- Please read https://element-hq.github.io/synapse/latest/development/contributing_guide.html before submitting your pull request --> * [x] Pull request is based on the develop branch * [x] Pull request includes a [changelog file](https://element-hq.github.io/synapse/latest/development/contributing_guide.html#changelog). The entry should: - Be a short description of your change which makes sense to users. "Fixed a bug that prevented receiving messages from other servers." instead of "Moved X method from `EventStore` to `EventWorkerStore`.". - Use markdown where necessary, mostly for `code blocks`. - End with either a period (.) or an exclamation mark (!). - Start with a capital letter. - Feel free to credit yourself, by adding a sentence "Contributed by @github_username." or "Contributed by [Your Name]." to the end of the entry. * [x] [Code style](https://element-hq.github.io/synapse/latest/code_style.html) is correct (run the [linters](https://element-hq.github.io/synapse/latest/development/contributing_guide.html#run-the-linters))
We use `instance_id` instead of `server_name` because it's possible to have multiple workers running in the same process with the same `server_name`.
Multiple `ReplicationCommandHandler` were getting created for the same HS
registering the hook twice
```
2025-08-07 18:38:14-0500 [-] synapse.replication.tcp.handler - 386 - ERROR - process-replication-data-11 - Failed to handle command <synapse.replication.tcp.commands.PositionCommand object at 0x7fb5fbe04320>
Traceback (most recent call last):
File "synapse/replication/tcp/handler.py", line 384, in _unsafe_process_queue
await self._process_command(cmd, conn, stream_name)
File "synapse/replication/tcp/handler.py", line 397, in _process_command
await self._process_position(stream_name, conn, cmd)
File "synapse/replication/tcp/handler.py", line 721, in _process_position
(updates, current_token, missing_updates) = await stream.get_updates_since(
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
cmd.instance_name, current_token, cmd.new_token
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
)
^
File "synapse/replication/tcp/streams/_base.py", line 213, in get_updates_since
updates, upto_token, limited = await self.update_function(
^^^^^^^^^^^^^^^^^^^^^^^^^^^
...<4 lines>...
)
^
File "synapse/replication/tcp/streams/_base.py", line 269, in update_function
result = await client(
^^^^^^^^^^^^^
...<4 lines>...
)
^
File "synapse/logging/opentracing.py", line 929, in _wrapper
return await func(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "synapse/replication/http/_base.py", line 232, in send_request
streams = hs.get_replication_command_handler().get_streams_to_replicate()
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~^^
File "synapse/server.py", line 232, in _get
dep = builder(self)
File "synapse/server.py", line 756, in get_replication_command_handler
return ReplicationCommandHandler(self)
File "synapse/replication/tcp/handler.py", line 258, in __init__
tcp_resource_total_connections_gauge.register_hook(
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~^
homeserver_instance_id=hs.get_instance_id(),
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
hook=lambda: {(self.server_name,): len(self._connections)},
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
)
^
File "synapse/metrics/__init__.py", line 225, in register_hook
assert existing_hook is None, (
^^^^^^^^^^^^^^^^^^^^^
AssertionError: LaterGauge(name=synapse_replication_tcp_resource_total_connections) hook already registered for homeserver_instance_id=gZBmA. This is likely a Synapse bug and you forgot to unregister the previous hooks for the server (especially in tests).
```
See #18791 (comment) This also fixes a long-standing problem where we would only track metrics for the last database listed.
| # Track the background update status for each database | ||
| background_update_status.register_hook( | ||
| homeserver_instance_id=hs.get_instance_id(), | ||
| hook=lambda: { | ||
| (database.name(), server_name): database.updates.get_status() | ||
| for database in self.databases | ||
| }, | ||
| ) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Moved this out of the DatabasePool class because we create multiple DatabasePool when multiple databases are specified in the config. Previously, this would cause us to call register_hook multiple times for the same server_name which we assert to avoid dumb mistakes of not using the hs.get_xxx singletons. See #18791 (comment) for more background and other possible alternatives.
This also fixes a long-standing problem where we would only track metrics for the last database listed because LaterGauge previously just clobbered any previous gauge (notice the database_name label that we've added now).
| def cleanup(self) -> None: | ||
| """ | ||
| WIP: Clean-up any references to the homeserver and stop any running related | ||
| processes, timers, loops, replication stream, etc. | ||
| """ | ||
| logger.info("Received cleanup request for %s.", self.hostname) | ||
|
|
||
| # TODO: Stop background processes, timers, loops, replication stream, etc. | ||
|
|
||
| # Cleanup metrics associated with the homeserver | ||
| for later_gauge in all_later_gauges_to_clean_up_on_shutdown.values(): | ||
| later_gauge.unregister_hooks_for_homeserver_instance_id( | ||
| self.get_instance_id() | ||
| ) | ||
|
|
||
| logger.info("Cleanup complete for %s.", self.hostname) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The major new part is this piece that cleans up metrics from each homeserver in the tests.
Previously, the list of hooks built up until our CI machines couldn't operate properly, see #18789
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
These changes look like they address the memory blowup issue that the previous iteration of this PR had.
Do you think we should have another set of eyes go over the changes here?
| @@ -0,0 +1 @@ | |||
| Fix `LaterGauge` metrics to collect from all servers. | |||
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do you think we should have another set of eyes go over the changes here?
Sounds like a good idea given the new cleanup pattern and you were the same one that reviewed the original PR.
Previously, we we're yielding `g` multiple times while repeatedly adding extra metrics to the same `g` instance. See #18791 (comment)
I think this was accidentally removed before. Testing strategy: ``` poetry run generate_workers_map --config-path homeserver.yaml ```
|
Thanks for the review @devonh and @erikjohnston 🐊 |
…18791) Re-introduce: element-hq/synapse#18751 that was reverted in element-hq/synapse#18789 (explains why the PR was reverted in the first place). - Adds a `cleanup` pattern that cleans up metrics from each homeserver in the tests. Previously, the list of hooks built up until our CI machines couldn't operate properly, see element-hq/synapse#18789 - Fix long-standing issue with `synapse_background_update_status` metrics only tracking the last database listed in the config (see element-hq/synapse#18791 (comment))
Re-introduce: #18751 that was reverted in #18789 (explains why the PR was reverted in the first place).
cleanuppattern that cleans up metrics from each homeserver in the tests. Previously, the list of hooks built up until our CI machines couldn't operate properly, see Revert "FixLaterGaugemetrics to collect from all servers (#18751)" #18789synapse_background_update_statusmetrics only tracking the last database listed in the config (see Re-introduce: FixLaterGaugemetrics to collect from all servers #18791 (comment))Part of #18592
Testing strategy
SYNAPSE_TEST_LOG_LEVEL=INFO poetry run trial testsSYNAPSE_POSTGRES=1 SYNAPSE_POSTGRES_USER=postgres SYNAPSE_TEST_LOG_LEVEL=INFO poetry run trial teststest_exposed_to_prometheustaking long.Dev notes
Pull Request Checklist
EventStoretoEventWorkerStore.".code blocks.