Fix LaterGauge metrics to collect from all servers
#18751
Conversation
Same changelog as #18714 so they merge
changelog.d/18751.misc
```diff
@@ -0,0 +1 @@
+Refactor `LaterGauge` metrics to be homeserver-scoped.
```
Same changelog as #18714 so they merge
See #18751 (comment):

```
ValueError: Invalid metric name: synapse_federation_send_queue_QueueNames.PRESENCE_MAP_size
```
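For illustration, here is a minimal sketch (using plain `prometheus_client`, not Synapse's actual send-queue code) of how interpolating an enum member into a metric name triggers that error; the `QueueNames` enum here is a hypothetical stand-in:

```python
from enum import Enum

from prometheus_client import Gauge


class QueueNames(Enum):
    """Hypothetical stand-in for the federation send queue names."""

    PRESENCE_MAP = "presence_map"


queue = QueueNames.PRESENCE_MAP

# f"...{queue}..." renders the member as "QueueNames.PRESENCE_MAP", producing
# "synapse_federation_send_queue_QueueNames.PRESENCE_MAP_size", which fails
# prometheus_client's metric-name validation with the ValueError shown above.
# Using the member's string value yields a valid name instead:
Gauge(
    f"synapse_federation_send_queue_{queue.value}_size",
    "Size of the presence map in the federation send queue",
)
```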
…-tenant Conflicts: tests/metrics/test_metrics.py
```python
def _register(self) -> None:
    if self.name in all_gauges.keys():
        logger.warning("%s already registered, reregistering", self.name)
        REGISTRY.unregister(all_gauges.pop(self.name))

    REGISTRY.register(self)
    all_gauges[self.name] = self
```
Huh - that also solves it.
@erikjohnston it looks like you last touched this "reregistering" stuff back 7 years ago.
Do you know if it's safe to blow it away like this? Or is there a reason for it being here that I'm unaware of?
Logically, it makes sense to me that we shouldn't be reregistering the same metrics multiple times within Synapse. Anything that does must surely be a bug.
I think it's fine. I wonder if we should explicitly error if we try and reregister the same metric name then?
`REGISTRY.register(...)` will complain about duplicate metrics already ⏩
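As a minimal illustration (plain `prometheus_client` outside of Synapse; the metric name is made up), the default registry already rejects a second collector exposing the same metric name:

```python
from prometheus_client import Gauge

# The first Gauge registers itself on the default REGISTRY.
Gauge("synapse_example_duplicate_gauge", "First registration")

try:
    # A second collector with the same name is rejected, so a silent
    # "reregistering" escape hatch isn't needed to surface such bugs.
    Gauge("synapse_example_duplicate_gauge", "Second registration, same name")
except ValueError as exc:
    print(f"Duplicate registration rejected: {exc}")
```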
Based on our experience after merging this PR (see #18789 where this PR was reverted), I'm guessing this might have been added because we create many, many homeservers in our trial tests (`HomeserverTestCase`). All of these metrics/hooks can stack up significantly enough to cause havoc in CI.
Changelog can no longer merge with the other one
```diff
@@ -0,0 +1 @@
+Fix `LaterGauge` metrics to collect from all servers.
```
Given the changelog is already underway in https://github.com/element-hq/synapse/blob/release-v1.136/CHANGES.md, I'm going to leave this as-is as it will require manual tinkering anyway to pull into v1.136.0 and we don't need to re-run CI again.
The desire is for this changelog to merge with #18714
Sounds good to me. I'm already manually tinkering away in there
Thank you for the review and release wrangling @devonh 🦟
This reverts commit 076db0a.
#18789)

This PR reverts #18751

### Why revert?

@reivilibre [found](https://matrix.to/#/!vcyiEtMVHIhWXcJAfl:sw1v.org/$u9OEmMxaFYUzWHhCk1A_r50Y0aGrtKEhepF7WxWJkUA?via=matrix.org&via=node.marinchik.ink&via=element.io) that our CI was failing in bizarre ways (thanks for stepping up to dive into this 🙇). Examples:

- `twisted.internet.error.ProcessTerminated: A process has ended with a probable error condition: process ended by signal 9.`
- `twisted.internet.error.ProcessTerminated: A process has ended with a probable error condition: process ended by signal 15.`

<details>
<summary>More detailed part of the log</summary>

https://github.com/element-hq/synapse/actions/runs/16758038107/job/47500520633#step:9:6809

```
tests.util.test_wheel_timer.WheelTimerTestCase.test_single_insert_fetch
===============================================================================
Error: Traceback (most recent call last):
  File "/home/runner/.cache/pypoetry/virtualenvs/matrix-synapse-pswDeSvb-py3.9/lib/python3.9/site-packages/twisted/trial/_dist/disttrial.py", line 371, in task
    await worker.run(case, result)
  File "/home/runner/.cache/pypoetry/virtualenvs/matrix-synapse-pswDeSvb-py3.9/lib/python3.9/site-packages/twisted/trial/_dist/worker.py", line 305, in run
    return await self.callRemote(workercommands.Run, testCase=testCaseId)  # type: ignore[no-any-return]
  File "/home/runner/.cache/pypoetry/virtualenvs/matrix-synapse-pswDeSvb-py3.9/lib/python3.9/site-packages/twisted/internet/defer.py", line 1187, in __iter__
    yield self
  File "/home/runner/.cache/pypoetry/virtualenvs/matrix-synapse-pswDeSvb-py3.9/lib/python3.9/site-packages/twisted/internet/defer.py", line 1092, in _runCallbacks
    current.result = callback(  # type: ignore[misc]
  File "/home/runner/.cache/pypoetry/virtualenvs/matrix-synapse-pswDeSvb-py3.9/lib/python3.9/site-packages/twisted/protocols/amp.py", line 1968, in _massageError
    error.trap(RemoteAmpError)
  File "/home/runner/.cache/pypoetry/virtualenvs/matrix-synapse-pswDeSvb-py3.9/lib/python3.9/site-packages/twisted/python/failure.py", line 431, in trap
    self.raiseException()
  File "/home/runner/.cache/pypoetry/virtualenvs/matrix-synapse-pswDeSvb-py3.9/lib/python3.9/site-packages/twisted/python/failure.py", line 455, in raiseException
    raise self.value.with_traceback(self.tb)
twisted.internet.error.ProcessTerminated: A process has ended with a probable error condition: process ended by signal 9.

tests.util.test_macaroons.MacaroonGeneratorTestCase.test_guest_access_token
-------------------------------------------------------------------------------
Ran 4325 tests in 669.321s

FAILED (skips=159, errors=62, successes=4108)
while calling from thread
Traceback (most recent call last):
  File "/home/runner/.cache/pypoetry/virtualenvs/matrix-synapse-pswDeSvb-py3.9/lib/python3.9/site-packages/twisted/internet/base.py", line 1064, in runUntilCurrent
    f(*a, **kw)
  File "/home/runner/.cache/pypoetry/virtualenvs/matrix-synapse-pswDeSvb-py3.9/lib/python3.9/site-packages/twisted/internet/base.py", line 790, in stop
    raise error.ReactorNotRunning("Can't stop reactor that isn't running.")
twisted.internet.error.ReactorNotRunning: Can't stop reactor that isn't running.

joining disttrial worker #0 failed
Traceback (most recent call last):
  File "/home/runner/.cache/pypoetry/virtualenvs/matrix-synapse-pswDeSvb-py3.9/lib/python3.9/site-packages/twisted/internet/defer.py", line 1853, in _inlineCallbacks
    result = context.run(
  File "/home/runner/.cache/pypoetry/virtualenvs/matrix-synapse-pswDeSvb-py3.9/lib/python3.9/site-packages/twisted/python/failure.py", line 467, in throwExceptionIntoGenerator
    return g.throw(self.value.with_traceback(self.tb))
  File "/home/runner/.cache/pypoetry/virtualenvs/matrix-synapse-pswDeSvb-py3.9/lib/python3.9/site-packages/twisted/trial/_dist/worker.py", line 406, in exit
    await endDeferred
  File "/home/runner/.cache/pypoetry/virtualenvs/matrix-synapse-pswDeSvb-py3.9/lib/python3.9/site-packages/twisted/internet/defer.py", line 1187, in __iter__
    yield self
twisted.internet.error.ProcessTerminated: A process has ended with a probable error condition: process ended by signal 15.
```

</details>

With more debugging (thanks @devonh for also stepping in as maintainer), we found that CI was consistently failing at `test_exposed_to_prometheus`, which was a bit of smoke because of all of the [metrics changes](#18592) that were merged recently.

Locally, although I wasn't able to reproduce the bizarre errors, I could easily see increased memory usage (~20GB vs ~2GB) and the `test_exposed_to_prometheus` test taking a while to complete when running a full test run (`SYNAPSE_TEST_LOG_LEVEL=INFO poetry run trial tests`).

<img width="1485" height="78" alt="Lots of memory usage" src="https://github.com/user-attachments/assets/811e2a96-75e5-4a3c-966c-00dc0512cea9" />

After updating `test_exposed_to_prometheus` to dump the `latest_metrics_response = generate_latest(REGISTRY)`, I could see that it's a massive 3.2GB response. Inspecting the contents, we can see 4.1M (4,137,123) entries for just `synapse_background_update_status{server_name="test"} 3.0`, which is a `LaterGauge`. I don't think we have 4.1M test cases, so it's unclear why we end up with so many samples, but it does make sense that we see a lot of duplicates, because each `HomeserverTestCase` will create a homeserver for each test case that will `LaterGauge.register_hook(...)` (part of the #18751 changes).

`tests/storage/databases/main/test_metrics.py`

```python
latest_metrics_response = generate_latest(REGISTRY)
with open("/tmp/synapse-test-metrics", "wb") as f:
    f.write(latest_metrics_response)
```

After reverting the #18751 changes, running the full test suite locally doesn't result in memory spikes and seems to run normally.

### Dev notes

Discussion in the [`#synapse-dev:matrix.org`](https://matrix.to/#/!vcyiEtMVHIhWXcJAfl:sw1v.org/$vkMATs04yqZggVVd6Noop5nU8M2DVoTkrAWshw7u1-w?via=matrix.org&via=node.marinchik.ink&via=element.io) room.

### Pull Request Checklist

<!-- Please read https://element-hq.github.io/synapse/latest/development/contributing_guide.html before submitting your pull request -->

* [x] Pull request is based on the develop branch
* [ ] Pull request includes a [changelog file](https://element-hq.github.io/synapse/latest/development/contributing_guide.html#changelog). The entry should:
  - Be a short description of your change which makes sense to users. "Fixed a bug that prevented receiving messages from other servers." instead of "Moved X method from `EventStore` to `EventWorkerStore`.".
  - Use markdown where necessary, mostly for `code blocks`.
  - End with either a period (.) or an exclamation mark (!).
  - Start with a capital letter.
  - Feel free to credit yourself, by adding a sentence "Contributed by @github_username." or "Contributed by [Your Name]." to the end of the entry.
* [ ] [Code style](https://element-hq.github.io/synapse/latest/code_style.html) is correct (run the [linters](https://element-hq.github.io/synapse/latest/development/contributing_guide.html#run-the-linters))
Fix `LaterGauge` metrics to collect from all servers

Follow-up to #18714

Previously, our `LaterGauge` metrics did include the `server_name` label as expected, but we were only seeing the last server being reported in some cases. Any `LaterGauge` that we were creating multiple times was only reporting the last instance.

This PR updates all `LaterGauge`s to be created once, and then we use `LaterGauge.register_hook(...)` to add in the metric callback as before. This works now because we store a list of callbacks instead of just one.

I noticed this problem thanks to some [tests in the Synapse Pro for Small Hosts](element-hq/synapse-small-hosts#173) repo that sanity check all metrics to ensure that each metric includes data from multiple servers.

### Testing strategy

1. This is only noticeable when you run multiple Synapse instances in the same process.
1. TODO (see test that was added)

### Dev notes

Previous non-global `LaterGauge`s:

```
synapse_federation_send_queue_xxx
synapse_federation_transaction_queue_pending_destinations
synapse_federation_transaction_queue_pending_pdus
synapse_federation_transaction_queue_pending_edus
synapse_handlers_presence_user_to_current_state_size
synapse_handlers_presence_wheel_timer_size
synapse_notifier_listeners
synapse_notifier_rooms
synapse_notifier_users
synapse_replication_tcp_resource_total_connections
synapse_replication_tcp_command_queue
synapse_background_update_status
synapse_federation_known_servers
synapse_scheduler_running_tasks
```

### Pull Request Checklist

<!-- Please read https://element-hq.github.io/synapse/latest/development/contributing_guide.html before submitting your pull request -->

* [x] Pull request is based on the develop branch
* [x] Pull request includes a [changelog file](https://element-hq.github.io/synapse/latest/development/contributing_guide.html#changelog). The entry should:
  - Be a short description of your change which makes sense to users. "Fixed a bug that prevented receiving messages from other servers." instead of "Moved X method from `EventStore` to `EventWorkerStore`.".
  - Use markdown where necessary, mostly for `code blocks`.
  - End with either a period (.) or an exclamation mark (!).
  - Start with a capital letter.
  - Feel free to credit yourself, by adding a sentence "Contributed by @github_username." or "Contributed by [Your Name]." to the end of the entry.
* [x] [Code style](https://element-hq.github.io/synapse/latest/code_style.html) is correct (run the [linters](https://element-hq.github.io/synapse/latest/development/contributing_guide.html#run-the-linters))
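For readers unfamiliar with the pattern, here is a minimal sketch of the "create the gauge once, register one hook per homeserver" idea described above. It is not Synapse's actual `LaterGauge` implementation; the class name, metric name, and server names are illustrative only:

```python
from typing import Callable, List, Tuple

from prometheus_client import REGISTRY
from prometheus_client.core import GaugeMetricFamily


class LaterGaugeSketch:
    """Sketch: one collector created once, with one hook per homeserver.

    Each hook returns the gauge value for its server; collect() reports all of
    them, so every server shows up instead of only the last one registered.
    """

    def __init__(self, name: str, desc: str, label: str = "server_name") -> None:
        self.name = name
        self.desc = desc
        self.label = label
        self._hooks: List[Tuple[str, Callable[[], float]]] = []
        REGISTRY.register(self)

    def register_hook(self, server_name: str, hook: Callable[[], float]) -> None:
        # Called once per homeserver; hooks accumulate instead of replacing
        # each other, which is the crux of the fix described above.
        self._hooks.append((server_name, hook))

    def collect(self):
        family = GaugeMetricFamily(self.name, self.desc, labels=[self.label])
        for server_name, hook in self._hooks:
            family.add_metric([server_name], hook())
        yield family


# The gauge is created once at module scope...
pending_pdus = LaterGaugeSketch(
    "synapse_example_pending_pdus", "Pending PDUs per server (sketch)"
)
# ...and each homeserver registers its own callback when it starts up.
pending_pdus.register_hook("hs1.example.com", lambda: 3.0)
pending_pdus.register_hook("hs2.example.com", lambda: 7.0)
```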
…18791)

Re-introduce: #18751 that was reverted in #18789 (explains why the PR was reverted in the first place).

- Adds a `cleanup` pattern that cleans up metrics from each homeserver in the tests. Previously, the list of hooks built up until our CI machines couldn't operate properly, see #18789
- Fix long-standing issue with `synapse_background_update_status` metrics only tracking the last database listed in the config (see #18791 (comment))
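As a rough sketch of what such a test cleanup pattern can look like (the `HookRegistry` class and `unregister_hooks_for_homeserver(...)` helper here are hypothetical stand-ins, not the actual helpers added in #18791):

```python
import unittest
from typing import Callable, Dict


class HookRegistry:
    """Hypothetical stand-in for a `LaterGauge`'s per-server hooks (sketch only)."""

    def __init__(self) -> None:
        self.hooks: Dict[str, Callable[[], float]] = {}

    def register_hook(self, server_name: str, hook: Callable[[], float]) -> None:
        self.hooks[server_name] = hook

    def unregister_hooks_for_homeserver(self, server_name: str) -> None:
        self.hooks.pop(server_name, None)


some_later_gauge = HookRegistry()


class MetricsCleanupExampleTestCase(unittest.TestCase):
    """Drops this test's hooks on teardown so they don't accumulate across
    every homeserver the suite creates (the failure mode behind #18789)."""

    def setUp(self) -> None:
        self.server_name = f"test-{self.id()}"
        some_later_gauge.register_hook(self.server_name, lambda: 1.0)
        # Without this cleanup, hooks pile up for every homeserver created by
        # every test case, eventually overwhelming CI.
        self.addCleanup(
            some_later_gauge.unregister_hooks_for_homeserver, self.server_name
        )

    def test_hook_registered(self) -> None:
        self.assertIn(self.server_name, some_later_gauge.hooks)
```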