Revert "Fix LaterGauge metrics to collect from all servers (#18751)"
#18789
This PR reverts #18751
Why revert?
@reivilibre found that our CI was failing in bizarre ways (thanks for stepping up to dive into this 🙇). Examples:
- `twisted.internet.error.ProcessTerminated: A process has ended with a probable error condition: process ended by signal 9.`
- `twisted.internet.error.ProcessTerminated: A process has ended with a probable error condition: process ended by signal 15.`

More detailed part of the log:
https://github.com/element-hq/synapse/actions/runs/16758038107/job/47500520633#step:9:6809
With more debugging (thanks @devonh for also stepping in as maintainer), we found that CI was consistently failing at `test_exposed_to_prometheus`, which pointed towards the metrics changes that were merged recently.

Locally, although I wasn't able to reproduce the bizarre errors, I could easily see increased memory usage (~20GB vs ~2GB) and the `test_exposed_to_prometheus` test taking a long time to complete when running the full test suite (`SYNAPSE_TEST_LOG_LEVEL=INFO poetry run trial tests`).

After updating `test_exposed_to_prometheus` to dump `latest_metrics_response = generate_latest(REGISTRY)`, I could see that it's a massive 3.2GB response. Inspecting the contents, there are 4.1M (4,137,123) entries for just `synapse_background_update_status{server_name="test"} 3.0`, which is a `LaterGauge`. We don't have 4.1M test cases, so it's unclear why we end up with quite so many samples, but the duplicates themselves make sense: each `HomeserverTestCase` creates a homeserver per test case, and each homeserver calls `LaterGauge.register_hook(...)` (part of the #18751 changes). See `tests/storage/databases/main/test_metrics.py` and the dump sketch below.

After reverting the #18751 changes, running the full test suite locally doesn't result in memory spikes and seems to run normally.
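The metrics dump was done with something along these lines (a minimal sketch rather than the exact test change; it assumes the default `prometheus_client` `REGISTRY` and an arbitrary output path):

```python
from prometheus_client import REGISTRY, generate_latest

# Render the full Prometheus exposition output for the registry. In the
# failing runs this came back as a ~3.2GB bytes object.
latest_metrics_response = generate_latest(REGISTRY)

# Write it to disk so the size and contents can be inspected (e.g. with grep).
with open("/tmp/metrics_dump.prom", "wb") as f:
    f.write(latest_metrics_response)
```

To illustrate why per-test-case hook registration produces duplicate samples, here is a self-contained sketch of the pattern (a stand-in, not Synapse's actual `LaterGauge` implementation): a single registered collector whose hook list only ever grows, so every scrape emits one sample per hook.

```python
from typing import Callable, Dict, List, Tuple

from prometheus_client import CollectorRegistry, generate_latest
from prometheus_client.core import GaugeMetricFamily


class LaterGaugeSketch:
    """Illustrative stand-in for a LaterGauge-style collector."""

    def __init__(self, name: str, desc: str, labels: List[str]) -> None:
        self.name = name
        self.desc = desc
        self.labels = labels
        self.hooks: List[Callable[[], Dict[Tuple[str, ...], float]]] = []

    def register_hook(self, hook: Callable[[], Dict[Tuple[str, ...], float]]) -> None:
        # Nothing ever removes hooks, so they accumulate for the lifetime of
        # the (process-global) registry.
        self.hooks.append(hook)

    def collect(self):
        family = GaugeMetricFamily(self.name, self.desc, labels=self.labels)
        for hook in self.hooks:
            for label_values, value in hook().items():
                family.add_metric(list(label_values), value)
        yield family


registry = CollectorRegistry()
gauge = LaterGaugeSketch(
    "synapse_background_update_status", "Background update status", ["server_name"]
)
registry.register(gauge)

# If every homeserver created by a HomeserverTestCase registers another hook
# on the same gauge, each scrape emits one (duplicate) sample per hook.
for _ in range(3):
    gauge.register_hook(lambda: {("test",): 3.0})

print(generate_latest(registry).decode())
# -> synapse_background_update_status{server_name="test"} 3.0  (three times)
```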
Dev notes
Discussion in the #synapse-dev:matrix.org room.