Skip to content

Conversation

@MadLittleMods
Copy link
Contributor

@MadLittleMods MadLittleMods commented Aug 6, 2025

This PR reverts #18751

Why revert?

@reivilibre found that our CI was failing in bizarre ways (thanks for stepping up to dive into this 🙇). Examples:

  • twisted.internet.error.ProcessTerminated: A process has ended with a probable error condition: process ended by signal 9.
  • twisted.internet.error.ProcessTerminated: A process has ended with a probable error condition: process ended by signal 15.
More detailed part of the log

https://github.com/element-hq/synapse/actions/runs/16758038107/job/47500520633#step:9:6809

tests.util.test_wheel_timer.WheelTimerTestCase.test_single_insert_fetch
===============================================================================
Error: 
Traceback (most recent call last):
  File "/home/runner/.cache/pypoetry/virtualenvs/matrix-synapse-pswDeSvb-py3.9/lib/python3.9/site-packages/twisted/trial/_dist/disttrial.py", line 371, in task
    await worker.run(case, result)
  File "/home/runner/.cache/pypoetry/virtualenvs/matrix-synapse-pswDeSvb-py3.9/lib/python3.9/site-packages/twisted/trial/_dist/worker.py", line 305, in run
    return await self.callRemote(workercommands.Run, testCase=testCaseId)  # type: ignore[no-any-return]
  File "/home/runner/.cache/pypoetry/virtualenvs/matrix-synapse-pswDeSvb-py3.9/lib/python3.9/site-packages/twisted/internet/defer.py", line 1187, in __iter__
    yield self
  File "/home/runner/.cache/pypoetry/virtualenvs/matrix-synapse-pswDeSvb-py3.9/lib/python3.9/site-packages/twisted/internet/defer.py", line 1092, in _runCallbacks
    current.result = callback(  # type: ignore[misc]
  File "/home/runner/.cache/pypoetry/virtualenvs/matrix-synapse-pswDeSvb-py3.9/lib/python3.9/site-packages/twisted/protocols/amp.py", line 1968, in _massageError
    error.trap(RemoteAmpError)
  File "/home/runner/.cache/pypoetry/virtualenvs/matrix-synapse-pswDeSvb-py3.9/lib/python3.9/site-packages/twisted/python/failure.py", line 431, in trap
    self.raiseException()
  File "/home/runner/.cache/pypoetry/virtualenvs/matrix-synapse-pswDeSvb-py3.9/lib/python3.9/site-packages/twisted/python/failure.py", line 455, in raiseException
    raise self.value.with_traceback(self.tb)
twisted.internet.error.ProcessTerminated: A process has ended with a probable error condition: process ended by signal 9.

tests.util.test_macaroons.MacaroonGeneratorTestCase.test_guest_access_token
-------------------------------------------------------------------------------
Ran 4325 tests in 669.321s

FAILED (skips=159, errors=62, successes=4108)
while calling from thread
Traceback (most recent call last):
  File "/home/runner/.cache/pypoetry/virtualenvs/matrix-synapse-pswDeSvb-py3.9/lib/python3.9/site-packages/twisted/internet/base.py", line 1064, in runUntilCurrent
    f(*a, **kw)
  File "/home/runner/.cache/pypoetry/virtualenvs/matrix-synapse-pswDeSvb-py3.9/lib/python3.9/site-packages/twisted/internet/base.py", line 790, in stop
    raise error.ReactorNotRunning("Can't stop reactor that isn't running.")
twisted.internet.error.ReactorNotRunning: Can't stop reactor that isn't running.

joining disttrial worker #0 failed
Traceback (most recent call last):
  File "/home/runner/.cache/pypoetry/virtualenvs/matrix-synapse-pswDeSvb-py3.9/lib/python3.9/site-packages/twisted/internet/defer.py", line 1853, in _inlineCallbacks
    result = context.run(
  File "/home/runner/.cache/pypoetry/virtualenvs/matrix-synapse-pswDeSvb-py3.9/lib/python3.9/site-packages/twisted/python/failure.py", line 467, in throwExceptionIntoGenerator
    return g.throw(self.value.with_traceback(self.tb))
  File "/home/runner/.cache/pypoetry/virtualenvs/matrix-synapse-pswDeSvb-py3.9/lib/python3.9/site-packages/twisted/trial/_dist/worker.py", line 406, in exit
    await endDeferred
  File "/home/runner/.cache/pypoetry/virtualenvs/matrix-synapse-pswDeSvb-py3.9/lib/python3.9/site-packages/twisted/internet/defer.py", line 1187, in __iter__
    yield self
twisted.internet.error.ProcessTerminated: A process has ended with a probable error condition: process ended by signal 15.

With more debugging (thanks @devonh for also stepping in as maintainer), we were finding that the CI was consistently failing at test_exposed_to_prometheus which was a bit of smoke because of all of the metrics changes that were merged recently.

Locally, although I wasn't able to reproduce the bizarre errors, I could easily see increased memory usage (~20GB vs ~2GB) and the test_exposed_to_prometheus test taking a while to complete when running a full test run (SYNAPSE_TEST_LOG_LEVEL=INFO poetry run trial tests).

Lots of memory usage

After updating test_exposed_to_prometheus to dump the latest_metrics_response = generate_latest(REGISTRY), I could see that it's a massive 3.2GB response. Inspecting the contents, we can see 4.1M (4,137,123) entries for just synapse_background_update_status{server_name="test"} 3.0 which is a LaterGauge. I don't think we have 4.1M test cases so it's also unclear why we end up with so many samples but it does make sense that we do see a lot of duplicates because each HomeserverTestCase will create a homeserver for each test case that will call LaterGauge.register_hook(...) (part of the #18751 changes).

tests/storage/databases/main/test_metrics.py (simple code for dumping the metrics response)

        latest_metrics_response = generate_latest(REGISTRY)
        with open("/tmp/synapse-test-metrics", "wb") as f:
            f.write(latest_metrics_response)

After reverting the #18751 changes, running the full test suite locally doesn't result in memory spikes and seems to run normally.

Dev notes

Discussion in the #synapse-dev:matrix.org room.

Pull Request Checklist

  • Pull request is based on the develop branch
  • Pull request includes a changelog file. The entry should:
    • Be a short description of your change which makes sense to users. "Fixed a bug that prevented receiving messages from other servers." instead of "Moved X method from EventStore to EventWorkerStore.".
    • Use markdown where necessary, mostly for code blocks.
    • End with either a period (.) or an exclamation mark (!).
    • Start with a capital letter.
    • Feel free to credit yourself, by adding a sentence "Contributed by @github_username." or "Contributed by [Your Name]." to the end of the entry.
  • Code style is correct (run the linters)

@MadLittleMods MadLittleMods marked this pull request as ready for review August 6, 2025 22:10
@MadLittleMods MadLittleMods requested a review from a team as a code owner August 6, 2025 22:10
@devonh devonh merged commit ff03a51 into develop Aug 6, 2025
33 of 36 checks passed
@devonh devonh deleted the madlittlemods/revert-18751 branch August 6, 2025 22:14
@devonh devonh mentioned this pull request Aug 6, 2025
MadLittleMods added a commit that referenced this pull request Sep 2, 2025
…18791)

Re-introduce: #18751 that was
reverted in #18789 (explains
why the PR was reverted in the first place).

- Adds a `cleanup` pattern that cleans up metrics from each homeserver
in the tests. Previously, the list of hooks built up until our CI
machines couldn't operate properly, see
#18789
- Fix long-standing issue with `synapse_background_update_status`
metrics only tracking the last database listed in the config (see
#18791 (comment))
itsoyou pushed a commit to famedly/synapse that referenced this pull request Oct 13, 2025
…18791)

Re-introduce: element-hq/synapse#18751 that was
reverted in element-hq/synapse#18789 (explains
why the PR was reverted in the first place).

- Adds a `cleanup` pattern that cleans up metrics from each homeserver
in the tests. Previously, the list of hooks built up until our CI
machines couldn't operate properly, see
element-hq/synapse#18789
- Fix long-standing issue with `synapse_background_update_status`
metrics only tracking the last database listed in the config (see
element-hq/synapse#18791 (comment))
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants