Revert "Fix LaterGauge metrics to collect from all servers (#18751)"
#18789
This PR reverts #18751
Why revert?
@reivilibre found that our CI was failing in bizarre ways (thanks for stepping up to dive into this 🙇). Examples:
- `twisted.internet.error.ProcessTerminated: A process has ended with a probable error condition: process ended by signal 9.`
- `twisted.internet.error.ProcessTerminated: A process has ended with a probable error condition: process ended by signal 15.`

More detailed part of the log:
https://github.com/element-hq/synapse/actions/runs/16758038107/job/47500520633#step:9:6809
With more debugging (thanks @devonh for also stepping in as maintainer), we found that CI was consistently failing at `test_exposed_to_prometheus`, which pointed towards the metrics changes that were merged recently.

Locally, although I wasn't able to reproduce the bizarre errors, I could easily see increased memory usage (~20GB vs ~2GB) and the `test_exposed_to_prometheus` test taking a long time to complete when running the full test suite (`SYNAPSE_TEST_LOG_LEVEL=INFO poetry run trial tests`).

After updating `test_exposed_to_prometheus` to dump `latest_metrics_response = generate_latest(REGISTRY)`, I could see that it's a massive 3.2GB response. Inspecting the contents, there are 4.1M (4,137,123) entries for just `synapse_background_update_status{server_name="test"} 3.0`, which is a `LaterGauge`. We don't have 4.1M test cases, so it's unclear why we end up with quite so many samples, but the duplicates themselves make sense: each `HomeserverTestCase` creates a homeserver per test case, and each homeserver calls `LaterGauge.register_hook(...)` (part of the #18751 changes). See `tests/storage/databases/main/test_metrics.py` and the dump sketch below.

After reverting the #18751 changes, running the full test suite locally doesn't result in memory spikes and seems to run normally.
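The metrics dump was done with something along these lines (a minimal sketch rather than the exact test change; it assumes the default `prometheus_client` `REGISTRY` and an arbitrary output path):

```python
from prometheus_client import REGISTRY, generate_latest

# Render the full Prometheus exposition output for the registry. In the
# failing runs this came back as a ~3.2GB bytes object.
latest_metrics_response = generate_latest(REGISTRY)

# Write it to disk so the size and contents can be inspected (e.g. with grep).
with open("/tmp/metrics_dump.prom", "wb") as f:
    f.write(latest_metrics_response)
```

To illustrate why per-test-case hook registration produces duplicate samples, here is a self-contained sketch of the pattern (a stand-in, not Synapse's actual `LaterGauge` implementation): a single registered collector whose hook list only ever grows, so every scrape emits one sample per hook.

```python
from typing import Callable, Dict, List, Tuple

from prometheus_client import CollectorRegistry, generate_latest
from prometheus_client.core import GaugeMetricFamily


class LaterGaugeSketch:
    """Illustrative stand-in for a LaterGauge-style collector."""

    def __init__(self, name: str, desc: str, labels: List[str]) -> None:
        self.name = name
        self.desc = desc
        self.labels = labels
        self.hooks: List[Callable[[], Dict[Tuple[str, ...], float]]] = []

    def register_hook(self, hook: Callable[[], Dict[Tuple[str, ...], float]]) -> None:
        # Nothing ever removes hooks, so they accumulate for the lifetime of
        # the (process-global) registry.
        self.hooks.append(hook)

    def collect(self):
        family = GaugeMetricFamily(self.name, self.desc, labels=self.labels)
        for hook in self.hooks:
            for label_values, value in hook().items():
                family.add_metric(list(label_values), value)
        yield family


registry = CollectorRegistry()
gauge = LaterGaugeSketch(
    "synapse_background_update_status", "Background update status", ["server_name"]
)
registry.register(gauge)

# If every homeserver created by a HomeserverTestCase registers another hook
# on the same gauge, each scrape emits one (duplicate) sample per hook.
for _ in range(3):
    gauge.register_hook(lambda: {("test",): 3.0})

print(generate_latest(registry).decode())
# -> synapse_background_update_status{server_name="test"} 3.0  (three times)
```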
Dev notes
Discussion in the #synapse-dev:matrix.org room.