Remove metrics listener type in favor of http listener with metrics resource
#18584
Conversation
@@ -0,0 +1 @@
Remove `metrics` listener (serves metrics at `/metrics`) type in favor of `http` listener with `metrics` resource (serves metrics at `/_synapse/metrics`).
Opinions on calling out the upgrade notes from the changelog entry itself? It looks like we usually prefer to call it out at the top of the changelog section for the version. It doesn't seem like it would hurt to have it in both places. The only problem is the chicken-and-egg problem of the upgrade notes being written later in order to have the link filled in.
Suggested change:
- Remove `metrics` listener (serves metrics at `/metrics`) type in favor of `http` listener with `metrics` resource (serves metrics at `/_synapse/metrics`).
+ Remove `metrics` listener (serves metrics at `/metrics`) type in favor of `http` listener with `metrics` resource (serves metrics at `/_synapse/metrics`). Please check the [relevant section in the upgrade notes](TODO).
# `prometheus_client.metrics` was added in 0.5.0, so we require that too.
# We chose 0.6.0 as that is the current version in Debian Buster (oldstable).
prometheus-client = ">=0.6.0"
The only change is here.
The rest is auto-formatting which I can remove from the diff if prompted.
Deleted this file in favor of re-organizing the remaining pieces into files with more relevant names.
class PrometheusMetricsHackTestCase(unittest.HomeserverTestCase):
    if parse_version(metadata.version("prometheus_client")) < parse_version("0.14.0"):
        skip = "prometheus-client too old"

    def test_created_metrics_disabled(self) -> None:
        """
        Tests that a brittle hack, to disable `_created` metrics, works.
        This involves poking at the internals of prometheus-client.
        It's not the end of the world if this doesn't work.
        This test gives us a way to notice if prometheus-client changes
        their internals.
        """
Removed this test as it claims to "give us a way to notice if prometheus-client changes their internals", but we already have that benefit from the hack function `_set_prometheus_client_use_created_metrics` itself (it will yell if the `_use_created` attribute is no longer used).
The test provides no further benefit besides asserting that the attribute was set, which feels pretty useless. A proper test would ensure that before running the utility we see the legacy `_created` metrics, and after we don't.
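A sketch of that kind of check (purely illustrative, not code from this PR; it uses a throwaway `CollectorRegistry` and a made-up counter name, and assumes the `_created` series have already been disabled the way Synapse does at startup):

```python
from prometheus_client import CollectorRegistry, Counter, generate_latest


def check_no_created_series() -> None:
    # Register a throwaway counter in an isolated registry and render the
    # exposition text, the same way a Prometheus scrape would see it.
    registry = CollectorRegistry()
    Counter("example_total", "An example counter", registry=registry).inc()
    output = generate_latest(registry).decode("utf-8")

    # With the `_created` series disabled, no `example_created` sample should
    # appear in the output; with them enabled, it would.
    assert "example_created" not in output
```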
from prometheus_client import generate_latest

from synapse.metrics import REGISTRY
Stray thing that I noticed: making it consistent that we pull `generate_latest` from `synapse.metrics`, since it's something we're explicitly exporting.
synapse/synapse/metrics/__init__.py
Lines 477 to 480 in 3cabaa8
__all__ = [
    "Collector",
    "MetricsResource",
    "generate_latest",
@@ -0,0 +1 @@
Remove `metrics` listener (serves metrics at `/metrics`) type in favor of `http` listener with `metrics` resource (serves metrics at `/_synapse/metrics`).
It's unclear to me why metrics was its own listener type before. The only difference I can see is that they are served on different paths: /metrics vs /_synapse/metrics. Feels like we just need to include this in the upgrade notes (drafted in the PR description).
One thing that @erikjohnston pointed out is that the dedicated metrics listener uses its own thread and doesn't rely on the Twisted reactor, so it still works regardless of how well Synapse is working. It's unclear if we really care about this benefit. It does have a theoretical benefit of being able to monitor a really bogged-down Synapse.
Looking at the git history, the threading aspect just seems like the way it was originally implemented and not something explicitly called out for the benefit described above.
Even with the updated conclusion from #18592 (comment), which most likely means we will continue to use the global REGISTRY, I think it's still worth distilling down to a single way to set up metrics, both for config clarity and to keep a single source of truth in the codebase.
(assuming we can't come up with a benefit/compelling reason for why we want both)
Having thought about it a bit: I think it is worth keeping the separate metrics listener. I can't find where (and I looked), but I believe we have ported stuff to use the separate metrics listener in the past to get metrics from the process when it's under extreme load (to the extent where Twisted can't process new requests).
Even if we keep the separate listener, I think we should still make them render from the same source and remove any differences. If needed, it's easy enough to change the metrics listener to not use the inbuilt prometheus one, but instead replace it by spawning a thread and starting a basic web server that responds to /metrics.
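For what it's worth, a minimal sketch of that fallback, assuming the stock `prometheus_client.start_http_server` helper (the port is arbitrary and this isn't code from the PR):

```python
from prometheus_client import start_http_server

from synapse.metrics import REGISTRY

# Serve the shared registry from prometheus-client's small threaded HTTP
# server; scrapes keep working even when the Twisted reactor is bogged down.
start_http_server(9323, registry=REGISTRY)
```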
> separate `metrics` listener in the past to get metrics from the process when it's under extreme load (to the extent where Twisted can't process new requests)
How sure are we that we care about this? Feels like a use case that we're just holding onto, especially when we can't find any evidence.
From what I'm reading, `threading.Thread` (which `prometheus_client.start_http_server_prometheus(...)` uses) shares the same process and memory space, so it doesn't even help us in the scenarios where Synapse is under extreme load.
[...] the threading module operates within a single process, meaning that all threads share the same memory space. However, the GIL limits the performance gains of threading when it comes to CPU-bound tasks, as only one thread can execute Python bytecode at a time.
-- https://docs.python.org/3/library/threading.html#gil-and-performance-considerations
> Even if we keep the separate listener, I think we should still make them render from the same source and remove any differences.
I don't see an easy way forward, as I don't know how to start another Twisted server in another thread in order to share the `MetricsResource`. It's a pretty simple resource, so trying to abstract it any further isn't useful and we might as well keep the status quo of disparate handling 🤷
I can pull out some of the clean-up from this PR to ship separately.
We relatively often have episodes on matrix.org where the reactor tick times are in the multiples of seconds or tens of seconds — this seems like a case where the separate metrics thread is a winner to me, so that metric exposition continues to work in this situation.
The reason why this helps, despite the GIL, is because the metrics thread can preempt the reactor thread between bytecode boundaries and the metrics thread gets scheduled with roughly equal priority to the reactor thread, whereas (afaik) there isn't an easy mechanism to do something similar between separate twisted tasks/deferreds on the same reactor (and deferreds are co-operatively scheduled so we can't actually force one to yield to another).
For illustration: if you want to disrupt metrics reporting for the reactor version, all you need to do is time.sleep(5), or open a file on a slow disk, or run a computationally-heavy loop without any yield points.
The more realistic case is likely just having lots of tasks in the reactor queue. In this case, the mental model is that the metrics request is scheduled with roughly equal priority to all the user requests (which, in some dodgy scenarios, there could be thousands of).
Thanks for the extra details @reivilibre 🙂
That proves well enough that this actually works how we want and we've already expressed interest in wanting the benefit ⏩
Here is some prior art that I missed earlier that confirms why we care about having this:
Lines 41 to 45 in fc10a5e
The second method runs the metrics server on a different port, in a
different thread to Synapse. This can make it more resilient to
heavy load meaning metrics cannot be retrieved, and can be exposed
to just internal networks easier. The served metrics are available
over HTTP only, and will be available at `/_synapse/metrics`.
And traced it back to the PR that introduced this language, matrix-org/synapse#3274, which also states:
This should allow us to be resistant to Twisted melting down meaning we don't get metrics, in theory.
When I originally traced back to where the metrics listener type was introduced, I thought it was matrix-org/synapse#5636 (which doesn't have the context) but it actually goes further back to that PR.
In addition to the docs we already have, I've also added the context and @reivilibre's explanation for why it works to some comments in the code itself, see #18687
Clean up `MetricsResource`, Prometheus hacks (`_set_prometheus_client_use_created_metrics`), and better document why we care about having a separate `metrics` listener type. These clean-up changes have been split out from #18584 since that PR was closed.
Remove `metrics` listener type in favor of `http` listener with `metrics` resource (which already exists).

This is spawning from wanting to refactor which registry the metrics are pulled from and noticing two sources of truth. The refactor itself is spawning from running multiple Synapse instances in the same Python process and wanting separate metrics (see #18592).
This will require homeserver admins to update their homeserver configs in some cases. I've drafted upgrade notes for the release below.
Split out from #18443
Part of #18592
Why did we have a separate `metrics` listener type? See discussion below: #18584 (comment)
Draft upgrade notes
Drafted to make it easy to add to `docs/upgrade.md` for whoever is making the release.

Upgrade notes
Upgrading to vUPCOMING
Dedicated `metrics` listener removed in favor of `metrics` resource in the `http` listener

The dedicated `type: metrics` listener has been removed. Use the `metrics` resource in a `type: http` listener instead.

Example before (`homeserver.yaml`):
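A sketch of the old style (port and bind address here are placeholders, not values from this PR):

```yaml
listeners:
  # Dedicated listener type; metrics are served at /metrics
  - port: 9323
    type: metrics
    bind_addresses: ["127.0.0.1"]
```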
Example after (`homeserver.yaml`):
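A sketch of the replacement (again with placeholder port/address; the relevant parts are `type: http` and the `metrics` resource name):

```yaml
listeners:
  # http listener with the metrics resource; metrics are served at /_synapse/metrics
  - port: 9322
    type: http
    bind_addresses: ["127.0.0.1"]
    resources:
      - names: [metrics]
```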
Note: the endpoint path changes from `/metrics` to `/_synapse/metrics`; update your metrics scraper (e.g., Prometheus) accordingly (an example scrape job is sketched below).

See the Metrics How-to page for more information on how to set up and configure metrics.
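For example, a Prometheus scrape job would change along these lines (job name and target are placeholders):

```yaml
scrape_configs:
  - job_name: "synapse"
    # The default path of /metrics no longer works; point the scrape at the new path.
    metrics_path: "/_synapse/metrics"
    static_configs:
      - targets: ["localhost:9322"]
```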
Testing strategy
See behavior of the previous `metrics` listener:
1. Add a `metrics` listener in your `homeserver.yaml`
2. Run `poetry run synapse_homeserver --config-path homeserver.yaml`
3. Fetch `http://localhost:9323/metrics`

See behavior of the `http` listener with the `metrics` resource:
1. Add the `metrics` resource to a new or existing `http` listener in your `homeserver.yaml`
2. Run `poetry run synapse_homeserver --config-path homeserver.yaml`
3. Fetch `http://localhost:9322/_synapse/metrics` (it's just a `GET` request so you can even do it in the browser)

Compared to `develop`: there is no difference in the response from this PR to `develop`, or from this resource to the `metrics` listener being removed.

Dev notes
- `metrics` listener uses `listen_metrics`
- `http` listener with `metrics` resource uses `MetricsResource`

Related PRs:
`_set_prometheus_client_use_created_metrics`

`_set_prometheus_client_use_created_metrics()` was added in matrix-org/synapse#13540 (2022-08-16). But now this functionality has dedicated functions in `prometheus_client` via `disable_created_metrics()`/`enable_created_metrics()`, which were added in prometheus/client_python#973 (2023-10-26) and shipped in a subsequent `prometheus_client` release (2023-10-30).

Our deprecation policy states that we support whatever is provided by Debian oldstable, which for `python-prometheus-client` is currently `0.6.0-1` (too old).
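If/when the minimum version allows it, the hack could presumably be replaced by the upstream helpers along these lines (a sketch; the import location is an assumption and may differ between prometheus-client versions):

```python
# Upstream replacement for the hack (import path assumed to be the top-level
# prometheus_client package in recent releases).
from prometheus_client import disable_created_metrics

disable_created_metrics()  # stop emitting the legacy `*_created` series
```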
Todo

- `_set_prometheus_client_use_created_metrics`