Conversation

@MadLittleMods (Contributor) commented on Jun 26, 2025

Refactor `Measure` block metrics to be homeserver-scoped (add `server_name` label to block metrics).

Part of #18592

This is an alternative to PR #18591 (which used a homeserver-scoped `CollectorRegistry`); this PR instead adds a `server_name` label to the metrics. See #18592 for more context.

Testing strategy

See behavior of previous metrics listener

  1. Add the metrics listener in your `homeserver.yaml`:
     ```yaml
     listeners:
       - port: 9323
         type: metrics
         bind_addresses: ['127.0.0.1']
     ```
  2. Start the homeserver: `poetry run synapse_homeserver --config-path homeserver.yaml`
  3. Fetch http://localhost:9323/metrics
  4. Observe that the response includes the block metrics (`synapse_util_metrics_block_count`, `synapse_util_metrics_block_in_flight`, etc.); see the sketch after this list.
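
If you would rather script steps 3-4 than eyeball the output, here is a minimal sketch. The port and metric names match the config above; the script itself is illustrative and not part of the PR:

```python
from urllib.request import urlopen

# Fetch the dedicated metrics listener and confirm the block metrics are
# present in the exposition output. Assumes the homeserver configured in the
# steps above is running with the metrics listener on port 9323.
EXPECTED_METRIC_PREFIXES = [
    "synapse_util_metrics_block_count",
    "synapse_util_metrics_block_in_flight",
]

body = urlopen("http://localhost:9323/metrics").read().decode("utf-8")
for prefix in EXPECTED_METRIC_PREFIXES:
    assert prefix in body, f"{prefix} missing from /metrics output"
print("All expected block metrics are exposed")
```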

See behavior of the http metrics resource

  1. Add the metrics resource to a new or existing `http` listener in your `homeserver.yaml`:
     ```yaml
     listeners:
       - port: 9322
         type: http
         bind_addresses: ['127.0.0.1']
         resources:
           - names: [metrics]
             compress: false
     ```
  2. Start the homeserver: `poetry run synapse_homeserver --config-path homeserver.yaml`
  3. Fetch http://localhost:9322/_synapse/metrics (it's just a GET request, so you can even do it in the browser)
  4. Observe that the response includes the block metrics (`synapse_util_metrics_block_count`, `synapse_util_metrics_block_in_flight`, etc.): example, example from develop. See the sketch after this list.
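
A similar hedged sketch for verifying step 4 programmatically. It assumes `prometheus_client` is installed and that the new label ends up named `server_name` (as settled later in this review); the script is illustrative and not part of the PR:

```python
from urllib.request import urlopen

from prometheus_client.parser import text_string_to_metric_families

# Parse the exposition text from the http metrics resource and confirm every
# block metric sample carries a server_name label.
body = urlopen("http://localhost:9322/_synapse/metrics").read().decode("utf-8")

for family in text_string_to_metric_families(body):
    if not family.name.startswith("synapse_util_metrics_block"):
        continue
    for sample in family.samples:
        assert "server_name" in sample.labels, (
            f"{sample.name} is missing the server_name label: {sample.labels}"
        )
print("All block metric samples are labelled with server_name")
```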

Pull Request Checklist

  • Pull request is based on the develop branch
  • Pull request includes a changelog file. The entry should:
    • Be a short description of your change which makes sense to users. "Fixed a bug that prevented receiving messages from other servers." instead of "Moved X method from EventStore to EventWorkerStore.".
    • Use markdown where necessary, mostly for code blocks.
    • End with either a period (.) or an exclamation mark (!).
    • Start with a capital letter.
    • Feel free to credit yourself, by adding a sentence "Contributed by @github_username." or "Contributed by [Your Name]." to the end of the entry.
  • Code style is correct (run the linters)

…round

Fix mypy complaints

```
synapse/handlers/delayed_events.py:266: error: Cannot determine type of "validator"  [has-type]
synapse/handlers/delayed_events.py:267: error: Cannot determine type of "event_builder_factory"  [has-type]
```
Comment on lines 69 to 80
```python
INSTANCE_LABEL_NAME = "instance"
"""
The standard Prometheus label name used to identify which server instance the metrics
came from.
In the case of a Synapse homeserver, this should be set to the homeserver name
(`hs.hostname`).
Normally, this would be set automatically by the Prometheus server scraping the data but
since we support multiple instances of Synapse running in the same process and all
metrics are in a single global `REGISTRY`, we need to manually label any metrics.
"""
```
@MadLittleMods (Contributor, Author) commented on Jun 26, 2025

The only decision here is whether we want to use the standard Prometheus instance label name. Or we could possibly use server_name as the label and then use a relabel_config in Prometheus to rename server_name -> instance.

Seems better just to use the standard instance name to avoid the complication. It just slightly sucks because hs.get_instance_name() is also a thing (confusingly) when we really want to use hs.hostname as the value here.

We also get to set `self.server_name = hs.hostname` in a lot of places, which is probably easier to digest.

@sandhose (Member) commented:

I'd prefer for us to use server_name, for me instance has special meaning. I think it would conflict if we have multiple workers for the same 'instance' for example?

@MadLittleMods (Contributor, Author) replied:

👍 I think I agree: `instance` has special meaning according to the Prometheus docs and should be "The <host>:<port> part of the target's URL that was scraped" (source) (also: "In Prometheus terms, an endpoint you can scrape is called an instance, usually corresponding to a single process."), which is different from how we're using it here. Following those conventions, it makes sense to use a different label.

For reference, we currently don't follow this pattern with matrix.org (see the matrix.org Prometheus scrape config, which uses `metric_relabel_configs` and a `labels` config): we hard-code the `instance` label to `matrix.org` (probably not the correct thing to do), differentiate workers by worker type with the `job` label (correct usage), and add `index` labels when there are multiple workers of that type.

Example of current metrics on matrix.org

Notice they all use `instance="matrix.org"`:

(source)

```
synapse_util_metrics_block_count_total{block_name="_calculate_state_and_extrem", environment="live", host="grindylow.matrix.org", identifier="matrix.org", index="1", instance="matrix.org", job="synapse_event_persister", service="synapse"}
synapse_util_metrics_block_count_total{block_name="_fetch_event_list", environment="live", host="doxy.matrix.org", identifier="matrix.org", index="52", instance="matrix.org", job="synapse_synchrotron", service="synapse"}
synapse_util_metrics_block_count_total{block_name="action_for_event_by_user", environment="live", host="grindylow.matrix.org", identifier="matrix.org", index="2", instance="matrix.org", job="synapse_event_creator_users", service="synapse"}
```

I wish Prometheus had some guidance on label names to use in these situations like OpenTelemetry has with their semantic conventions for spans and attributes. The only docs I can find are https://prometheus.io/docs/concepts/jobs_instances/ and https://prometheus.io/docs/practices/naming/ which don't mention anything beyond job and instance.

server_name sounds good to me 👍
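
To make the agreed-on shape concrete, here is a minimal `prometheus_client` sketch (illustrative only, not the actual Synapse code; the metric description and hostnames are placeholders):

```python
from prometheus_client import Counter

# All homeservers in the process share the one global REGISTRY, so the
# server_name label is what keeps their samples apart.
block_counter = Counter(
    "synapse_util_metrics_block_count",
    "Number of times a Measure block has been entered (placeholder description)",
    labelnames=["block_name", "server_name"],
)

# Each homeserver stamps its own samples with its hostname (hs.hostname in
# Synapse terms); the values below are placeholders.
block_counter.labels(block_name="persist_events", server_name="hs1.example.com").inc()
block_counter.labels(block_name="persist_events", server_name="hs2.example.com").inc()
```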

We would see this in the logs before:
```
Failed to save metrics! Usage: <ContextResourceUsage ...> Error: Incorrect label names
```
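
For context, "Incorrect label names" is the `ValueError` that `prometheus_client` raises when `.labels()` is called with label names the metric was not declared with. A minimal reproduction (illustrative, not the Synapse code):

```python
from prometheus_client import Counter

# Declared with a server_name label only...
example = Counter("example_blocks", "Example counter", labelnames=["server_name"])

try:
    # ...so passing a different label name raises ValueError("Incorrect label names").
    example.labels(instance="example.com").inc()
except ValueError as exc:
    print(exc)  # -> Incorrect label names
```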
@MadLittleMods reopened this on Jun 26, 2025
@MadLittleMods marked this pull request as ready for review on June 26, 2025 22:34
@MadLittleMods requested a review from a team as a code owner on June 26, 2025 22:34
@sandhose (Member) left a comment:

Thanks for making this very easy to review commit by commit! Just would like to see the 'server name' label changed, but other than that LGTM

@MadLittleMods requested a review from sandhose on July 3, 2025 21:24
@sandhose (Member) left a comment:

Sorry for the delay, LGTM!

@MadLittleMods merged commit fc10a5e into develop on Jul 15, 2025
43 checks passed
@MadLittleMods deleted the madlittlemods/per-hs-metrics-measure3 branch on July 15, 2025 20:55
@MadLittleMods (Contributor, Author) commented:

Thanks for the review @sandhose 🦙
