Conversation

@MadLittleMods (Contributor) commented on Jun 26, 2025

Refactor `Measure` block metrics to be homeserver-scoped (add `server_name` label to block metrics).

Part of #18592

This is an alternative to PR #18591 (which used a homeserver-scoped `CollectorRegistry`); this PR instead adds a `server_name` label to the metrics. See #18592 for more context.

Testing strategy

See behavior of previous metrics listener

  1. Add the metrics listener in your `homeserver.yaml`:
     ```yaml
     listeners:
       - port: 9323
         type: metrics
         bind_addresses: ['127.0.0.1']
     ```
  2. Start the homeserver: `poetry run synapse_homeserver --config-path homeserver.yaml`
  3. Fetch http://localhost:9323/metrics
  4. Observe that the response includes the block metrics (`synapse_util_metrics_block_count`, `synapse_util_metrics_block_in_flight`, etc.); see the sketch after this list.
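
If you would rather script steps 3-4 than eyeball the output, here is a minimal sketch. The port and metric names match the config above; the script itself is illustrative and not part of the PR:

```python
from urllib.request import urlopen

# Fetch the dedicated metrics listener and confirm the block metrics are
# present in the exposition output. Assumes the homeserver configured in the
# steps above is running with the metrics listener on port 9323.
EXPECTED_METRIC_PREFIXES = [
    "synapse_util_metrics_block_count",
    "synapse_util_metrics_block_in_flight",
]

body = urlopen("http://localhost:9323/metrics").read().decode("utf-8")
for prefix in EXPECTED_METRIC_PREFIXES:
    assert prefix in body, f"{prefix} missing from /metrics output"
print("All expected block metrics are exposed")
```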

See behavior of the http metrics resource

  1. Add the metrics resource to a new or existing `http` listener in your `homeserver.yaml`:
     ```yaml
     listeners:
       - port: 9322
         type: http
         bind_addresses: ['127.0.0.1']
         resources:
           - names: [metrics]
             compress: false
     ```
  2. Start the homeserver: `poetry run synapse_homeserver --config-path homeserver.yaml`
  3. Fetch http://localhost:9322/_synapse/metrics (it's just a GET request, so you can even do it in the browser)
  4. Observe that the response includes the block metrics (`synapse_util_metrics_block_count`, `synapse_util_metrics_block_in_flight`, etc.): example, example from develop. See the sketch after this list.
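
A similar hedged sketch for verifying step 4 programmatically. It assumes `prometheus_client` is installed and that the new label ends up named `server_name` (as settled later in this review); the script is illustrative and not part of the PR:

```python
from urllib.request import urlopen

from prometheus_client.parser import text_string_to_metric_families

# Parse the exposition text from the http metrics resource and confirm every
# block metric sample carries a server_name label.
body = urlopen("http://localhost:9322/_synapse/metrics").read().decode("utf-8")

for family in text_string_to_metric_families(body):
    if not family.name.startswith("synapse_util_metrics_block"):
        continue
    for sample in family.samples:
        assert "server_name" in sample.labels, (
            f"{sample.name} is missing the server_name label: {sample.labels}"
        )
print("All block metric samples are labelled with server_name")
```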

Pull Request Checklist

  • Pull request is based on the develop branch
  • Pull request includes a changelog file. The entry should:
    • Be a short description of your change which makes sense to users. "Fixed a bug that prevented receiving messages from other servers." instead of "Moved X method from EventStore to EventWorkerStore.".
    • Use markdown where necessary, mostly for code blocks.
    • End with either a period (.) or an exclamation mark (!).
    • Start with a capital letter.
    • Feel free to credit yourself, by adding a sentence "Contributed by @github_username." or "Contributed by [Your Name]." to the end of the entry.
  • Code style is correct (run the linters)

…round

Fix mypy complaints

```
synapse/handlers/delayed_events.py:266: error: Cannot determine type of "validator"  [has-type]
synapse/handlers/delayed_events.py:267: error: Cannot determine type of "event_builder_factory"  [has-type]
```
Comment on lines 69 to 80
```python
INSTANCE_LABEL_NAME = "instance"
"""
The standard Prometheus label name used to identify which server instance the metrics
came from.
In the case of a Synapse homeserver, this should be set to the homeserver name
(`hs.hostname`).
Normally, this would be set automatically by the Prometheus server scraping the data but
since we support multiple instances of Synapse running in the same process and all
metrics are in a single global `REGISTRY`, we need to manually label any metrics.
"""
```
@MadLittleMods (Contributor, Author) commented on Jun 26, 2025

The only decision here is whether we want to use the standard Prometheus instance label name. Or we could possibly use server_name as the label and then use a relabel_config in Prometheus to rename server_name -> instance.

Seems better just to use the standard instance name to avoid the complication. It just slightly sucks because hs.get_instance_name() is also a thing (confusingly) when we really want to use hs.hostname as the value here.

We also get to set `self.server_name = hs.hostname` in a lot of places, which is probably easier to digest.

@sandhose (Member) commented:

I'd prefer for us to use server_name, for me instance has special meaning. I think it would conflict if we have multiple workers for the same 'instance' for example?

@MadLittleMods (Contributor, Author) replied:

👍 I think I agree: `instance` has special meaning according to the Prometheus docs and should be "The <host>:<port> part of the target's URL that was scraped" (source) (also: "In Prometheus terms, an endpoint you can scrape is called an instance, usually corresponding to a single process."), which is different from how we're using it here. Following those conventions, it makes sense to use a different label.

For reference, we currently don't follow this pattern with matrix.org (see the matrix.org Prometheus scrape config, which uses `metric_relabel_configs` and a `labels` config): we hard-code the `instance` label to `matrix.org` (probably not the correct thing to do), differentiate workers by worker type with the `job` label (correct usage), and add `index` labels when there are multiple workers of that type.

Example of current metrics on matrix.org

Notice they all use `instance="matrix.org"`:

(source)

```
synapse_util_metrics_block_count_total{block_name="_calculate_state_and_extrem", environment="live", host="grindylow.matrix.org", identifier="matrix.org", index="1", instance="matrix.org", job="synapse_event_persister", service="synapse"}
synapse_util_metrics_block_count_total{block_name="_fetch_event_list", environment="live", host="doxy.matrix.org", identifier="matrix.org", index="52", instance="matrix.org", job="synapse_synchrotron", service="synapse"}
synapse_util_metrics_block_count_total{block_name="action_for_event_by_user", environment="live", host="grindylow.matrix.org", identifier="matrix.org", index="2", instance="matrix.org", job="synapse_event_creator_users", service="synapse"}
```

I wish Prometheus had some guidance on label names to use in these situations like OpenTelemetry has with their semantic conventions for spans and attributes. The only docs I can find are https://prometheus.io/docs/concepts/jobs_instances/ and https://prometheus.io/docs/practices/naming/ which don't mention anything beyond job and instance.

server_name sounds good to me 👍
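
To make the agreed-on shape concrete, here is a minimal `prometheus_client` sketch (illustrative only, not the actual Synapse code; the metric description and hostnames are placeholders):

```python
from prometheus_client import Counter

# All homeservers in the process share the one global REGISTRY, so the
# server_name label is what keeps their samples apart.
block_counter = Counter(
    "synapse_util_metrics_block_count",
    "Number of times a Measure block has been entered (placeholder description)",
    labelnames=["block_name", "server_name"],
)

# Each homeserver stamps its own samples with its hostname (hs.hostname in
# Synapse terms); the values below are placeholders.
block_counter.labels(block_name="persist_events", server_name="hs1.example.com").inc()
block_counter.labels(block_name="persist_events", server_name="hs2.example.com").inc()
```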

We would see this in the logs before:
```
Failed to save metrics! Usage: <ContextResourceUsage ...> Error: Incorrect label names
```
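
For context, "Incorrect label names" is the `ValueError` that `prometheus_client` raises when `.labels()` is called with label names the metric was not declared with. A minimal reproduction (illustrative, not the Synapse code):

```python
from prometheus_client import Counter

# Declared with a server_name label only...
example = Counter("example_blocks", "Example counter", labelnames=["server_name"])

try:
    # ...so passing a different label name raises ValueError("Incorrect label names").
    example.labels(instance="example.com").inc()
except ValueError as exc:
    print(exc)  # -> Incorrect label names
```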
@MadLittleMods reopened this on Jun 26, 2025
@MadLittleMods marked this pull request as ready for review on June 26, 2025 22:34
@MadLittleMods requested a review from a team as a code owner on June 26, 2025 22:34
@sandhose (Member) left a comment:

Thanks for making this very easy to review commit by commit! Just would like to see the 'server name' label changed, but other than that LGTM

@MadLittleMods requested a review from sandhose on July 3, 2025 21:24
@sandhose (Member) left a comment:

Sorry for the delay, LGTM!

@MadLittleMods merged commit fc10a5e into develop on Jul 15, 2025
43 checks passed
@MadLittleMods deleted the madlittlemods/per-hs-metrics-measure3 branch on July 15, 2025 20:55
@MadLittleMods (Contributor, Author) commented:

Thanks for the review @sandhose 🦙
