-
Notifications
You must be signed in to change notification settings - Fork 404
Refactor Measure block metrics to be homeserver-scoped (v2)
#18601
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Refactor Measure block metrics to be homeserver-scoped (v2)
#18601
Conversation
Update `@measure_func` docstring
…round Fix mypy complaints ``` synapse/handlers/delayed_events.py:266: error: Cannot determine type of "validator" [has-type] synapse/handlers/delayed_events.py:267: error: Cannot determine type of "event_builder_factory" [has-type] ```
synapse/metrics/__init__.py
Outdated
| INSTANCE_LABEL_NAME = "instance" | ||
| """ | ||
| The standard Prometheus label name used to identify which server instance the metrics | ||
| came from. | ||
| In the case of a Synapse homeserver, this should be set to the homeserver name | ||
| (`hs.hostname`). | ||
| Normally, this would be set automatically by the Prometheus server scraping the data but | ||
| since we support multiple instances of Synapse running in the same process and all | ||
| metrics are in a single global `REGISTRY`, we need to manually label any metrics. | ||
| """ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The only decision here is whether we want to use the standard Prometheus instance label name. Or we could possibly use server_name as the label and then use a relabel_config in Prometheus to rename server_name -> instance.
Seems better just to use the standard instance name to avoid the complication. It just slightly sucks because hs.get_instance_name() is also a thing (confusingly) when we really want to use hs.hostname as the value here.
We also to relabel self.server_name = hs.hostname in a lot of places which is probably easier to digest.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'd prefer for us to use server_name, for me instance has special meaning. I think it would conflict if we have multiple workers for the same 'instance' for example?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
👍 I think I agree that instance has special meaning according to the Prometheus docs and should be "The <host>:<port> part of the target's URL that was scraped." (source) (also: "In Prometheus terms, an endpoint you can scrape is called an instance, usually corresponding to a single process.") which is different from how we're using it here. And based on following the conventions, it makes sense to use a different label.
For reference, we currently don't follow this pattern with matrix.org (see the matrix.org Prometheus scrape config using metric_relabel_configs and labels config) and hard-code the instance label to matrix.org (probably not the correct thing to do) and differentiate workers with the job label for the type of worker (correct usage) and also some index labels if we have multiple of that worker type.
Example of current metrics on matrix.org
Notice they all use instance="matrix.org":
(source)
synapse_util_metrics_block_count_total{block_name="_calculate_state_and_extrem", environment="live", host="grindylow.matrix.org", identifier="matrix.org", index="1", instance="matrix.org", job="synapse_event_persister", service="synapse"}
synapse_util_metrics_block_count_total{block_name="_fetch_event_list", environment="live", host="doxy.matrix.org", identifier="matrix.org", index="52", instance="matrix.org", job="synapse_synchrotron", service="synapse"}
synapse_util_metrics_block_count_total{block_name="action_for_event_by_user", environment="live", host="grindylow.matrix.org", identifier="matrix.org", index="2", instance="matrix.org", job="synapse_event_creator_users", service="synapse"}
I wish Prometheus had some guidance on label names to use in these situations like OpenTelemetry has with their semantic conventions for spans and attributes. The only docs I can find are https://prometheus.io/docs/concepts/jobs_instances/ and https://prometheus.io/docs/practices/naming/ which don't mention anything beyond job and instance.
server_name sounds good to me 👍
We would see this in the logs before: ``` Failed to save metrics! Usage: <ContextResourceUsage ...> Error: Incorrect label names ```
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for making this very easy to review commit by commit! Just would like to see the 'server name' label changed, but other than that LGTM
synapse/metrics/__init__.py
Outdated
| INSTANCE_LABEL_NAME = "instance" | ||
| """ | ||
| The standard Prometheus label name used to identify which server instance the metrics | ||
| came from. | ||
| In the case of a Synapse homeserver, this should be set to the homeserver name | ||
| (`hs.hostname`). | ||
| Normally, this would be set automatically by the Prometheus server scraping the data but | ||
| since we support multiple instances of Synapse running in the same process and all | ||
| metrics are in a single global `REGISTRY`, we need to manually label any metrics. | ||
| """ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'd prefer for us to use server_name, for me instance has special meaning. I think it would conflict if we have multiple workers for the same 'instance' for example?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sorry for the delay, LGTM!
|
Thanks for the review @sandhose 🦙 |
Refactor
Measureblock metrics to be homeserver-scoped (addinstanceserver_namelabel to block metrics).Part of #18592
This is an alternative PR to #18591 (used homeserver-scoped
CollectorRegistry) vs this PR which addsinstanceserver_namelabels to the metrics. See #18592 for more context.Testing strategy
See behavior of previous
metricslistenermetricslistener in yourhomeserver.yamlpoetry run synapse_homeserver --config-path homeserver.yamlhttp://localhost:9323/metricssynapse_util_metrics_block_count,synapse_util_metrics_block_in_flight, etc)See behavior of the
httpmetricsresourcemetricsresource to a new or existinghttplisteners in yourhomeserver.yamlpoetry run synapse_homeserver --config-path homeserver.yamlhttp://localhost:9322/_synapse/metrics(it's just aGETrequest so you can even do in the browser)synapse_util_metrics_block_count,synapse_util_metrics_block_in_flight, etc): example, example fromdevelopPull Request Checklist
EventStoretoEventWorkerStore.".code blocks.