Remove metrics listener type in favor of http listener with metrics resource
#18584
Conversation
@@ -0,0 +1 @@
Remove `metrics` listener (serves metrics at `/metrics`) type in favor of `http` listener with `metrics` resource (serves metrics at `/_synapse/metrics`).
Opinions on calling out the upgrade notes from the changelog entry itself? It looks like we usually prefer to call it out at the top of the changelog section for the version. It doesn't seem like it would hurt to have it in both places. The only problem is the chicken-and-egg problem of the upgrade notes being written later in order to have the link filled in.
Suggested change:
- Remove `metrics` listener (serves metrics at `/metrics`) type in favor of `http` listener with `metrics` resource (serves metrics at `/_synapse/metrics`).
+ Remove `metrics` listener (serves metrics at `/metrics`) type in favor of `http` listener with `metrics` resource (serves metrics at `/_synapse/metrics`). Please check the [relevant section in the upgrade notes](TODO).
# `prometheus_client.metrics` was added in 0.5.0, so we require that too.
# We chose 0.6.0 as that is the current version in Debian Buster (oldstable).
prometheus-client = ">=0.6.0"
The only change is here.
The rest is auto-formatting which I can remove from the diff if prompted.
Deleted this file in favor of re-organizing the remaining pieces into files with more relevant names.
class PrometheusMetricsHackTestCase(unittest.HomeserverTestCase):
    if parse_version(metadata.version("prometheus_client")) < parse_version("0.14.0"):
        skip = "prometheus-client too old"

    def test_created_metrics_disabled(self) -> None:
        """
        Tests that a brittle hack, to disable `_created` metrics, works.
        This involves poking at the internals of prometheus-client.
        It's not the end of the world if this doesn't work.
        This test gives us a way to notice if prometheus-client changes
        their internals.
        """
Removed this test as it claims to "give us a way to notice if prometheus-client changes their internals", but we already have that benefit from the hack function `_set_prometheus_client_use_created_metrics` itself (it will yell if the `_use_created` attribute is no longer used).
The test provides no further benefit besides asserting that the attribute was set, which feels pretty useless. A proper test would ensure that before running the utility we see the legacy `_created` metrics, and after we don't.
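A sketch of that kind of check (purely illustrative, not code from this PR; it uses a throwaway `CollectorRegistry` and a made-up counter name, and assumes the `_created` series have already been disabled the way Synapse does at startup):

```python
from prometheus_client import CollectorRegistry, Counter, generate_latest


def check_no_created_series() -> None:
    # Register a throwaway counter in an isolated registry and render the
    # exposition text, the same way a Prometheus scrape would see it.
    registry = CollectorRegistry()
    Counter("example_total", "An example counter", registry=registry).inc()
    output = generate_latest(registry).decode("utf-8")

    # With the `_created` series disabled, no `example_created` sample should
    # appear in the output; with them enabled, it would.
    assert "example_created" not in output
```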
from prometheus_client import generate_latest

from synapse.metrics import REGISTRY
Stray thing that I noticed: making it consistent that we pull `generate_latest` from `synapse.metrics`, since it's something we're explicitly exporting.
synapse/synapse/metrics/__init__.py
Lines 477 to 480 in 3cabaa8
__all__ = [
    "Collector",
    "MetricsResource",
    "generate_latest",
@@ -0,0 +1 @@
Remove `metrics` listener (serves metrics at `/metrics`) type in favor of `http` listener with `metrics` resource (serves metrics at `/_synapse/metrics`).
It's unclear to me why metrics was its own listener type before. The only difference I can see is that they are served on different paths: /metrics vs /_synapse/metrics. Feels like we just need to include this in the upgrade notes (drafted in the PR description).
One thing that @erikjohnston pointed out is that the dedicated metrics listener uses its own thread and doesn't rely on the Twisted reactor, so it still works regardless of how well Synapse is working. It's unclear if we really care about this benefit. It does have a theoretical benefit of being able to monitor a really bogged-down Synapse.
Looking at the git history, the threading aspect just seems like the way it was originally implemented and not something explicitly called out for the benefit described above.
Even with the updated conclusion from #18592 (comment), which most likely means we will continue to use the global REGISTRY, I think it's still worth distilling down to a single way to set up metrics, both for config clarity and to keep a single source of truth in the codebase.
(assuming we can't come up with a benefit/compelling reason for why we want both)
Having thought about it a bit: I think it is worth keeping the separate metrics listener. I can't find where (and I looked), but I believe we have ported stuff to use the separate metrics listener in the past to get metrics from the process when it's under extreme load (to the extent where Twisted can't process new requests).
Even if we keep the separate listener, I think we should still make them render from the same source and remove any differences. If needed, it's easy enough to change the metrics listener to not use the inbuilt prometheus one, but instead replace it by spawning a thread and starting a basic web server that responds to /metrics.
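For what it's worth, a minimal sketch of that fallback, assuming the stock `prometheus_client.start_http_server` helper (the port is arbitrary and this isn't code from the PR):

```python
from prometheus_client import start_http_server

from synapse.metrics import REGISTRY

# Serve the shared registry from prometheus-client's small threaded HTTP
# server; scrapes keep working even when the Twisted reactor is bogged down.
start_http_server(9323, registry=REGISTRY)
```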
> separate `metrics` listener in the past to get metrics from the process when it's under extreme load (to the extent where Twisted can't process new requests)
How sure are we that we care about this? Feels like a use case that we're just holding onto, especially when we can't find any evidence.
From what I'm reading, `threading.Thread` (which `prometheus_client.start_http_server_prometheus(...)` uses) shares the same process and memory space, so it doesn't even help us in the scenarios where Synapse is under extreme load.
[...] the threading module operates within a single process, meaning that all threads share the same memory space. However, the GIL limits the performance gains of threading when it comes to CPU-bound tasks, as only one thread can execute Python bytecode at a time.
-- https://docs.python.org/3/library/threading.html#gil-and-performance-considerations
> Even if we keep the separate listener, I think we should still make them render from the same source and remove any differences.
I don't see an easy way forward, as I don't know how to start another Twisted server in another thread in order to share the `MetricsResource`. It's a pretty simple resource, so trying to abstract it any further isn't useful and we might as well keep the status quo of disparate handling 🤷
I can pull out some of the clean-up from this PR to ship separately.
We relatively often have episodes on matrix.org where the reactor tick times are in the multiples of seconds or tens of seconds — this seems like a case where the separate metrics thread is a winner to me, so that metric exposition continues to work in this situation.
The reason why this helps, despite the GIL, is because the metrics thread can preempt the reactor thread between bytecode boundaries and the metrics thread gets scheduled with roughly equal priority to the reactor thread, whereas (afaik) there isn't an easy mechanism to do something similar between separate twisted tasks/deferreds on the same reactor (and deferreds are co-operatively scheduled so we can't actually force one to yield to another).
For illustration: if you want to disrupt metrics reporting for the reactor version, all you need to do is time.sleep(5), or open a file on a slow disk, or run a computationally-heavy loop without any yield points.
The more realistic case is likely just having lots of tasks in the reactor queue. In this case, the mental model is that the metrics request is scheduled with roughly equal priority to all the user requests (which, in some dodgy scenarios, there could be thousands of).
Thanks for the extra details @reivilibre 🙂
That proves well enough that this actually works how we want and we've already expressed interest in wanting the benefit ⏩
Here is some prior art that I missed earlier that confirms why we care about having this:
Lines 41 to 45 in fc10a5e
The second method runs the metrics server on a different port, in a
different thread to Synapse. This can make it more resilient to
heavy load meaning metrics cannot be retrieved, and can be exposed
to just internal networks easier. The served metrics are available
over HTTP only, and will be available at `/_synapse/metrics`.
And traced it back to the PR that introduced this language, matrix-org/synapse#3274, which also states:
This should allow us to be resistant to Twisted melting down meaning we don't get metrics, in theory.
When I originally traced back to where the metrics listener type was introduced, I thought it was matrix-org/synapse#5636 (which doesn't have the context) but it actually goes further back to that PR.
In addition to the docs we already have, I've also added the context and @reivilibre's explanation for why it works to some comments in the code itself, see #18687
Clean up `MetricsResource`, Prometheus hacks (`_set_prometheus_client_use_created_metrics`), and better document why we care about having a separate `metrics` listener type. These clean-up changes have been split out from #18584 since that PR was closed.
Remove `metrics` listener type in favor of `http` listener with `metrics` resource (which already exists).

This is spawning from wanting to refactor which registry the metrics are pulled from and noticing two sources of truth. The refactor itself is spawning from running multiple Synapse instances in the same Python process and wanting separate metrics (see #18592).
This will require homeserver admins to update their homeserver configs in some cases. I've drafted upgrade notes for the release below.
Split out from #18443
Part of #18592
Why did we have a separate `metrics` listener type? See discussion below: #18584 (comment)
Draft upgrade notes
Drafted to make it easy to add to `docs/upgrade.md` for whoever is making the release.

Upgrade notes
Upgrading to vUPCOMING
Dedicated `metrics` listener removed in favor of `metrics` resource in the `http` listener

The dedicated `type: metrics` listener has been removed. Use the `metrics` resource in a `type: http` listener instead.

Example before (`homeserver.yaml`):
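A sketch of the old style (port and bind address here are placeholders, not values from this PR):

```yaml
listeners:
  # Dedicated listener type; metrics are served at /metrics
  - port: 9323
    type: metrics
    bind_addresses: ["127.0.0.1"]
```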
Example after (`homeserver.yaml`):
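A sketch of the replacement (again with placeholder port/address; the relevant parts are `type: http` and the `metrics` resource name):

```yaml
listeners:
  # http listener with the metrics resource; metrics are served at /_synapse/metrics
  - port: 9322
    type: http
    bind_addresses: ["127.0.0.1"]
    resources:
      - names: [metrics]
```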
Note: the endpoint path changes from `/metrics` to `/_synapse/metrics`; update your metrics scraper (e.g., Prometheus) accordingly (an example scrape job is sketched below).

See the Metrics How-to page for more information on how to set up and configure metrics.
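For example, a Prometheus scrape job would change along these lines (job name and target are placeholders):

```yaml
scrape_configs:
  - job_name: "synapse"
    # The default path of /metrics no longer works; point the scrape at the new path.
    metrics_path: "/_synapse/metrics"
    static_configs:
      - targets: ["localhost:9322"]
```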
Testing strategy
See behavior of the previous `metrics` listener:
1. Add a `metrics` listener in your `homeserver.yaml`
2. Run `poetry run synapse_homeserver --config-path homeserver.yaml`
3. Fetch `http://localhost:9323/metrics`

See behavior of the `http` listener with the `metrics` resource:
1. Add the `metrics` resource to a new or existing `http` listener in your `homeserver.yaml`
2. Run `poetry run synapse_homeserver --config-path homeserver.yaml`
3. Fetch `http://localhost:9322/_synapse/metrics` (it's just a `GET` request so you can even do it in the browser)

Compared to `develop`: there is no difference in the response from this PR to `develop`, or from this resource to the `metrics` listener being removed.

Dev notes
- `metrics` listener uses `listen_metrics`
- `http` listener with `metrics` resource uses `MetricsResource`

Related PRs:
`_set_prometheus_client_use_created_metrics`

`_set_prometheus_client_use_created_metrics()` was added in matrix-org/synapse#13540 (2022-08-16). But now this functionality has dedicated functions in `prometheus_client` via `disable_created_metrics()`/`enable_created_metrics()`, which were added in prometheus/client_python#973 (2023-10-26) and shipped in a subsequent `prometheus_client` release (2023-10-30).

Our deprecation policy states that we support whatever is provided by Debian oldstable, which for `python-prometheus-client` is currently `0.6.0-1` (too old).
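If/when the minimum version allows it, the hack could presumably be replaced by the upstream helpers along these lines (a sketch; the import location is an assumption and may differ between prometheus-client versions):

```python
# Upstream replacement for the hack (import path assumed to be the top-level
# prometheus_client package in recent releases).
from prometheus_client import disable_created_metrics

disable_created_metrics()  # stop emitting the legacy `*_created` series
```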
Todo

- `_set_prometheus_client_use_created_metrics`