Add Prometheus timers to the subsystems #1923
Conversation
AFAIK prometheus is not really meant for performance metrics? https://github.com/tokio-rs/tracing is probably better suited for this?
I see prometheus and tracing as answering different questions. Prometheus answers the question, "What is taking a while?"; it lets us collect aggregated statistics about the performance characteristics of our async functions. Tracing answers the question, "What is the sequence of events that is happening?" Both can be useful--in particular, tracing is a lot less worried about data cardinality, and so encourages us to tag events by e.g. the relay parent hash to which they're attached--but given that we already have prometheus set up and integrated, I'm going to finish this up, then work on tracing as a separate task.
Note that all metrics currently use
ordian left a comment
Agree with Basti that for identifying bottlenecks tracing-timing fits better.
OTOH, having timing information on a live chain in combination with alerts might be helpful.
Curious to know what @mxinden thinks about this.
If we decide to keep the metrics, we should reduce them to the known slowest offenders (not all of them are equally useful) and maybe use a specific suffix for timing metrics.
Is the motivation for this pull request a currently existing CPU bottleneck, or the general goal of catching such bottlenecks quickly in the future? As far as I can tell the observation timespan of some of these include

That said, I don't think these metrics are particularly costly. While histograms are more expensive than counters, this pull request does not use any
Thanks for your input @mxinden!
The latter: having metrics on a live chain will help us identify a problem more quickly.
That's a good point, but observation timespans will give us an understanding of how long e.g. certain requests take end-to-end. But I agree that sampling profilers like
Instead, get these values with

```sh
target/release/adder-collator export-genesis-state
target/release/adder-collator export-genesis-wasm
```

And then register the parachain on https://polkadot.js.org/apps/?rpc=ws%3A%2F%2F127.0.0.1%3A9944#/explorer

To collect prometheus data, after running the script, create `prometheus.yml` per the instructions at https://www.notion.so/paritytechnologies/Setting-up-Prometheus-locally-835cb3a9df7541a781c381006252b5ff and then run:

```sh
docker run -v `pwd`/prometheus.yml:/etc/prometheus/prometheus.yml:z --network host prom/prometheus
```

Demonstrates that data makes it across to prometheus, though it is likely to be useful in the future to tweak the buckets.
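Since the Notion page may not be accessible to everyone, here is a minimal sketch of a `prometheus.yml` that should be enough for a local test. It assumes the node exposes its metrics endpoint on the default Substrate Prometheus port 9615; adjust the scrape target if your setup differs.

```sh
# Sketch only: writes a minimal scrape config into the current directory,
# which the docker command above then mounts into the container.
# The target assumes the node's /metrics endpoint is on 127.0.0.1:9615.
cat > prometheus.yml <<'EOF'
global:
  scrape_interval: 5s

scrape_configs:
  - job_name: polkadot
    static_configs:
      - targets: ['127.0.0.1:9615']
EOF
```

Because the container runs with `--network host`, Prometheus can scrape the node on localhost directly.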
@@ -0,0 +1,191 @@
#!/usr/bin/env bash
Should we consider writing some rust harness code and verifying the prometheus events? I would prefer that over a complex bash script.
That's not a bad idea, but unless there's library support for submitting transactions to replace polkadot-js, it doesn't really get us much. Bash is an ugly language, but it's hard to beat it when your task is fundamentally to coordinate launching a bunch of stuff on the command line.
That said, if there's library support in Rust so it's possible to automate the registration, then that's a game-changer. I just don't know of it, if it exists.
I think for now this is ok. Created a tracking issue #1991
We support sending transactions from Rust tests; there is no need to use polkadot-js for this. Did you even take a look at the integration test of the collator?
Yes, but my intent was to run them standalone for debugging purposes, not as part of the test suite. It's easier to reach for `scripts/adder-collator.sh` than `cargo test -p whatever-module -- --nocapture --no-timeout`, or whatever the proper incantation ends up being.
I don't really agree with that; usually it should always be run with the test suite, not only by some subset of devs that are aware of its existence.
The approach of adding shell scripts imho also doesn't scale well beyond 3.
ordian left a comment
Could you show us a screenshot of what the histogram metrics look like, and resolve the merge conflicts, please?
Co-authored-by: Andronik Ordian <write@reusable.software>
Prometheus doesn't render the histogram metrics, unfortunately; that's left to integration with Grafana. I verified that data is copied into Prometheus, but haven't bothered setting up Grafana for demo purposes: it felt like something which would take a fair amount of time (mostly because I'm unfamiliar with Grafana) without actually showing anything new; Prometheus and Grafana are well known to work together. What you can do right now is query the tabular data in Prometheus by querying the
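For example (a sketch, not taken from this PR: the metric name is a placeholder and the port assumes Prometheus's default of 9090), the raw histogram series can be queried through Prometheus's HTTP API:

```sh
# Placeholder metric name; substitute one of the timing histograms added here.
# histogram_quantile over the per-second rate of the _bucket series gives an
# estimated quantile from the cumulative histogram data.
curl 'http://localhost:9090/api/v1/query' \
  --data-urlencode 'query=histogram_quantile(0.95, rate(some_subsystem_duration_seconds_bucket[5m]))'
```

The same expressions are what a Grafana panel would eventually chart.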
bot merge
Missing process info; check that the PR belongs to a project column. Merge can be attempted if:
See https://github.com/paritytech/parity-processbot#faq
This should assist in determining where the subsystems spend their time, which should help isolate performance issues.
Note: all of these timers are cumulative; they include time spent in other parts of the stack. They observe automatically on drop.
Timers are implemented for:
- availability_distribution
- availability_store
- bitfield_distribution
- bitfield_signing
- candidate_backing
- candidate_selection
- candidate_validation
- chain_api
- collation_generation
- collator_protocol: Does not have any existing metrics; presumed to be uninteresting.
- network_bridge
- pov_distribution
- provisioner
- runtime_api
- statement_distribution

TODO: