
Add Prometheus timers to the subsystems #1923

Merged
coriolinus merged 32 commits into master from prgn-prometheus-timing on Nov 20, 2020

Conversation

@coriolinus (Contributor) commented Nov 5, 2020

This should assist in determining where the subsystems spend their time, which should help isolate performance issues.

Note: all these timers are cumulative timers: they include time spent in other parts of the stack. They observe automatically on drop.
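As a rough sketch of the pattern such a drop-observing timer follows with the `prometheus` crate (the struct and metric names below are illustrative, not necessarily the exact ones used in this PR):

```rust
use prometheus::{Histogram, HistogramOpts, HistogramTimer, Registry};

/// Illustrative subsystem metrics wrapper; `None` means metrics are disabled.
#[derive(Default, Clone)]
pub struct Metrics(Option<MetricsInner>);

#[derive(Clone)]
struct MetricsInner {
    process_message: Histogram,
}

impl Metrics {
    pub fn try_register(registry: &Registry) -> prometheus::Result<Self> {
        let process_message = Histogram::with_opts(HistogramOpts::new(
            "parachain_example_subsystem_process_message",
            "Time spent handling a single incoming message.",
        ))?;
        registry.register(Box::new(process_message.clone()))?;
        Ok(Self(Some(MetricsInner { process_message })))
    }

    /// Returns a guard that observes the elapsed wall-clock time into the
    /// histogram when it is dropped; callers simply keep it alive for the
    /// span they want measured, including any `.await`s inside that span.
    pub fn time_process_message(&self) -> Option<HistogramTimer> {
        self.0.as_ref().map(|inner| inner.process_message.start_timer())
    }
}
```

The observe-on-drop guard is what makes these timers cumulative in the sense above: whatever happens between `start_timer()` and the guard going out of scope is counted.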

Timers are implemented for:

  • availability_distribution
  • availability_store
  • bitfield_distribution
  • bitfield_signing
  • candidate_backing
  • candidate_selection
  • candidate_validation
  • chain_api
  • collation_generation
  • collator_protocol
  • network_bridge (does not have any existing metrics; presumed to be uninteresting)
  • pov_distribution
  • provisioner
  • runtime_api
  • statement_distribution

TODO:

  • Run a local net with a few nodes; collect the statistics in local Grafana; demo the output

@coriolinus added the A3-in_progress (Pull request is in progress. No review needed at this stage.), B0-silent (Changes should not be mentioned in any release notes), and C1-low (PR touches the given topic and has a low impact on builders.) labels Nov 5, 2020
@bkchr (Member) commented Nov 5, 2020

AFAIK Prometheus is not really meant for performance metrics?

https://github.com/tokio-rs/tracing is probably better suited for this?

@coriolinus (Contributor, Author)

I see Prometheus and tracing as answering different questions. Prometheus answers the question "What is taking a while?"; it lets us collect aggregated statistics about the performance characteristics of our async functions. Tracing answers the question "What is the sequence of events that is happening?"

Both can be useful--in particular, tracing is a lot less worried about data cardinality, and so encourages us to tag events by, e.g., the relay parent hash to which they're attached--but given that we already have Prometheus set up and integrated, I'm going to finish this up, then work on tracing as a separate task.

@coriolinus marked this pull request as ready for review November 9, 2020 10:33
@coriolinus (Contributor, Author)

Note that all metrics currently use prometheus::DEFAULT_BUCKETS, which seems broadly applicable. Depending on the data we collect, it might be worth fine-tuning the buckets later.
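For context, `prometheus::DEFAULT_BUCKETS` spans roughly 5 ms to 10 s. If the collected data suggests different boundaries, they can be overridden when the histogram is constructed; a sketch with made-up names, assuming the plain `prometheus` crate API:

```rust
use prometheus::{exponential_buckets, Histogram, HistogramOpts};

fn make_timing_histogram() -> prometheus::Result<Histogram> {
    // Hypothetical custom buckets: 12 exponential buckets starting at 1 ms
    // (1 ms, 2 ms, 4 ms, ... ~2 s) instead of prometheus::DEFAULT_BUCKETS.
    let buckets = exponential_buckets(0.001, 2.0, 12)?;
    Histogram::with_opts(
        HistogramOpts::new(
            "parachain_example_subsystem_run_duration",
            "Illustrative timing metric with custom buckets.",
        )
        .buckets(buckets),
    )
}
```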

@ordian left a comment

Agree with Basti that for identifying bottlenecks tracing-timing fits better.

OTOH, having timing information on a live chain in combination with alerts might be helpful.
Curious to know what @mxinden thinks about this.

If we decide to keep the metrics, we should reduce the amount to the known slowest offenders (not all of them are equally useful) and maybe use a specific suffix for timing metrics.

@mxinden (Contributor) commented Nov 11, 2020

> Agree with Basti that for identifying bottlenecks tracing-timing fits better.
>
> OTOH, having timing information on a live chain in combination with alerts might be helpful.
> Curious to know what @mxinden thinks about this.
>
> If we decide to keep the metrics, we should reduce the amount to the known slowest offenders (not all of them are equally useful) and maybe use a specific suffix for timing metrics.

Is the motivation for this pull request a currently existing CPU bottleneck, or the general goal of catching such bottlenecks quickly in the future? As far as I can tell, the observation timespans of some of these include await points, thus the observed time might be misleading as it could potentially include the time spent by unrelated logic on the CPU. If this is a pressing issue, I would suggest deploying a custom binary with these metrics to see if they are accurate and of any help. In case this should solve CPU issues in the future, I would suggest relying on CPU profiling like flamegraph as well as the already existing task-related metrics such as the polkadot_tasks_polling_duration_* histogram (see the Substrate Service Tasks Grafana dashboards for more).

That said, I don't think these metrics are particularly costly. While histograms are more expensive than counters, this pull request does not use any HistogramVecs.
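To illustrate the cost point: a plain `Histogram` is one fixed set of bucket counters, while a `HistogramVec` duplicates that set for every distinct label value it ever sees. A minimal sketch (metric and label names invented for illustration):

```rust
use prometheus::{Histogram, HistogramOpts, HistogramVec};

fn example() -> prometheus::Result<()> {
    // A single time series with one fixed set of buckets.
    let plain = Histogram::with_opts(HistogramOpts::new(
        "example_request_duration_seconds",
        "Plain histogram: one set of buckets in total.",
    ))?;
    plain.observe(0.042);

    // A full set of buckets *per label value*: cardinality grows with every
    // distinct value of `subsystem` that is observed.
    let per_label = HistogramVec::new(
        HistogramOpts::new(
            "example_request_duration_by_subsystem_seconds",
            "Labelled histogram: buckets are duplicated per label value.",
        ),
        &["subsystem"],
    )?;
    per_label.with_label_values(&["candidate_backing"]).observe(0.042);
    per_label.with_label_values(&["provisioner"]).observe(0.007);
    Ok(())
}
```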

@ordian commented Nov 11, 2020

Thanks for your input @mxinden!

> Is the motivation for this pull request a currently existing CPU bottleneck, or the general goal of catching such bottlenecks quickly in the future?

The latter; having metrics on a live chain will help us identify problems more quickly.

> As far as I can tell, the observation timespans of some of these include await points, thus the observed time might be misleading as it could potentially include the time spent by unrelated logic on the CPU

That's a good point. But observation timespans will give us an understanding of how long e.g. certain requests take end-to-end. I agree, though, that sampling profilers like perf (used by flamegraph) answer how much CPU time is spent in an async task.
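A small sketch of that distinction (tokio is used here only to make the example runnable; the metric name is made up): the histogram timer measures wall-clock time across `.await` points, so it captures end-to-end latency rather than CPU time.

```rust
use std::time::Duration;
use prometheus::{Histogram, HistogramOpts};

async fn handle_request(request_time: &Histogram) {
    // Wall-clock timer: everything until the guard is dropped is counted,
    // including time spent parked at the `.await` below.
    let _timer = request_time.start_timer();
    tokio::time::sleep(Duration::from_millis(50)).await; // stands in for network/disk latency
    // ~0.05 s is observed when `_timer` drops here, even though almost no CPU
    // time was used; a sampling profiler (perf/flamegraph) would show the
    // opposite picture.
}

#[tokio::main]
async fn main() -> prometheus::Result<()> {
    let request_time = Histogram::with_opts(HistogramOpts::new(
        "example_request_duration_seconds",
        "End-to-end (wall-clock) request duration.",
    ))?;
    handle_request(&request_time).await;
    Ok(())
}
```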

@coriolinus added the A0-please_review (Pull request needs code review.) label and removed the A3-in_progress (Pull request is in progress. No review needed at this stage.) label Nov 13, 2020
Instead, get these values with

```sh
target/release/adder-collator export-genesis-state
target/release/adder-collator export-genesis-wasm
```

And then register the parachain on https://polkadot.js.org/apps/?rpc=ws%3A%2F%2F127.0.0.1%3A9944#/explorer

To collect prometheus data, after running the script, create `prometheus.yml` per the instructions
at https://www.notion.so/paritytechnologies/Setting-up-Prometheus-locally-835cb3a9df7541a781c381006252b5ff
and then run:

```sh
docker run -v `pwd`/prometheus.yml:/etc/prometheus/prometheus.yml:z --network host prom/prometheus
```

Demonstrates that data makes it across to prometheus, though it will likely be useful to tweak
the buckets in the future.
@@ -0,0 +1,191 @@
#!/usr/bin/env bash
@drahnr (Contributor) commented Nov 18, 2020

Should we consider writing some rust harness code and verifying the prometheus events? I would prefer that over a complex bash script.

@coriolinus (Contributor, Author)

That's not a bad idea, but unless there's library support for submitting transactions to replace polkadot-js, it doesn't really get us much. Bash is an ugly language, but it's hard to beat it when your task is fundamentally to coordinate launching a bunch of stuff on the command line.

That said, if there's library support in Rust so it's possible to automate the registration, then that's a game-changer. I just don't know of it, if it exists.

Contributor:

I think for now this is ok. Created a tracking issue #1991

Member:

We support sending transactions from Rust tests; there is no need to use polkadot-js for this. Did you take a look at the integration test of the collator?

@coriolinus (Contributor, Author)

Yes, but my intent was to run them standalone for debugging purposes, not as part of the test suite. It's easier to reach for `scripts/adder-collator.sh` than `cargo test -p whatever-module -- --nocapture --no-timeout`, or whatever the proper incantation ends up being.

Contributor:

I don't really agree with that; usually it should always be run with the test suite, not only by some subset of devs who are aware of its existence.
The approach of adding shell scripts imho also doesn't scale well beyond 3.

Member:

And we also have polkadot-launch by @shawntabrizi

@coriolinus requested a review from rphmeier November 19, 2020 10:10
@ordian left a comment

Could you show us a screenshot of what the histogram metrics look like, and resolve the merge conflicts please?

coriolinus and others added 2 commits November 20, 2020 13:41
@coriolinus (Contributor, Author)

> Could you show us a screenshot of what the histogram metrics look like?

Prometheus doesn't render the histogram metrics, unfortunately; that's left to integration with Grafana. I verified that the data is copied into Prometheus, but haven't bothered setting up Grafana for demo purposes: it felt like something that would take a fair amount of time (mostly because I'm unfamiliar with Grafana) without actually showing anything new; Prometheus and Grafana are well known for working together.

What you can do right now is query the tabular data in Prometheus via the `polkadot_*_bucket` fields. This brings up a table showing, for a range of time scales, how many observations took less than that amount of time. Note that this data is cumulative: anything taking less than 0.05 seconds also took less than 0.1 seconds, etc. I'm sure you can do more with it than that, but I am not very fluent in Prometheus queries.
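As a standalone illustration of the cumulative `_bucket` semantics (the names and bucket boundaries here are arbitrary; the real series carry the polkadot_ prefix mentioned above), observing a few values and dumping the registry in the text exposition format shows each `le` bucket counting everything at or below its boundary:

```rust
use prometheus::{Encoder, Histogram, HistogramOpts, Registry, TextEncoder};

fn main() -> prometheus::Result<()> {
    let registry = Registry::new();
    let hist = Histogram::with_opts(
        HistogramOpts::new("example_duration_seconds", "Cumulative bucket demo.")
            .buckets(vec![0.05, 0.1, 0.25]),
    )?;
    registry.register(Box::new(hist.clone()))?;

    hist.observe(0.03); // counted under le="0.05" and every larger bucket
    hist.observe(0.08); // counted under le="0.1" and every larger bucket

    let mut buffer = Vec::new();
    TextEncoder::new().encode(&registry.gather(), &mut buffer)?;
    // Prints lines like:
    //   example_duration_seconds_bucket{le="0.05"} 1
    //   example_duration_seconds_bucket{le="0.1"} 2
    //   example_duration_seconds_bucket{le="0.25"} 2
    //   example_duration_seconds_bucket{le="+Inf"} 2
    println!("{}", String::from_utf8(buffer).expect("text format is valid UTF-8"));
    Ok(())
}
```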

@coriolinus (Contributor, Author)

bot merge

@ghost commented Nov 20, 2020

Missing process info; check that the PR belongs to a project column.

Merge can be attempted if:

  • The PR has approval from two core-devs (or one if the PR is labelled insubstantial).
  • The PR has approval from a member of substrateteamleads.
  • The PR is attached to a project column and has approval from the project owner.

See https://github.com/paritytech/parity-processbot#faq

@bkchr previously requested changes Nov 20, 2020
@coriolinus merged commit e655654 into master Nov 20, 2020
@coriolinus deleted the prgn-prometheus-timing branch November 20, 2020 14:04
ordian added a commit that referenced this pull request Nov 23, 2020
* master:
  Improve collator side of the collator-protocol (#1955)
  add parity-keyring to install instructions (#1993)
  sane messaging defaults (#1994)
  cleanup validator discovery (#1992)
  Add Prometheus timers to the subsystems (#1923)
  Add tracing support to node (#1940)

Labels

A0-please_review (Pull request needs code review.), B0-silent (Changes should not be mentioned in any release notes), C1-low (PR touches the given topic and has a low impact on builders.)
