
Add Prometheus timers to the subsystems #1923

Merged
coriolinus merged 32 commits into master from prgn-prometheus-timing on Nov 20, 2020

Conversation

@coriolinus (Contributor) commented Nov 5, 2020

This should assist in determining where the subsystems spend their time, which should help isolate performance issues.

Note: all these timers are cumulative timers: they include time spent in other parts of the stack. They observe automatically on drop.
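As a rough sketch of the pattern such a drop-observing timer follows with the `prometheus` crate (the struct and metric names below are illustrative, not necessarily the exact ones used in this PR):

```rust
use prometheus::{Histogram, HistogramOpts, HistogramTimer, Registry};

/// Illustrative subsystem metrics wrapper; `None` means metrics are disabled.
#[derive(Default, Clone)]
pub struct Metrics(Option<MetricsInner>);

#[derive(Clone)]
struct MetricsInner {
    process_message: Histogram,
}

impl Metrics {
    pub fn try_register(registry: &Registry) -> prometheus::Result<Self> {
        let process_message = Histogram::with_opts(HistogramOpts::new(
            "parachain_example_subsystem_process_message",
            "Time spent handling a single incoming message.",
        ))?;
        registry.register(Box::new(process_message.clone()))?;
        Ok(Self(Some(MetricsInner { process_message })))
    }

    /// Returns a guard that observes the elapsed wall-clock time into the
    /// histogram when it is dropped; callers simply keep it alive for the
    /// span they want measured, including any `.await`s inside that span.
    pub fn time_process_message(&self) -> Option<HistogramTimer> {
        self.0.as_ref().map(|inner| inner.process_message.start_timer())
    }
}
```

The observe-on-drop guard is what makes these timers cumulative in the sense above: whatever happens between `start_timer()` and the guard going out of scope is counted.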

Timers are implemented for:

  • availability_distribution
  • availability_store
  • bitfield_distribution
  • bitfield_signing
  • candidate_backing
  • candidate_selection
  • candidate_validation
  • chain_api
  • collation_generation
  • collator_protocol
  • network_bridge (does not have any existing metrics; presumed to be uninteresting)
  • pov_distribution
  • provisioner
  • runtime_api
  • statement_distribution

TODO:

  • Run a local net with a few nodes; collect the statistics in local Grafana; demo the output

@coriolinus added the A3-in_progress (Pull request is in progress. No review needed at this stage.), B0-silent (Changes should not be mentioned in any release notes), and C1-low (PR touches the given topic and has a low impact on builders.) labels Nov 5, 2020
@bkchr (Member) commented Nov 5, 2020

AFAIK Prometheus is not really meant for performance metrics?

https://github.com/tokio-rs/tracing is probably better suited for this?

@coriolinus (Contributor, Author)

I see Prometheus and tracing as answering different questions. Prometheus answers the question "What is taking a while?"; it lets us collect aggregated statistics about the performance characteristics of our async functions. Tracing answers the question "What is the sequence of events that is happening?"

Both can be useful--in particular, tracing is a lot less worried about data cardinality, and so encourages us to tag events by, e.g., the relay parent hash to which they're attached--but given that we already have Prometheus set up and integrated, I'm going to finish this up, then work on tracing as a separate task.

@coriolinus marked this pull request as ready for review November 9, 2020 10:33
@coriolinus (Contributor, Author)

Note that all metrics currently use prometheus::DEFAULT_BUCKETS, which seems broadly applicable. Depending on the data we collect, it might be worth fine-tuning the buckets later.
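For context, `prometheus::DEFAULT_BUCKETS` spans roughly 5 ms to 10 s. If the collected data suggests different boundaries, they can be overridden when the histogram is constructed; a sketch with made-up names, assuming the plain `prometheus` crate API:

```rust
use prometheus::{exponential_buckets, Histogram, HistogramOpts};

fn make_timing_histogram() -> prometheus::Result<Histogram> {
    // Hypothetical custom buckets: 12 exponential buckets starting at 1 ms
    // (1 ms, 2 ms, 4 ms, ... ~2 s) instead of prometheus::DEFAULT_BUCKETS.
    let buckets = exponential_buckets(0.001, 2.0, 12)?;
    Histogram::with_opts(
        HistogramOpts::new(
            "parachain_example_subsystem_run_duration",
            "Illustrative timing metric with custom buckets.",
        )
        .buckets(buckets),
    )
}
```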

@ordian left a comment

Agree with Basti that for identifying bottlenecks tracing-timing fits better.

OTOH, having timing information on a live chain in combination with alerts might be helpful.
Curious to know what @mxinden thinks about this.

If we decide to keep the metrics, we should reduce the amount to the known slowest offenders (not all of them are equally useful) and maybe use a specific suffix for timing metrics.

@mxinden (Contributor) commented Nov 11, 2020

> Agree with Basti that for identifying bottlenecks tracing-timing fits better.
>
> OTOH, having timing information on a live chain in combination with alerts might be helpful.
> Curious to know what @mxinden thinks about this.
>
> If we decide to keep the metrics, we should reduce the amount to the known slowest offenders (not all of them are equally useful) and maybe use a specific suffix for timing metrics.

Is the motivation for this pull request a currently existing CPU bottleneck, or the general goal of catching such bottlenecks quickly in the future? As far as I can tell, the observation timespans of some of these include await points, thus the observed time might be misleading as it could potentially include the time spent by unrelated logic on the CPU. If this is a pressing issue, I would suggest deploying a custom binary with these metrics to see if they are accurate and of any help. In case this should solve CPU issues in the future, I would suggest relying on CPU profiling like flamegraph as well as the already existing task-related metrics such as the polkadot_tasks_polling_duration_* histogram (see the Substrate Service Tasks Grafana dashboards for more).

That said, I don't think these metrics are particularly costly. While histograms are more expensive than counters, this pull request does not use any HistogramVecs.
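To illustrate the cost point: a plain `Histogram` is one fixed set of bucket counters, while a `HistogramVec` duplicates that set for every distinct label value it ever sees. A minimal sketch (metric and label names invented for illustration):

```rust
use prometheus::{Histogram, HistogramOpts, HistogramVec};

fn example() -> prometheus::Result<()> {
    // A single time series with one fixed set of buckets.
    let plain = Histogram::with_opts(HistogramOpts::new(
        "example_request_duration_seconds",
        "Plain histogram: one set of buckets in total.",
    ))?;
    plain.observe(0.042);

    // A full set of buckets *per label value*: cardinality grows with every
    // distinct value of `subsystem` that is observed.
    let per_label = HistogramVec::new(
        HistogramOpts::new(
            "example_request_duration_by_subsystem_seconds",
            "Labelled histogram: buckets are duplicated per label value.",
        ),
        &["subsystem"],
    )?;
    per_label.with_label_values(&["candidate_backing"]).observe(0.042);
    per_label.with_label_values(&["provisioner"]).observe(0.007);
    Ok(())
}
```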

@ordian commented Nov 11, 2020

Thanks for your input @mxinden!

> Is the motivation for this pull request a currently existing CPU bottleneck, or the general goal of catching such bottlenecks quickly in the future?

The latter; having metrics on a live chain will help us identify problems more quickly.

> As far as I can tell, the observation timespans of some of these include await points, thus the observed time might be misleading as it could potentially include the time spent by unrelated logic on the CPU

That's a good point. But observation timespans will give us an understanding of how long e.g. certain requests take end-to-end. I agree, though, that sampling profilers like perf (used by flamegraph) answer how much CPU time is spent in an async task.
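A small sketch of that distinction (tokio is used here only to make the example runnable; the metric name is made up): the histogram timer measures wall-clock time across `.await` points, so it captures end-to-end latency rather than CPU time.

```rust
use std::time::Duration;
use prometheus::{Histogram, HistogramOpts};

async fn handle_request(request_time: &Histogram) {
    // Wall-clock timer: everything until the guard is dropped is counted,
    // including time spent parked at the `.await` below.
    let _timer = request_time.start_timer();
    tokio::time::sleep(Duration::from_millis(50)).await; // stands in for network/disk latency
    // ~0.05 s is observed when `_timer` drops here, even though almost no CPU
    // time was used; a sampling profiler (perf/flamegraph) would show the
    // opposite picture.
}

#[tokio::main]
async fn main() -> prometheus::Result<()> {
    let request_time = Histogram::with_opts(HistogramOpts::new(
        "example_request_duration_seconds",
        "End-to-end (wall-clock) request duration.",
    ))?;
    handle_request(&request_time).await;
    Ok(())
}
```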

@coriolinus added the A0-please_review (Pull request needs code review.) label and removed the A3-in_progress (Pull request is in progress. No review needed at this stage.) label Nov 13, 2020
Instead, get these values with

```sh
target/release/adder-collator export-genesis-state
target/release/adder-collator export-genesis-wasm
```

And then register the parachain on https://polkadot.js.org/apps/?rpc=ws%3A%2F%2F127.0.0.1%3A9944#/explorer

To collect prometheus data, after running the script, create `prometheus.yml` per the instructions
at https://www.notion.so/paritytechnologies/Setting-up-Prometheus-locally-835cb3a9df7541a781c381006252b5ff
and then run:

```sh
docker run -v `pwd`/prometheus.yml:/etc/prometheus/prometheus.yml:z --network host prom/prometheus
```

Demonstrates that data makes it across to prometheus, though it will likely be useful to tweak
the buckets in the future.
@@ -0,0 +1,191 @@
#!/usr/bin/env bash
@drahnr (Contributor) commented Nov 18, 2020

Should we consider writing some rust harness code and verifying the prometheus events? I would prefer that over a complex bash script.

@coriolinus (Contributor, Author)

That's not a bad idea, but unless there's library support for submitting transactions to replace polkadot-js, it doesn't really get us much. Bash is an ugly language, but it's hard to beat it when your task is fundamentally to coordinate launching a bunch of stuff on the command line.

That said, if there's library support in Rust so it's possible to automate the registration, then that's a game-changer. I just don't know of it, if it exists.

Contributor:

I think for now this is ok. Created a tracking issue #1991

Member:

We support sending transactions from Rust tests; there is no need to use polkadot-js for this. Did you take a look at the integration test of the collator?

@coriolinus (Contributor, Author)

Yes, but my intent was to run them standalone for debugging purposes, not as part of the test suite. It's easier to reach for `scripts/adder-collator.sh` than `cargo test -p whatever-module -- --nocapture --no-timeout`, or whatever the proper incantation ends up being.

Contributor:

I don't really agree with that; usually it should always be run with the test suite, not only by some subset of devs who are aware of its existence.
The approach of adding shell scripts imho also doesn't scale well beyond 3.

Member:

And we also have polkadot-launch by @shawntabrizi

@coriolinus requested a review from rphmeier November 19, 2020 10:10
@ordian left a comment

Could you show us a screenshot of what the histogram metrics look like, and resolve the merge conflicts please?

coriolinus and others added 2 commits November 20, 2020 13:41
@coriolinus (Contributor, Author)

> Could you show us a screenshot of what the histogram metrics look like?

Prometheus doesn't render the histogram metrics, unfortunately; that's left to integration with Grafana. I verified that the data is copied into Prometheus, but haven't bothered setting up Grafana for demo purposes: it felt like something that would take a fair amount of time (mostly because I'm unfamiliar with Grafana) without actually showing anything new; Prometheus and Grafana are well known for working together.

What you can do right now is query the tabular data in Prometheus via the `polkadot_*_bucket` fields. This brings up a table showing, for a range of time scales, how many observations took less than that amount of time. Note that this data is cumulative: anything taking less than 0.05 seconds also took less than 0.1 seconds, etc. I'm sure you can do more with it than that, but I am not very fluent in Prometheus queries.
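As a standalone illustration of the cumulative `_bucket` semantics (the names and bucket boundaries here are arbitrary; the real series carry the polkadot_ prefix mentioned above), observing a few values and dumping the registry in the text exposition format shows each `le` bucket counting everything at or below its boundary:

```rust
use prometheus::{Encoder, Histogram, HistogramOpts, Registry, TextEncoder};

fn main() -> prometheus::Result<()> {
    let registry = Registry::new();
    let hist = Histogram::with_opts(
        HistogramOpts::new("example_duration_seconds", "Cumulative bucket demo.")
            .buckets(vec![0.05, 0.1, 0.25]),
    )?;
    registry.register(Box::new(hist.clone()))?;

    hist.observe(0.03); // counted under le="0.05" and every larger bucket
    hist.observe(0.08); // counted under le="0.1" and every larger bucket

    let mut buffer = Vec::new();
    TextEncoder::new().encode(&registry.gather(), &mut buffer)?;
    // Prints lines like:
    //   example_duration_seconds_bucket{le="0.05"} 1
    //   example_duration_seconds_bucket{le="0.1"} 2
    //   example_duration_seconds_bucket{le="0.25"} 2
    //   example_duration_seconds_bucket{le="+Inf"} 2
    println!("{}", String::from_utf8(buffer).expect("text format is valid UTF-8"));
    Ok(())
}
```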

@coriolinus (Contributor, Author)

bot merge

@ghost commented Nov 20, 2020

Missing process info; check that the PR belongs to a project column.

Merge can be attempted if:

  • The PR has approval from two core-devs (or one if the PR is labelled insubstantial).
  • The PR has approval from a member of substrateteamleads.
  • The PR is attached to a project column and has approval from the project owner.

See https://github.com/paritytech/parity-processbot#faq

@bkchr previously requested changes Nov 20, 2020
@coriolinus merged commit e655654 into master Nov 20, 2020
@coriolinus deleted the prgn-prometheus-timing branch November 20, 2020 14:04
ordian added a commit that referenced this pull request Nov 23, 2020
* master:
  Improve collator side of the collator-protocol (#1955)
  add parity-keyring to install instructions (#1993)
  sane messaging defaults (#1994)
  cleanup validator discovery (#1992)
  Add Prometheus timers to the subsystems (#1923)
  Add tracing support to node (#1940)

Labels

A0-please_review (Pull request needs code review.), B0-silent (Changes should not be mentioned in any release notes), C1-low (PR touches the given topic and has a low impact on builders.)
