This repository has been archived by the owner on Nov 15, 2023. It is now read-only.

Consolidate subsystem spans so they are all children of the leaf-activated root span #6458

Merged · 104 commits · Mar 31, 2023

Conversation

bredamatt
Contributor

@bredamatt bredamatt commented Dec 20, 2022

See: #6044

Consolidates all spans so they are children of an active leaf span (except for those spans which, by design, cannot be).

This allows us to trace the protocol execution from the moment a new leaf is activated until it is finalized.

A few changes have to be made to allow for this:

  • We need to store spans in some subsystems' state
  • We need to prune spans on block finalized signals to prevent memory leaks in accordance with what the subsystem expects
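
The pruning half of this can be sketched as follows. This is a simplified model, not the actual subsystem code: the `Hash`, `BlockNumber`, and `Span` stand-ins and the map layout are invented for illustration.

```rust
use std::collections::HashMap;

// Stand-ins for the real types: the subsystems use their own state
// structs, real block hashes, and the node's jaeger span type.
type Hash = u64;
type BlockNumber = u32;

#[derive(Debug)]
struct Span; // placeholder for a per-leaf root span

#[derive(Default)]
struct SubsystemState {
    // Spans stored per activated leaf, keyed by hash, together with the
    // block number so they can be pruned on finality.
    spans: HashMap<Hash, (BlockNumber, Span)>,
}

impl SubsystemState {
    // Store the root span when a new leaf is activated.
    fn on_leaf_activated(&mut self, hash: Hash, number: BlockNumber, span: Span) {
        self.spans.insert(hash, (number, span));
    }

    // On a block-finalized signal, drop spans at or below the finalized
    // height so the map cannot grow without bound (the memory leak the
    // PR description refers to).
    fn on_block_finalized(&mut self, finalized: BlockNumber) {
        self.spans.retain(|_, (number, _)| *number > finalized);
    }
}

fn main() {
    let mut state = SubsystemState::default();
    state.on_leaf_activated(0xaa, 10, Span);
    state.on_leaf_activated(0xbb, 11, Span);
    state.on_block_finalized(10);
    assert_eq!(state.spans.len(), 1);
    assert!(state.spans.contains_key(&0xbb));
}
```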

This PR touches upon:

  • approval-voting
  • approval-distribution
  • availability-distribution

The current jaeger crate can create a new jaeger::Span with a traceID based on the candidate hash. However, the candidate hash is not available in every subsystem, so some spans have previously been orphan spans (i.e. spans without a leaf-activated parent). Changes have been made so that a traceID tag can be created during child span creation, given a candidate hash. This allows consolidation under a shared parent span (the leaf-activated span).

The downside of tracing every subsystem under the leaf-activated span, rather than as separate traces (a root span starts a trace, so every orphan span previously had a unique trace ID), is that individual subsystems' spans inherit the traceID of the leaf-activated span. The traceID tag was introduced to give a holistic view of how a candidate flows through the protocol across all subsystems, whilst still allowing us to search for spans by traceID in logs.
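
As a rough illustration of the tag mechanism described above (a toy model, not the node's actual jaeger crate; only the `traceID` tag name is taken from this PR, everything else is invented for the sketch):

```rust
// Toy model of a span carrying string tags; the node's real spans live
// in its jaeger wrapper crate.
#[derive(Clone, Debug, Default)]
struct Span {
    name: String,
    tags: Vec<(String, String)>,
}

impl Span {
    // A child span belongs to the same trace as its parent.
    fn child(&self, name: &str) -> Span {
        Span { name: name.to_string(), tags: self.tags.clone() }
    }

    // Attach a traceID tag derived from the candidate hash, so the span
    // stays a child of the leaf-activated span but remains searchable
    // by candidate.
    fn with_candidate_trace_id(mut self, candidate_hash: u64) -> Span {
        self.tags.push(("traceID".to_string(), format!("{:x}", candidate_hash)));
        self
    }
}

fn main() {
    let leaf_span = Span { name: "leaf-activated".to_string(), ..Default::default() };
    let child = leaf_span
        .child("fetch-task-config")
        .with_candidate_trace_id(0xdeadbeef);
    assert_eq!(child.tags, vec![("traceID".to_string(), "deadbeef".to_string())]);
}
```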

Here is a screenshot of some of the more relevant spans from availability-distribution:
Screenshot 2023-02-24 at 09 48 38

As is noticeable, the fetch-task-config span has a traceID tag.

Similarly, for approval-voting it is possible to see the traceID on a processed wakeup:
Screenshot 2023-02-24 at 11 34 26

More details on the approval-voting span, and how the traceID is used throughout spans can be seen here:
Screenshot 2023-02-24 at 11 36 52

@bredamatt bredamatt self-assigned this Dec 20, 2022
@bredamatt bredamatt linked an issue Dec 20, 2022 that may be closed by this pull request
@bredamatt bredamatt added B0-silent Changes should not be mentioned in any release notes C1-low PR touches the given topic and has a low impact on builders. labels Dec 20, 2022
@mrcnski mrcnski self-requested a review December 30, 2022 14:28
node/core/approval-voting/src/import.rs
node/core/approval-voting/src/lib.rs
// Also spawn or bump tasks for candidates in ancestry in the same session.
for hash in std::iter::once(leaf).chain(ancestors_in_session) {
let span = span
Contributor

I don't think we need this kind of verbosity.

Contributor Author

@bredamatt bredamatt Feb 28, 2023

If we don't include a span here, how do you suggest we pass a span to add_cores() in the loop? Without passing a child span to add_cores() there seems to be no way to trace the execution of FetchTask as a child span of the leaf-activated span.

Contributor

I see what you mean. In that case, the span name sounds misleading.

Contributor

I see that add_cores also creates an additional span. Maybe we can create the span there and just pass it a ref to the request-chunks span we created above.

Contributor Author

Will take a look now :)

Contributor Author

Agreed - naming was off. I renamed this span and the one above to request-chunks-new-head and request-chunks-ancestor accordingly.

Contributor Author

@bredamatt bredamatt Mar 10, 2023

Regarding add_cores also creating an additional span:

  1. The add_cores call is invoked per ancestor hash. If I understood it correctly, we go [SESSION_ANCESTRY_LEN] back when requesting chunks, so we need to associate a request-chunk-ancestor span with each ancestor.
  2. Inside add_cores, there is a loop over all the occupied cores, and for some of the occupied cores the entries are vacant. If an entry is vacant, a new FetchTaskConfig is created, and eventually that FetchTask will run and create the run-fetch-chunk-task span (and its children).

This way we have a relationship like this:

request-chunk-new-head
|
|__ request-chunk-ancestor (ancestor_0)
|   |
|   |__ check-fetch-candidate (occupied_core_0)
|   |__ check-fetch-candidate (occupied_core_1)
|   |   |__ fetch-task-config (when entry for occupied_core is vacant)
|   |       |__ run-fetch-chunk-task
|   |           |__ ++
|   |__ check-fetch-candidate (occupied_core_N)
|
|__ request-chunk-ancestor (ancestor_1)
    |__ check-fetch-candidate (occupied_core_0)
    |__ check-fetch-candidate (occupied_core_1)
    |__ check-fetch-candidate (occupied_core_N)

In this case, I think it makes the most sense to keep the additional span for the added context that it belongs to a particular ancestor.
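
The nested loops that produce this hierarchy can be sketched roughly like so. This is a self-contained toy, not the availability-distribution code: only the span names come from the discussion above, the types and loop shape are invented.

```rust
// Minimal stand-in that records span names and children, just to
// mirror the hierarchy described in the comment above.
#[derive(Debug, Default)]
struct Span {
    name: String,
    children: Vec<Span>,
}

impl Span {
    fn new(name: &str) -> Span {
        Span { name: name.to_string(), children: Vec::new() }
    }

    fn child(&mut self, name: &str) -> &mut Span {
        self.children.push(Span::new(name));
        self.children.last_mut().expect("just pushed; qed")
    }
}

// One request-chunk-ancestor span per ancestor, with one
// check-fetch-candidate span per occupied core underneath it.
fn build_hierarchy(ancestors: usize, occupied_cores: usize) -> Span {
    let mut root = Span::new("request-chunk-new-head");
    for _ancestor in 0..ancestors {
        let ancestor_span = root.child("request-chunk-ancestor");
        for _core in 0..occupied_cores {
            // For vacant entries this is where a FetchTaskConfig would
            // be created, leading to a run-fetch-chunk-task child span.
            ancestor_span.child("check-fetch-candidate");
        }
    }
    root
}

fn main() {
    let root = build_hierarchy(2, 3);
    assert_eq!(root.children.len(), 2);
    assert_eq!(root.children[0].children.len(), 3);
}
```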

Contributor

Got it! Thanks, it will be interesting to see anything related to what is slowing down PoV recovery.

Contributor Author

Definitely!

Contributor

@sandreim sandreim left a comment

Nice! Thanks @bredamatt . One last suggestion, but otherwise LGTM


@@ -281,6 +287,11 @@ impl RunningTask {
continue
},
};
drop(_chunk_fetch_span);

Why do you need to drop manually? Not just here but wondering in general.

Contributor Author

@bredamatt bredamatt Mar 10, 2023

By default, spans are "dropped" when they go out of scope. However, a span is not truly dropped at that point (hence the quotes). Under the hood, there are various ways a "dropped" span can be sent from a tracing agent to a tracing backend, like Grafana Tempo for example. Therefore, you may sometimes want to drop a span manually to control how and when spans are sent for further processing.

In general though, creating spans is a bit of an art, since it is based on how long one wants a particular span to be in scope, which depends on some higher-level understanding of how the codebase works. In this case, the drop is done manually because the _chunk_fetch_span was created with the intention of only capturing the duration of the request sent to get a chunk, and nothing else. Further calls are handled with other spans accordingly, and the outer span (the parent of the _chunk_fetch_span) still concludes when it has itself gone out of scope / been "dropped". In other words, manually "dropping" gives more flexibility / precision with regard to when a span ends.
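
The effect of the manual drop can be illustrated with a toy span that reports itself when dropped. This is a sketch of the general mechanism only, not the real jaeger span type; the recording via a thread-local is invented to stand in for the report sent to the tracing agent.

```rust
use std::cell::RefCell;

thread_local! {
    // Records the order in which spans end, standing in for the report
    // a real span sends to the tracing agent when dropped.
    static ENDED: RefCell<Vec<&'static str>> = RefCell::new(Vec::new());
}

struct TimedSpan {
    name: &'static str,
}

impl Drop for TimedSpan {
    fn drop(&mut self) {
        ENDED.with(|e| e.borrow_mut().push(self.name));
    }
}

// Returns the order in which the spans ended.
fn demo() -> Vec<&'static str> {
    ENDED.with(|e| e.borrow_mut().clear());
    {
        let _outer = TimedSpan { name: "run-fetch-chunk-task" };
        let chunk_fetch_span = TimedSpan { name: "chunk-fetch" };
        // ... send the chunk request and await the response ...
        drop(chunk_fetch_span); // end the span here, not at end of scope
        // ... further work, covered by other spans; `_outer` only ends
        // when it goes out of scope below.
    }
    ENDED.with(|e| e.borrow().clone())
}

fn main() {
    assert_eq!(demo(), vec!["chunk-fetch", "run-fetch-chunk-task"]);
}
```

Without the manual `drop`, both spans would end at the same point and the chunk-fetch span's duration would include the unrelated work after the response.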

Contributor

@bredamatt Documenting that in a comment seems like it would be good! In general, manual drops should be accompanied with some justification IMO.


> By default spans are "dropped" when they go out of scope. However, it is not truly dropped (hence the ""). Under the hood, there are various ways a "dropped" span can be sent to a tracing backend from a tracing agent, like Grafana tempo for example. Therefore, sometimes, you may want to drop a span to control how and when spans are being sent for further processing.

If I get this right, would it be a good idea to wrap the drop in something like send_span_to_XXX(span_thing: YYY) { drop(span_thing); } and then follow @mrcnski's comment about documentation?

Thanks for the explanation though!

Contributor Author

@bredamatt bredamatt Mar 10, 2023

@alexgparity yes, you are right at a high level. This is roughly how it works under the hood, though with more detail depending on which tracing stack is used and how it is configured from a performance-tuning perspective: https://www.jaegertracing.io/docs/1.23/performance-tuning/

In our case, there is an external crate which handles how spans are passed to Grafana Tempo - the mick-jaeger crate, a wrapper around jaeger. Changing mick-jaeger is not really in scope for this PR, since we already use mick-jaeger throughout all subsystems.

Member

@ordian ordian left a comment

Looks nice from a quick glance. I haven't looked at the generated spans, but we can adjust them later if something looks off.

node/core/approval-voting/src/lib.rs
@sandreim
Contributor

bot merge

@paritytech-processbot paritytech-processbot bot merged commit 21c8734 into master Mar 31, 2023
@paritytech-processbot paritytech-processbot bot deleted the bredamatt/add-missing-traces branch March 31, 2023 15:54
ordian added a commit that referenced this pull request Apr 11, 2023
* master: (28 commits)
  Remove years from copyright notes (#7034)
  Onchain scraper in `dispute-coordinator` will scrape `SCRAPED_FINALIZED_BLOCKS_COUNT` blocks before finality (#7013)
  PVF: Minor refactor in workers code (#7012)
  Expose WASM bulk memory extension in execution environment parameters (#7008)
  Co #13699: Remove old calls (#7003)
  Companion for paritytech/substrate#13811 (#6998)
  PR review rules, include all rs files except weights (#6990)
  Substrate companion: Remove deprecated batch verification (#6999)
  Added `origin` to config for `universal_origin` benchmark (#6986)
  Cache `SessionInfo` on new activated leaf in `dispute-distribution` (#6993)
  Update Substrate to fix Substrate companions (#6994)
  Consolidate subsystem spans so they are all children of the leaf-activated root span (#6458)
  Avoid redundant clone. (#6989)
  bump zombienet version (#6985)
  avoid triggering unwanted room_id for the release notifs (#6984)
  Add crowdloan to SafeCallFilter (#6903)
  Drop timers for new requests of active participations (#6974)
  Use `SIGTERM` instead of `SIGKILL` on PVF worker version mismatch (#6981)
  Tighter bound on asset types teleported so that weight is cheaper (#6980)
  staking miner: less aggresive submissions (#6978)
  ...
ordian added a commit that referenced this pull request Apr 12, 2023
ordian added a commit that referenced this pull request Apr 12, 2023
…slashing-client
Development

Successfully merging this pull request may close these issues.

approval-voting: Add missing traces
6 participants