Skip to content
This repository has been archived by the owner on Nov 15, 2023. It is now read-only.

pvf: Update docs for PVF artifacts #6551

Merged
merged 3 commits into from
Apr 19, 2023

Conversation

mrcnski
Copy link
Contributor

@mrcnski mrcnski commented Jan 13, 2023

Closes #3581 (no fix needed there, see comments. Just updating the docs).

@mrcnski mrcnski added A0-please_review Pull request needs code review. B0-silent Changes should not be mentioned in any release notes C1-low PR touches the given topic and has a low impact on builders. D3-trivial 🧸 PR contains trivial changes in a runtime directory that do not require an audit. labels Jan 13, 2023
//!
//! # Lifecycle of an artifact
//!
//! 1. During node start-up, the artifacts cache is cleaned up.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we need to re-prepare all PVF artifacts on each node restart? Or does "clean up" mean the artifacts' build host version matches the current host version?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, all local artifacts are cleared - we'll re-prepare the PVF the first time a new execute request comes in. I'll update the doc to make it more clear.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should imho avoid clearing the artifacts cache when the host did not change. We could recompile only the parachain blocks when the host version did change, then lazily recompile the parathread ones.

Replying here to keep the threads focused. Yeah, thinking about it, I actually don't see why we want to clear the artifacts cache. On start-up, we could instead re-populate the Artifacts table from the compiled artifacts on disk -- the PVF hash should already be in the filename -- and re-start the 24-hour TTL timers for each artifact. Or we could even use the system's last-modified/accessed metadata for the files (with some sanity checks). Then instead of lazily re-compiling the PVFs, we would lazily delete the ones we end up not needing, which seems a lot more efficient. 🙂

@s0me0ne-unkn0wn Would this need coordination with your execution environment PR?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, I didn't see the last comment yet and continued in that thread. Yes, I forgot about 24h timer, it solves the problem. Do I understand correctly that the proposal is to keep artifacts only if the node is not upgraded? In that case, it might work. But we should always purge the artifacts if the node is upgraded.

Copy link
Contributor

@s0me0ne-unkn0wn s0me0ne-unkn0wn Jan 13, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would this need coordination with your execution environment PR?

Special coordination is not needed, I'm already used to merging master to that branch three times a week 😅

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, I didn't see the last comment yet and continued in that thread. Yes, I forgot about 24h timer, it solves the problem. Do I understand correctly that the proposal is to keep artifacts only if the node is not upgraded? In that case, it might work. But we should always purge the artifacts if the node is upgraded.

Yes, makes sense to me!

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

//! 1. If the artifact is already being processed, we add another execution request to the
//! existing preparation job, without starting a new one.
//!
//! 2. Note that if the state is [`ArtifactState::FailedToProcess`], we usually do not retry
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We do not currently have any "I'm taking a long time" messages, so if we send out approval assignments but do artifact builds lazily, then we'll cause no shows, given that builds can take more than the 12? second no show time out.

In theory, we could send messages for "building artifact" and/or "It's slow but I'm here", but @rphmeier wanted to avoid complicating the approval process with such messages, probably a wise decision. We therefore need PVF artifacts to be built in advance, or else we suck up the risk of correlated artifact builds creating de fact escalations.

Copy link
Member

@eskimor eskimor Jan 13, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Indeed, I was wondering about this a couple of times already myself. I think for the time being, preparation is usually pretty fast so there are no issues.

The problem with preparation in advance is, that this will likely result in wasted effort in case of parathreads. As all validators would need to prepare a PVF, although only 30 approval checkers will actually need it. Might be fine.

Other options:

  1. Gather some data on actual preparation times - if they turn out to be very low, we could just enforce a reasonably low value in pre-checking.
  2. Interpret parathread PVFs instead of compiling. Given that they are expected to be executed only rarely, the compiliation effort is likely not worth the trouble anyway. (And for parachains prepare in advance)

Copy link
Contributor

@burdges burdges Jan 13, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should imho avoid clearing the artifacts cache when the host did not change. We could recompile only the parachain blocks when the host version did change, then lazily recompile the parathread ones.

We should've timings of course, but we'll never stop people building wasm blobs that screw up build times intentionally.

Interpreting kinda works. We have consensus upon who gets compiled vs interpreted, so interpreted then runs with different approval time parameters. We could similarly adjust approval parameters to include recompiling parathreads each block. This makes parathreads more expensive and second class though. We could've parathreads that "buy" being compiled in advance like parachains.

We do still have everyone compile the parathread when the PVF initially gets uploaded though, yes? I'd think this suggests parathreads and parachains should be all be precompiled, which just makes uploading a PVF more expensive. Implicitly then host upgrades become relatively more expensive, but this makes sense too.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should imho avoid clearing the artifacts cache when the host did not change.

We'd need an intelligent garbage collector, then. Imagine the node is restarted and has a hundred artifacts in the cache. How do we know which ones we will use and which are stale?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should've timings of course, but we'll never stop people building wasm blobs that screw up build times intentionally.

Why? They would not pass the pre-checking phase, assuming:

Gather some data on actual preparation times - if they turn out to be very low, we could just enforce a reasonably low value in pre-checking.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We'd need to garbage collect PVFs when parachains deregister or when the PVF gets superseded by later PVFs. We could do pre-checking after a PVF upload gets finalized, so then we avoid fork concerns and each parachan has at most two PVFs in the cache. We'd loose the ability to upgrade parachain PVFs when finality stalls though, so system parachain could require some escape hatch here.

As always we pay for optimizations with complexity. Ain't clear how far this should go right now of course. We could stick with the current proposal for now, but make an issue for smarter PVF garbage collection in future.

Why? They would not pass the pre-checking phase, assuming:

It'll be possible to pass the pre-checking but be quite slow compared with average PVF builds.

It'll occasionally be possible to pass the pre-checking on one host, but be abysmal on some host upgrade in the pipeline. We could imho ignore this risk though, so yeah maybe you're right..

Copy link
Contributor Author

@mrcnski mrcnski Jan 13, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As always we pay for optimizations with complexity. Ain't clear how far this should go right now of course. We could stick with the current proposal for now, but make an issue for smarter PVF garbage collection in future.

That does sound like it would introduce some complexity. I guess my question would be, is it really necessary? How much disk space does each compiled PVF actually require? How bad is it to keep old artifacts around? So far, it seems that we have not had issues with the 24-hour TTL of artifacts AFAIK, so my grug brain thinks that we shouldn't introduce unnecessary optimizations. For parathreads I could see the artifacts needing to stay around for longer - but in that case I would just have a longer TTL for those, and not worry about the extra used disk space. 😛

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

All this complexity comes from one scenario: An adversary can create many relay chain forks, so they can upload one artifact on each fork. We support all forks until we know which fork survives.

We've rough consensus on artifact age so we could use duration as a proxy for finality though: We retain all artifacts compatible with the current host, so long as either the artifact is active on some relay chain fork, or else the artifact is less than 24 hours old.

We add some abandon artifact call for artifacts uploaded but not activated, or else force activation at some block height, or something like that.

It's messy to create many relay chain forks without equivocation, so the attack might already result in slashing, which maybe suffices. If you've a run of blocks, then you could've some forks without equivocations, but not too many.

It's maybe just easier to wait for finality and have some override for system parachains, not sure.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@burdges
Copy link
Contributor

burdges commented Jan 13, 2023

Above my comments are not critiques of the text of course, but this is a convenient place to have a discussion that clarifies some design points, and maybe that discussion will be useful here.

@mrcnski
Copy link
Contributor Author

mrcnski commented Mar 23, 2023

Since the discussions here seem to have stalled, I have raised a couple of issues to unblock this PR.

This is a small PR that aims to document the current state of affairs, so discussion about future design can happen in those issues.

@mrcnski mrcnski requested review from burdges and eskimor and removed request for burdges March 23, 2023 12:32
@mrcnski
Copy link
Contributor Author

mrcnski commented Apr 18, 2023

Bump. :)

@mrcnski
Copy link
Contributor Author

mrcnski commented Apr 19, 2023

bot merge

@paritytech-processbot paritytech-processbot bot merged commit 5fd2bf8 into master Apr 19, 2023
@paritytech-processbot paritytech-processbot bot deleted the mrcnski/pvf-update-artifact-docs branch April 19, 2023 16:08
ordian added a commit that referenced this pull request Apr 26, 2023
* master: (30 commits)
  update rocksdb to 0.20.1 (#7113)
  Reduce base proof size weight component to zero (#7081)
  PVF: Move PVF workers into separate crate (#7101)
  Companion for #13923 (#7111)
  update safe call filter (#7080)
  PVF: Don't dispute on missing artifact (#7011)
  XCM: Properly set the pricing for the DMP router (#6843)
  pvf: Update docs for PVF artifacts (#6551)
  Bump syn from 2.0.14 to 2.0.15 (#7093)
  Companion for substrate#13771 (#6983)
  Added Dwellir Nigeria bootnodes. (#7097)
  Companion for Substrate #13889 (#7063)
  Switch to DNS name based bootnodes for Rococo (#7040)
  companion for substrate#13883 (#7059)
  [xcm] Added `UnpaidExecution` instruction to `UnpaidRemoteExporter` (#7091)
  Bump serde_json from 1.0.85 to 1.0.96 (#7072)
  Bump hex-literal from 0.3.4 to 0.4.1 (#7071)
  Small simplification (#7089)
  [XCM - UnpaidRemoteExporter] Remove unreachable code (#7088)
  sync versions with current release (#7083)
  ...
ordian added a commit that referenced this pull request Apr 26, 2023
* master: (39 commits)
  malus: dont panic on missing validation data (#6952)
  Offences Migration v1: Removes `ReportsByKindIndex` (#7114)
  Fix stalling dispute coordinator. (#7125)
  Fix rolling session window (#7126)
  [ci] Update buildah command and version (#7128)
  Bump assigned_slots params (#6991)
  XCM: Remote account converter (#6662)
  Rework `dispute-coordinator` to use `RuntimeInfo` for obtaining session information instead of `RollingSessionWindow` (#6968)
  Revert default proof size back to 64 KB (#7115)
  update rocksdb to 0.20.1 (#7113)
  Reduce base proof size weight component to zero (#7081)
  PVF: Move PVF workers into separate crate (#7101)
  Companion for #13923 (#7111)
  update safe call filter (#7080)
  PVF: Don't dispute on missing artifact (#7011)
  XCM: Properly set the pricing for the DMP router (#6843)
  pvf: Update docs for PVF artifacts (#6551)
  Bump syn from 2.0.14 to 2.0.15 (#7093)
  Companion for substrate#13771 (#6983)
  Added Dwellir Nigeria bootnodes. (#7097)
  ...
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
A0-please_review Pull request needs code review. B0-silent Changes should not be mentioned in any release notes C1-low PR touches the given topic and has a low impact on builders. D3-trivial 🧸 PR contains trivial changes in a runtime directory that do not require an audit.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

PVF validation failures
6 participants