PVF: Consider treating RuntimeConstruction as an internal execution error #661

mrcnski opened this issue Apr 10, 2023 · 14 comments

mrcnski (Contributor) commented Apr 10, 2023

ISSUE

Overview

sc_executor_common::error::Error::RuntimeConstruction should theoretically only happen due to some local issue, and not due to the PVF itself. Therefore, when it arises during execution we may want to treat it as internal (i.e. abstain, instead of voting against and possibly raising a dispute as we do right now). We just have to ensure that RuntimeConstruction is never constructed due to an issue with the PVF, and document it as such on the Substrate side.

We may be able to treat some of the other error variants the same way. This would help to prevent disputes due to local issues.
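
As a rough illustration of the proposed split, here is a minimal sketch with hypothetical type and function names (not the actual polkadot-node-core-pvf API):

```rust
// Hypothetical stand-ins for the relevant error and outcome types.
enum ExecutorError {
    /// Instantiating the compiled artifact failed.
    RuntimeConstruction(String),
    /// The PVF trapped or otherwise misbehaved during execution.
    Trap(String),
}

enum ValidationOutcome {
    /// Vote against the candidate (may raise a dispute).
    Invalid(String),
    /// Local/internal problem: abstain instead of disputing.
    Internal(String),
}

fn classify(err: ExecutorError) -> ValidationOutcome {
    match err {
        // Sound only if pre-checking already constructed the runtime for this PVF,
        // so that a construction failure here points at a local issue (corrupted
        // artifact, resource exhaustion, ...) rather than at the PVF itself.
        ExecutorError::RuntimeConstruction(e) =>
            ValidationOutcome::Internal(format!("runtime construction: {e}")),
        // Failures attributable to the candidate/PVF still produce a negative vote.
        ExecutorError::Trap(e) => ValidationOutcome::Invalid(e),
    }
}
```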

Important note 1

Credit to @eskimor for this section.

In general, we have to be very careful with errors that do not raise a dispute. This is only sound if we either:

  1. Have ruled out that error in pre-checking. If something is not checked in pre-checking, we must raise a dispute, even if it is candidate-independent.
  2. Are 100% confident that it is a hardware/local issue, like a corrupted file, etc.

Reasoning: otherwise it would be possible to register a PVF whose candidates cannot be checked, yet no dispute is raised, so nobody gets punished. Second, we would end up with a finality stall that is never going to resolve!

Important note 2

See #661 (comment).

Related issue

Somewhat related issue about WasmError::Other variant (which gets turned into Error::RuntimeConstruction):
paritytech/substrate#13853

burdges commented Apr 11, 2023

We'll no-show when this happens, I guess. We should double-check what happens when we run out of tranches, though.

We run out of tranches when all validators are checking anyway, so de facto it should collapse into 2/3 votes at some point before this. We need roughly one no-show per tranche though, and each tranche consumes about 2.25 of the checkers (ignoring the zeroth tranche), so we would have exhausted roughly 1/2.25 ≈ 0.444 of the validator set as no-shows, at which point nothing could be approved anyway.
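
Spelling out that arithmetic for a validator set of size n, under the assumptions above (roughly one no-show per tranche, about 2.25 checkers consumed per tranche):

```latex
\text{no-shows} \approx \#\text{tranches} \approx \frac{n}{2.25} \approx 0.444\,n
```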

Sophia-Gold transferred this issue from paritytech/polkadot Aug 24, 2023

eagr (Contributor) commented Nov 3, 2023

Trying to make sense of this; please correct me if I'm mistaken. :) @mrcnski

We want to avoid raising WasmError for PVF issues on the Polkadot side, right? But we would want WasmError for related internal issues, because that's what turns into a RuntimeConstruction error.

```rust
pub unsafe fn create_runtime_from_artifact_bytes(
    compiled_artifact_blob: &[u8],
    executor_params: &ExecutorParams,
) -> Result<WasmtimeRuntime, WasmError> {
    let mut config = DEFAULT_CONFIG.clone();
    config.semantics =
        params_to_wasmtime_semantics(executor_params).map_err(|err| WasmError::Other(err))?;
    // ...
}
```

For that purpose, this one in particular shouldn't be a WasmError, correct?

mrcnski (Contributor, Author) commented Nov 5, 2023

Wow, good catch. It's not even a problem with the PVF, but with the executor parameters, which come from on-chain. So if somehow there are invalid exec params on-chain, it would break here for everybody.

We do have some checks for exec parameter correctness (do we check that they can be converted to wasmtime semantics?), and besides that there's not much we can do - in the words of @s0me0ne-unkn0wn, if the network wants to commit suicide, it will find a way. So we're kind of borked here anyway, and it doesn't make too much difference how we handle this, but we might as well raise a sensible error. Raising an internal error here doesn't make sense because it's not a local issue, but an issue with the chain. Maybe even panicking is appropriate, because it seems like we can only get to this state with a suicided chain (game over anyway), or some developer bug.

We want to avoid raising WasmError for PVF issues on the Polkadot side, right? But we would want WasmError for related internal issues, because that's what turns into a RuntimeConstruction error.

Yes, exactly! 🚀

eagr (Contributor) commented Nov 5, 2023

Maybe even panicking is appropriate, because it seems like we can only get to this state with a suicided chain (game over anyway), or some developer bug.

I see. This is much more serious than I thought. I think it justifies panicking either way.

if the network wants to commit suicide, it will find a way.

Nicely said. 🤣

s0me0ne-unkn0wn (Contributor) commented:

Note that currently, params_to_wasmtime_semantics() may only return an error if there's a serious bug in the code (i.e. the default stack limit being None in DEFAULT_CONFIG, even though the comment at that constant explicitly prohibits changing anything). So I believe it's okay to panic!(), or unreachable!(), or whatever. But I'd go for changing .map_err() to .expect() tbh; it seems to me that's exactly how it should be.
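
A minimal sketch of that suggestion, continuing the excerpt quoted earlier (the expect message is illustrative):

```rust
let mut config = DEFAULT_CONFIG.clone();
// If this ever fails, DEFAULT_CONFIG itself is broken (a developer bug), not the
// PVF and not the local machine, so surfacing it as a WasmError is misleading.
config.semantics = params_to_wasmtime_semantics(executor_params)
    .expect("DEFAULT_CONFIG is never changed and executor params are validated; qed");
```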

eagr (Contributor) commented Nov 5, 2023

Or we panic in params_to_wasmtime_semantics()? I don't see a good reason not to exit early. Is there?

s0me0ne-unkn0wn (Contributor) commented:

Yes, sounds good, and in that case, params_to_wasmtime_semantics() should always return plain Semantics.
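
A sketch of that variant (the body is abbreviated and illustrative; the real function lives in the PVF executor code):

```rust
// Returning plain `Semantics`: any failure here is a developer bug, so we panic
// instead of threading an error type through every caller.
fn params_to_wasmtime_semantics(par: &ExecutorParams) -> Semantics {
    let mut sem = DEFAULT_CONFIG.semantics.clone();
    let stack_limit = sem
        .deterministic_stack_limit
        .as_mut()
        .expect("DEFAULT_CONFIG sets a deterministic stack limit and must not be changed; qed");
    // ... apply each `ExecutorParam` to `sem` / `stack_limit` ...
    sem
}
```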

mrcnski (Contributor, Author) commented Nov 19, 2023

Per the original issue, we now check for runtime construction errors during prechecking, so it is sound to treat this as an internal error in execution. We just have to make sure to do the runtime construction outside of the execution job process. Since that process runs untrusted code, malicious code could return any error it wants, so we can't treat any error returned by that process as an internal/local error.
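
A sketch of the intended flow, with hypothetical names standing in for the real execute-worker code:

```rust
// Illustrative sketch only; types and names are hypothetical stand-ins for the
// real execute-worker code in polkadot-node-core-pvf.
enum WorkerResponse {
    Ok(Vec<u8>),
    InvalidCandidate(String),
    InternalError(String),
}

struct Runtime; // stand-in for the instantiated wasm runtime

fn construct_runtime(_artifact: &[u8]) -> Result<Runtime, String> {
    // Stand-in for create_runtime_from_artifact_bytes(): runs in the trusted worker.
    Ok(Runtime)
}

fn run_job_process(_rt: Runtime, _params: &[u8]) -> WorkerResponse {
    // Stand-in for the sandboxed job process that executes the untrusted PVF.
    WorkerResponse::Ok(Vec::new())
}

fn handle_execute_request(artifact: &[u8], params: &[u8]) -> WorkerResponse {
    // Construct the runtime *before* handing anything to the untrusted job process.
    // A failure here happened entirely in trusted code, so it is sound to report
    // it as an internal error (no dispute).
    let rt = match construct_runtime(artifact) {
        Ok(rt) => rt,
        Err(e) => return WorkerResponse::InternalError(format!("runtime construction: {e}")),
    };
    // Anything reported back by the job process may be attacker-controlled and
    // must therefore never be mapped to an internal/local error.
    run_job_process(rt, params)
}
```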

eagr (Contributor) commented Nov 20, 2023

After #2406, what's left to be done?

mrcnski (Contributor, Author) commented Nov 20, 2023

Hey @eagr! I think just implementing this, i.e. catching this case during execution. And keeping in mind this part:

it is sound to treat this as an internal error in execution. We just have to make sure to do the runtime construction outside of the execution job process.

claravanstaden added a commit to Snowfork/polkadot-sdk that referenced this issue Dec 8, 2023

mrcnski (Contributor, Author) commented Jan 9, 2024

@s0me0ne-unkn0wn Copying my concern from #2871:

This seemed sound because we now do runtime construction in prechecking. However, I realized that the exec params might have changed between prechecking and execution, and if they changed in such a way that the runtime construction now fails, abstaining would stall finality.

We don't currently have any exec params that are applicable to runtime construction, but we might in the future?

s0me0ne-unkn0wn (Contributor) commented:

@mrcnski Well, on the one hand, it's a valid concern; on the other hand, we have said many times that a new set of executor params should never be more restrictive than the old one was. Currently, we don't have a mechanism to unsee changes in executor params: if a new Wasm extension has been enabled, it's enabled forever; if a new stack size has been enforced, it will never shrink again, it can only grow further (btw, it's a good idea to enforce such things in executor-params consistency checks). And we discussed those measures precisely because we don't want some PVF that is already on-chain to cease compiling/instantiating/executing with a new set of params.

In that sense, we're okay here. Unless that rule is broken, nothing bad can happen. If it's broken, well... If the network wants to commit suicide... You know.
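
A hedged sketch of the kind of consistency check mentioned above (field and function names are illustrative, not the actual ExecutorParams API):

```rust
// Hypothetical, simplified stand-in for ExecutorParams; not the actual type.
struct Params {
    stack_logical_max: u32,
    wasm_bulk_memory_enabled: bool,
}

/// Rejects a params update that is more restrictive than the previous set in any
/// dimension, since that could make an already-registered PVF stop instantiating
/// or executing between pre-checking and execution.
fn check_not_more_restrictive(old: &Params, new: &Params) -> Result<(), String> {
    if new.stack_logical_max < old.stack_logical_max {
        return Err("the stack limit may only grow".into());
    }
    if old.wasm_bulk_memory_enabled && !new.wasm_bulk_memory_enabled {
        return Err("a wasm extension, once enabled, stays enabled".into());
    }
    Ok(())
}
```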

mrcnski (Contributor, Author) commented Jan 10, 2024

Thanks! I understand the point about being more restrictive, but is e.g. changing the instantiation strategy disallowed? Perhaps we discover some awesome new instantiation strategy and want to enact it through exec params. But if that can't happen, then implementing this ticket should be sound.

s0me0ne-unkn0wn (Contributor) commented:

I believe that if we ever decide to change something internal, like the instantiation strategy, we should only do that after checking that all the on-chain PVFs survive the change (AFAIR @ordian was developing some tooling for that).
