quick validator key swap #11264
The plan is to start with:
If that looks fine, then:
2024-05-15 (Wednesday) Update
Changes: Working on extending existing
@saketh-are Is there anything that we need to do on the networking side? For example, announce that a different node has the validator keys?
I don't believe any changes will be needed on the networking side. Every node in the network has either 1 or 2 public keys associated with it: every node has a `PeerId` (derived from its node key), and validator nodes additionally have an `AccountKey` (the validator public key).
Every node maintains in memory a mapping of the AccountKey -> PeerId relationships it is aware of. There is some logic already implemented which allows the mapping to be updated if a validator key starts to be hosted by a different peer id. The things to make sure of are:
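For illustration, here is a minimal sketch of the kind of mapping update described above. This is not nearcore's actual implementation; `AccountKeyMap` and the type aliases are hypothetical stand-ins:

```rust
use std::collections::HashMap;

// Hypothetical stand-ins for the network-layer types.
type AccountKey = String; // validator's public key
type PeerId = String;     // node's network identity

/// Tracks which peer currently hosts each validator (account) key.
struct AccountKeyMap {
    routes: HashMap<AccountKey, PeerId>,
}

impl AccountKeyMap {
    fn new() -> Self {
        Self { routes: HashMap::new() }
    }

    /// Record a (possibly new) AccountKey -> PeerId association.
    /// Returns the previous peer if the key moved to a different node,
    /// which is exactly what happens during a validator key hot swap.
    fn update(&mut self, key: AccountKey, peer: PeerId) -> Option<PeerId> {
        match self.routes.insert(key, peer.clone()) {
            Some(old) if old != peer => Some(old), // key migrated between peers
            _ => None,                             // first sighting, or unchanged
        }
    }
}
```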
2024-05-21 (Tuesday) Update
However, the chain is stuck after the hotswap. Investigating why.
Part of: #11264

This PR should be a no-op:
- convert the `Signer` and `ValidatorSigner` traits into enums
- wrap `ValidatorSigner` with `MutableConfigValue`

`MutableConfigValue` requires its contents to implement the `PartialEq` and `Clone` traits. These traits are not object safe, so they cannot be required by the `ValidatorSigner` trait. We therefore convert the `ValidatorSigner` trait into an enum, and similarly the `Signer` trait. That is also the solution recommended by Rust: rust-lang/rust#80194. Unfortunately, this change touches an enormous number of places, because the existing code mostly used `InMemory(Validator)Signer` directly instead of `dyn (Validator)Signer`. To minimize churn, I added impls like `From<InMemoryValidatorSigner> for ValidatorSigner` so that adding `.into()` suffices in most cases.
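To illustrate the trait-to-enum conversion, here is a minimal sketch; the fields are simplified and the names only loosely follow the real types:

```rust
// `PartialEq` and `Clone` are not object safe: `eq(&self, other: &Self)` and
// `clone(&self) -> Self` both mention `Self`, so `dyn ValidatorSigner` cannot
// require them. Wrapping the concrete implementations in an enum sidesteps this.

#[derive(Clone, PartialEq)]
struct InMemoryValidatorSigner {
    account_id: String, // simplified stand-in for the real fields
}

#[derive(Clone, PartialEq)]
enum ValidatorSigner {
    InMemory(InMemoryValidatorSigner),
    // Other variants (e.g. a test/empty signer) could live here as well.
}

// The `From` impl lets call sites keep constructing the concrete signer
// and just append `.into()`, minimizing churn across the codebase.
impl From<InMemoryValidatorSigner> for ValidatorSigner {
    fn from(signer: InMemoryValidatorSigner) -> Self {
        ValidatorSigner::InMemory(signer)
    }
}
```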
Issue: #11264

This is a follow-up to #11372. The actual changes (+test) for #11264 will be done in a third, final PR: #11536.

### Summary
This PR should mostly be a no-op. It focuses on propagating `MutableConfigValue` for `validator_signer` everywhere. All instances of the mutable `validator_signer` are synchronized. Where only the validator_id is needed, we propagate `validator_signer` anyway, as it contains the current validator info.

### Extra changes
- Remove the signer as a field and pass it to methods instead: `Doomslug`, `InfoHelper`, `ChunkValidator`.
- Make some public methods internal where they do not need to be public.
- Split `process_ready_orphan_witnesses_and_clean_old` into two functions.
- Remove `block_production_started` from `ClientActorInner`.
- Add `FrozenValidatorConfig` to make it possible to return a snapshot of `ValidatorConfig`.

Co-authored-by: Your Name <[email protected]>
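A minimal sketch of what a `MutableConfigValue`-style cell could look like, assuming a simple `Arc<Mutex<T>>` design (the real type likely differs in details such as logging and metrics). All clones share the same cell, which is what lets every component holding the validator signer observe a key swap without a restart:

```rust
use std::sync::{Arc, Mutex};

/// Minimal sketch of a shared, updatable config value.
#[derive(Clone)]
struct MutableConfigValue<T: Clone> {
    value: Arc<Mutex<T>>,
}

impl<T: Clone> MutableConfigValue<T> {
    fn new(value: T) -> Self {
        Self { value: Arc::new(Mutex::new(value)) }
    }

    /// Snapshot the current value.
    fn get(&self) -> T {
        self.value.lock().unwrap().clone()
    }

    /// Replace the value; the change is visible to every clone of this handle.
    fn update(&self, new_value: T) {
        *self.value.lock().unwrap() = new_value;
    }
}

// The alias used in the final PR would then look roughly like:
// type MutableValidatorSigner = MutableConfigValue<Option<Arc<ValidatorSigner>>>;
```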
Issue: #11264

This PR builds on:
* #11372
* #11400

and contains the actual changes and test for validator key hot swap.

### Summary
- Extend `UpdateableConfig` with the validator key.
- Update the client's mutable validator key when we detect that it changed in the updateable config.
- Advertise our new validator key through `advertise_tier1_proxies()`.
- Add an integration test for the new behaviour:
  - We start with 2 validating nodes (`node0`, `node1`) and 1 non-validating node (`node2`). It is important that the non-validating node tracks all shards, because we do not know which shard it will track when we switch validator keys.
  - We copy the validator key from `node0` to `node2`.
  - We stop `node0`, then we trigger a validator key reload for `node2`.
  - Now `node2` is a validator, but it appears as `node0` because it copied `node0`'s validator key.
  - We wait for a couple of epochs and require that both remaining nodes progress the chain. Both nodes should be synchronised after a few epochs.

Test with:
```
cargo build -pneard --features test_features,rosetta_rpc && cargo build -pgenesis-populate -prestaked -pnear-test-contracts && python3 pytest/tests/sanity/validator_switch_key_quick.py
```

#### Extra changes
- Use the `MutableValidatorSigner` alias instead of `MutableConfigValue<Option<Arc<ValidatorSigner>>>`.
- Return `ConfigUpdaterResult` from the config updater.
- Remove (de)serialization derives for `UpdateableConfigs`.

Co-authored-by: Your Name <[email protected]>
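Putting the pieces together, here is a hedged sketch of the reload path described above; `apply_updateable_config` and the simplified types are hypothetical stand-ins, not the actual nearcore API:

```rust
use std::sync::{Arc, Mutex};

// Simplified stand-ins for the real types; see the sketches above.
#[derive(PartialEq)]
struct ValidatorSigner {
    account_id: String,
}

#[derive(Clone)]
struct MutableConfigValue<T: Clone>(Arc<Mutex<T>>);

impl<T: Clone + PartialEq> MutableConfigValue<T> {
    /// Swap in a new value; returns true if it actually changed.
    fn update(&self, new: T) -> bool {
        let mut guard = self.0.lock().unwrap();
        if *guard != new {
            *guard = new;
            true
        } else {
            false
        }
    }
}

/// Hypothetical hook called after the updateable config is re-read from disk.
/// If the validator key changed, install it and re-announce the node under
/// its new validator identity.
fn apply_updateable_config(
    signer: &MutableConfigValue<Option<Arc<ValidatorSigner>>>,
    reloaded: Option<Arc<ValidatorSigner>>,
) {
    if signer.update(reloaded) {
        advertise_tier1_proxies();
    }
}

fn advertise_tier1_proxies() {
    // Network-layer announcement of the AccountKey -> PeerId change, elided.
}
```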
2024-06-21 (Friday) Update
Confirmed it works fine in forknet20, works with #11689 too.
Make it possible to quickly move validator keys from one node to another. The goal is to allow validators to perform common maintenance operations with no downtime.
In a typical scenario a validator needs to restart their node. The reason may be a new neard release, a need to update configs, or an issue that is causing the node to misbehave. In such circumstances the node operator runs two nodes: the old one with the validator keys, and the new one, to get it warm and ready. Once the new node is ready, the operator stops the old node, moves the validator keys to the new node and restarts the new node. Unfortunately, restarting the node may take some time, and this will get worse once memtrie is released. This issue is about making it possible to move the validator keys from one node to another quickly.
It's important to make sure that no two nodes have the validator keys at the same time, as both would produce blocks and chunks, which can be considered malicious behaviour.
So far the best ideas to implement this are:
- set the start height in the future and stop the old node right before that height.
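A minimal sketch of the start-height idea, assuming a hypothetical `ValidatorKeyConfig` with a `start_height` field:

```rust
/// Sketch of the "start height" idea: the new node only begins signing once
/// the chain reaches a configured height, so the operator can stop the old
/// node just before that height and avoid both nodes ever signing at once.
struct ValidatorKeyConfig {
    /// First height at which this node may sign with the validator key.
    start_height: u64,
}

fn may_sign(config: &ValidatorKeyConfig, current_height: u64) -> bool {
    current_height >= config.start_height
}

fn main() {
    let config = ValidatorKeyConfig { start_height: 1_000_000 };
    assert!(!may_sign(&config, 999_999)); // old node still active
    assert!(may_sign(&config, 1_000_000)); // new node takes over
}
```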