
enhance consensus key rotation support #13926

Merged 1 commit into main on Aug 22, 2024

Conversation

@zjma zjma commented Jul 7, 2024

Description

Consensus key rotation is not well-supported today

Today, a validator only uses a single consensus key, and the only way to update that key is to edit the node config and restart the node. This leads to two problems:

  • When a validator updates its consensus pk on chain in epoch e and then updates the local consensus sk in epoch e+1, incorrect augmented keys for randomness have already been persisted at the beginning of epoch e+1, and the node will crash on restart (until epoch e+2, when the incorrect data is cleared).
  • When a validator updates its consensus pk on chain in epoch e and also updates the local consensus sk in epoch e, the node won't be able to participate in consensus or randomness until epoch e+1.

Proposed user flow

  • Stop the node.
  • In addition to the old identity, specify the new key in the node config (a sketch of the resulting config value follows this list).
  • Restart the node (the validator should still function with the old key).
  • Run a txn to update the on-chain consensus pk for the next epoch.
  • When the next epoch arrives, the validator should automatically switch to the new key without any operator action or downtime.
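
For illustration, here is a minimal Rust sketch of what the override amounts to at the config level. It uses a simplified copy of the InitialSafetyRulesConfig::FromFile variant touched by this PR (see the diff further down); the file paths are hypothetical, not prescribed by this change.

```rust
use std::path::PathBuf;

// Simplified copy of the config variant touched by this PR; the real type lives in
// the node config crate and carries serde attributes.
pub enum InitialSafetyRulesConfig {
    FromFile {
        identity_blob_path: PathBuf,
        overriding_identity_paths: Vec<PathBuf>,
    },
}

fn example_config() -> InitialSafetyRulesConfig {
    InitialSafetyRulesConfig::FromFile {
        // Existing identity: the node keeps running with the old key after restart.
        identity_blob_path: PathBuf::from("/opt/aptos/genesis/validator-identity.yaml"),
        // New identity (hypothetical path): picked up automatically at the next epoch
        // once the on-chain consensus pk has been updated.
        overriding_identity_paths: vec![PathBuf::from(
            "/opt/aptos/genesis/validator-identity-new.yaml",
        )],
    }
}

fn main() {
    let _cfg = example_config();
}
```

The point is simply that the old identity stays in place while the new one is listed alongside it as an override.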

The code changes

  • Add a node config entry, NodeConfig::ConsensusConfig::SafetyRulesConfig::InitialSafetyRulesConfig::FromFile::overriding_identity_paths, which holds the new validator identity (or identities).
  • Each time the secure storage is (re-)initialized (typically on every node restart), the private key of every validator identity given by the entry above is saved into the secure storage under the key consensus_<PK_HEX>, where PK_HEX is the hex-encoded public key.
  • At the beginning of a new epoch, every consensus-key-dependent component now looks up the consensus sk in the secure storage under consensus_<X>, where X is its hex-encoded on-chain validator pk. If no such entry exists, it falls back to the value under the key consensus (set by the existing implementation); a sketch of this lookup follows this list.
  • Add a smoke test, consensus_key_rotation, to capture the user flow.
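
To make the lookup order concrete, here is a minimal sketch of the fallback described above, written against a hypothetical key-value view of the secure storage rather than the real safety-rules storage API; only the consensus_<PK_HEX> naming scheme and the fallback to the legacy consensus entry are taken from this PR.

```rust
use std::collections::HashMap;

/// Storage name for a specific consensus key, as described above.
fn per_key_storage_name(pk_hex: &str) -> String {
    format!("consensus_{}", pk_hex)
}

/// Look up the consensus secret key for the current epoch:
/// first by the on-chain public key, then fall back to the legacy single entry.
fn find_consensus_sk<'a>(
    storage: &'a HashMap<String, Vec<u8>>, // stand-in for the secure storage
    on_chain_pk_hex: &str,
) -> Option<&'a Vec<u8>> {
    storage
        .get(&per_key_storage_name(on_chain_pk_hex))
        .or_else(|| storage.get("consensus"))
}

fn main() {
    let mut storage = HashMap::new();
    // Entry written when the secure storage is (re-)initialized with an overriding identity.
    storage.insert(per_key_storage_name("ab12"), vec![1, 2, 3]);
    // Legacy entry written by the existing implementation.
    storage.insert("consensus".to_string(), vec![9, 9, 9]);

    // The on-chain pk decides which entry is used; unknown pks fall back to "consensus".
    assert_eq!(find_consensus_sk(&storage, "ab12"), Some(&vec![1, 2, 3]));
    assert_eq!(find_consensus_sk(&storage, "cd34"), Some(&vec![9, 9, 9]));
}
```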

Type of Change

  • New feature
  • Bug fix
  • Refactoring

Which Components or Systems Does This Change Impact?

  • Validator Node

How Has This Been Tested?

A smoke test case.

Key Areas to Review

The entire PR.

Checklist

  • I have read and followed the CONTRIBUTING doc
  • I have performed a self-review of my own code
  • I have commented my code, particularly in hard-to-understand areas
  • I identified and added all stakeholders and component owners affected by this change as reviewers
  • I tested both happy and unhappy path of the functionality
  • I have made corresponding changes to the documentation

trunk-io bot commented Jul 7, 2024

⏱️ 8h 18m total CI duration on this PR
Job Cumulative Duration Recent Runs
test-fuzzers 2h 28m 🟩🟩🟩🟩
execution-performance / single-node-performance 1h 28m 🟩🟩🟩
forge-e2e-test / forge 44m 🟩🟩🟥
forge-compat-test / forge 42m 🟩🟩🟩
rust-images / rust-all 35m 🟩🟩🟩
forge-framework-upgrade-test / forge 32m 🟩🟩
execution-performance / test-target-determinator 23m 🟩🟩🟩
test-target-determinator 15m 🟩🟩🟥
rust-move-tests 15m 🟩
rust-move-tests 14m 🟩
rust-move-tests 14m 🟩
check 13m 🟩🟩🟩
general-lints 5m 🟩🟩🟩
check-dynamic-deps 4m 🟩🟩🟩🟩
semgrep/ci 2m 🟩🟩🟩🟩
file_change_determinator 45s 🟩🟩🟩🟩
file_change_determinator 41s 🟩🟩🟩🟩
file_change_determinator 31s 🟩🟩🟩
permission-check 13s 🟩🟩🟩
determine-docker-build-metadata 13s 🟩🟩🟩
permission-check 13s 🟩🟩🟩🟩
permission-check 11s 🟩🟩🟩🟩
permission-check 10s 🟩🟩🟩🟩
permission-check 9s 🟩🟩🟩🟩
rust-move-tests 8s

🚨 2 jobs on the last run were significantly faster/slower than expected

Job | Duration | vs 7d avg | Delta
execution-performance / test-target-determinator | 7m | 5m | +56%
execution-performance / single-node-performance | 28m | 20m | +37%


@zjma zjma marked this pull request as ready for review July 7, 2024 23:03
@zjma zjma enabled auto-merge (squash) July 7, 2024 23:03

@zjma zjma changed the title from "consensus key rotation support" to "enhance consensus key rotation support" on Jul 7, 2024

@zjma zjma requested a review from danielxiangzl August 1, 2024 21:21

@JoshLind JoshLind left a comment

Code seems reasonable to me.

Only big question is where are we planning to document this flow? It seems pretty challenging to get right without detailed instructions 😄

tokio::time::sleep(Duration::from_secs(5)).await;
(operator_addr, new_pk, pop, operator_idx)
} else {
unreachable!()

Contributor:

nit: maybe just unwrap instead of checking and then doing unreachable!()?

Contributor:

Still think the if-let is superfluous 😄

testsuite/smoke-test/src/consensus_key_rotation.rs (outdated thread; resolved)
consensus/src/consensus_observer/observer.rs (outdated thread; resolved)
consensus/safety-rules/src/persistent_safety_storage.rs (outdated thread; resolved)
consensus/safety-rules/src/persistent_safety_storage.rs (outdated thread; resolved)
}
}

pub fn overriding_identity_blob_paths_mut(&mut self) -> &mut Vec<PathBuf> {

Contributor:

nit: Maybe make this test-only? It seems like it's only used in the smoke tests? Or better yet, don't expose this here if possible?

Contributor (Author):

fixed.

NOTE that we can't reuse feature testing as crate smoke-test needs it disabled in addition to overriding_identity_blob_paths_mut.

Contributor:

Aah, in that case, don't worry about it -- you can remove the smoke-test feature. I was just wondering if it was a quick change 😄 Seems like overkill for a new feature.

@@ -123,15 +123,22 @@ impl ConfigSanitizer for SafetyRulesConfig {
pub enum InitialSafetyRulesConfig {
FromFile {
identity_blob_path: PathBuf,
#[serde(skip_serializing_if = "Vec::is_empty", default)]
overriding_identity_paths: Vec<PathBuf>,

Contributor:

Does this need to be a vector? Will 1 path not suffice (if the operator wants to rotate again, they need to move the new key to the old key location, and then add a single override). Would that work?

Contributor (Author):

In theory 2 slots (1 existing, 1 new) are enough. I just found that n slots can be supported for free by replacing Option<PathBuf> with Vec<PathBuf>...

@zjma zjma commented Aug 1, 2024

> Code seems reasonable to me.
>
> Only big question is where are we planning to document this flow? It seems pretty challenging to get right without detailed instructions 😄

@JoshLind I plan to update the docs after this is merged.

@zekun000 zekun000 left a comment

Looks reasonable; let's replace the debug logs with meaningful logs. We may want to support truncating old keys too, but that can be done later.

.expect("Failed in loading consensus key for ExecutionProxyClient.");
let signer = Arc::new(ValidatorSigner::new(self.author, consensus_key));
let consensus_sk = maybe_consensus_key
.expect("consensus key unavailable for ExecutionProxyClient");

Contributor:

given we just panic if the option doesn't exist, we should probably pass in Arc directly?

Contributor (Author):

We unwrap only if the current node is in the validator set.
For non-validators, I guess we can't unwrap outside...

) -> CliTypedResult<TransactionSummary> {
UpdateConsensusKey {
txn_options: self.transaction_options(operator_index, None),
txn_options: self.transaction_options(operator_index, gas_options),

Contributor:

how's this change related?

Contributor (Author):

needed for a smoke test to be less flaky.

✅ Forge suite realistic_env_max_load success on bf203070be3e07b0a72aa1ab31c1106dfe2ffa70

two traffics test: inner traffic : committed: 11763.51 txn/s, latency: 3383.08 ms, (p50: 3200 ms, p90: 3800 ms, p99: 11700 ms), latency samples: 4472800
two traffics test : committed: 99.99 txn/s, latency: 2854.56 ms, (p50: 2500 ms, p90: 3700 ms, p99: 8100 ms), latency samples: 1760
Latency breakdown for phase 0: ["QsBatchToPos: max: 0.258, avg: 0.225", "QsPosToProposal: max: 0.481, avg: 0.440", "ConsensusProposalToOrdered: max: 0.345, avg: 0.330", "ConsensusOrderedToCommit: max: 0.917, avg: 0.732", "ConsensusProposalToCommit: max: 1.245, avg: 1.062"]
Max non-epoch-change gap was: 0 rounds at version 0 (avg 0.00) [limit 4], 1.02s no progress at version 12041 (avg 0.23s) [limit 15].
Max epoch-change gap was: 0 rounds at version 0 (avg 0.00) [limit 4], 8.73s no progress at version 1080514 (avg 8.22s) [limit 15].
Test Ok

✅ Forge suite compat success on d1bf834728a0cf166d993f4728dfca54f3086fb0 ==> bf203070be3e07b0a72aa1ab31c1106dfe2ffa70

Compatibility test results for d1bf834728a0cf166d993f4728dfca54f3086fb0 ==> bf203070be3e07b0a72aa1ab31c1106dfe2ffa70 (PR)
1. Check liveness of validators at old version: d1bf834728a0cf166d993f4728dfca54f3086fb0
compatibility::simple-validator-upgrade::liveness-check : committed: 11601.51 txn/s, latency: 2826.38 ms, (p50: 2400 ms, p90: 4300 ms, p99: 8600 ms), latency samples: 398880
2. Upgrading first Validator to new version: bf203070be3e07b0a72aa1ab31c1106dfe2ffa70
compatibility::simple-validator-upgrade::single-validator-upgrading : committed: 6150.94 txn/s, latency: 4521.50 ms, (p50: 5000 ms, p90: 6000 ms, p99: 6300 ms), latency samples: 124840
compatibility::simple-validator-upgrade::single-validator-upgrade : committed: 6822.63 txn/s, latency: 4679.96 ms, (p50: 4800 ms, p90: 6900 ms, p99: 7100 ms), latency samples: 231640
3. Upgrading rest of first batch to new version: bf203070be3e07b0a72aa1ab31c1106dfe2ffa70
compatibility::simple-validator-upgrade::half-validator-upgrading : committed: 7096.29 txn/s, latency: 3918.49 ms, (p50: 4000 ms, p90: 5300 ms, p99: 5600 ms), latency samples: 143780
compatibility::simple-validator-upgrade::half-validator-upgrade : committed: 7213.10 txn/s, latency: 4371.04 ms, (p50: 4400 ms, p90: 6800 ms, p99: 7000 ms), latency samples: 242080
4. upgrading second batch to new version: bf203070be3e07b0a72aa1ab31c1106dfe2ffa70
compatibility::simple-validator-upgrade::rest-validator-upgrading : committed: 3831.01 txn/s, latency: 7658.78 ms, (p50: 8200 ms, p90: 14500 ms, p99: 14700 ms), latency samples: 78620
compatibility::simple-validator-upgrade::rest-validator-upgrade : committed: 9740.86 txn/s, latency: 3297.07 ms, (p50: 2900 ms, p90: 6000 ms, p99: 7700 ms), latency samples: 332520
5. check swarm health
Compatibility test for d1bf834728a0cf166d993f4728dfca54f3086fb0 ==> bf203070be3e07b0a72aa1ab31c1106dfe2ffa70 passed
Test Ok

✅ Forge suite framework_upgrade success on d1bf834728a0cf166d993f4728dfca54f3086fb0 ==> bf203070be3e07b0a72aa1ab31c1106dfe2ffa70

Compatibility test results for d1bf834728a0cf166d993f4728dfca54f3086fb0 ==> bf203070be3e07b0a72aa1ab31c1106dfe2ffa70 (PR)
Upgrade the nodes to version: bf203070be3e07b0a72aa1ab31c1106dfe2ffa70
framework_upgrade::framework-upgrade::full-framework-upgrade : committed: 1308.25 txn/s, submitted: 1311.66 txn/s, failed submission: 3.41 txn/s, expired: 3.41 txn/s, latency: 2360.60 ms, (p50: 2100 ms, p90: 3600 ms, p99: 5300 ms), latency samples: 114980
framework_upgrade::framework-upgrade::full-framework-upgrade : committed: 1239.97 txn/s, submitted: 1242.66 txn/s, failed submission: 2.69 txn/s, expired: 2.69 txn/s, latency: 2548.55 ms, (p50: 2400 ms, p90: 4200 ms, p99: 6400 ms), latency samples: 110680
5. check swarm health
Compatibility test for d1bf834728a0cf166d993f4728dfca54f3086fb0 ==> bf203070be3e07b0a72aa1ab31c1106dfe2ffa70 passed
Upgrade the remaining nodes to version: bf203070be3e07b0a72aa1ab31c1106dfe2ffa70
framework_upgrade::framework-upgrade::full-framework-upgrade : committed: 1226.78 txn/s, submitted: 1228.77 txn/s, failed submission: 1.99 txn/s, expired: 1.99 txn/s, latency: 2599.03 ms, (p50: 2400 ms, p90: 4200 ms, p99: 5900 ms), latency samples: 111060
Test Ok

@JoshLind JoshLind left a comment

Looks reasonable to me (from the node config and smoke test side). But, I defer the final stamp to @zekun000 for the consensus and validator verifier changes (some of this code is very old 😄).

}
tokio::time::sleep(Duration::from_secs(1)).await;
}
bail!("");

Contributor:

nit: maybe put a real error message here? 😄

Contributor (Author):

Oops, I forgot auto-merge is on...

}
}
}
info!("Overriding key work time: {:?}", timer.elapsed());

Contributor:

nit: do we think this is necessary to track/log?

@zjma zjma requested a review from zekun000 August 20, 2024 21:38
@zjma zjma merged commit 55ff034 into main Aug 22, 2024
46 checks passed
@zjma zjma deleted the zjma/debug0704 branch August 22, 2024 22:55
@zjma zjma mentioned this pull request Nov 26, 2024