
enhance consensus key rotation support #13926

Merged 1 commit into main on Aug 22, 2024

Conversation

@zjma zjma commented Jul 7, 2024

Description

Consensus key rotation is not well-supported today

Today, a validator only uses a single consensus key, and the only way to update that key is to edit the node config and restart the node. This leads to two problems:

  • When a validator updates its consensus pk on chain in epoch e and then updates the local consensus sk in epoch e+1, incorrect augmented keys for randomness have already been persisted at the beginning of epoch e+1, and the node will crash on restart (until epoch e+2, when the incorrect data is cleared).
  • When a validator updates its consensus pk on chain in epoch e and also updates the local consensus sk in epoch e, the node won't be able to participate in consensus or randomness until epoch e+1.

Proposed user flow

  • Stop the node.
  • In addition to the old identity, specify the new key in the node config (a sketch of the resulting config value follows this list).
  • Restart the node (the validator should still function with the old key).
  • Run a txn to update the on-chain consensus pk for the next epoch.
  • When the next epoch arrives, the validator should automatically switch to the new key without any operator action or downtime.
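
For illustration, here is a minimal Rust sketch of what the override amounts to at the config level. It uses a simplified copy of the InitialSafetyRulesConfig::FromFile variant touched by this PR (see the diff further down); the file paths are hypothetical, not prescribed by this change.

```rust
use std::path::PathBuf;

// Simplified copy of the config variant touched by this PR; the real type lives in
// the node config crate and carries serde attributes.
pub enum InitialSafetyRulesConfig {
    FromFile {
        identity_blob_path: PathBuf,
        overriding_identity_paths: Vec<PathBuf>,
    },
}

fn example_config() -> InitialSafetyRulesConfig {
    InitialSafetyRulesConfig::FromFile {
        // Existing identity: the node keeps running with the old key after restart.
        identity_blob_path: PathBuf::from("/opt/aptos/genesis/validator-identity.yaml"),
        // New identity (hypothetical path): picked up automatically at the next epoch
        // once the on-chain consensus pk has been updated.
        overriding_identity_paths: vec![PathBuf::from(
            "/opt/aptos/genesis/validator-identity-new.yaml",
        )],
    }
}

fn main() {
    let _cfg = example_config();
}
```

The point is simply that the old identity stays in place while the new one is listed alongside it as an override.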

The code changes

  • Add a node config entry, NodeConfig::ConsensusConfig::SafetyRulesConfig::InitialSafetyRulesConfig::FromFile::overriding_identity_paths, which holds the new validator identity (or identities).
  • Each time the secure storage is (re-)initialized (typically on every node restart), the private key of every validator identity given by the entry above is saved into the secure storage under the key consensus_<PK_HEX>, where PK_HEX is the hex-encoded public key.
  • At the beginning of a new epoch, every consensus-key-dependent component now looks up the consensus sk in the secure storage under consensus_<X>, where X is its hex-encoded on-chain validator pk. If no such entry exists, it falls back to the value under the key consensus (set by the existing implementation); a sketch of this lookup follows this list.
  • Add a smoke test, consensus_key_rotation, to capture the user flow.
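
To make the lookup order concrete, here is a minimal sketch of the fallback described above, written against a hypothetical key-value view of the secure storage rather than the real safety-rules storage API; only the consensus_<PK_HEX> naming scheme and the fallback to the legacy consensus entry are taken from this PR.

```rust
use std::collections::HashMap;

/// Storage name for a specific consensus key, as described above.
fn per_key_storage_name(pk_hex: &str) -> String {
    format!("consensus_{}", pk_hex)
}

/// Look up the consensus secret key for the current epoch:
/// first by the on-chain public key, then fall back to the legacy single entry.
fn find_consensus_sk<'a>(
    storage: &'a HashMap<String, Vec<u8>>, // stand-in for the secure storage
    on_chain_pk_hex: &str,
) -> Option<&'a Vec<u8>> {
    storage
        .get(&per_key_storage_name(on_chain_pk_hex))
        .or_else(|| storage.get("consensus"))
}

fn main() {
    let mut storage = HashMap::new();
    // Entry written when the secure storage is (re-)initialized with an overriding identity.
    storage.insert(per_key_storage_name("ab12"), vec![1, 2, 3]);
    // Legacy entry written by the existing implementation.
    storage.insert("consensus".to_string(), vec![9, 9, 9]);

    // The on-chain pk decides which entry is used; unknown pks fall back to "consensus".
    assert_eq!(find_consensus_sk(&storage, "ab12"), Some(&vec![1, 2, 3]));
    assert_eq!(find_consensus_sk(&storage, "cd34"), Some(&vec![9, 9, 9]));
}
```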

Type of Change

  • New feature
  • Bug fix
  • Refactoring

Which Components or Systems Does This Change Impact?

  • Validator Node

How Has This Been Tested?

A smoke test case.

Key Areas to Review

The entire PR.

Checklist

  • I have read and followed the CONTRIBUTING doc
  • I have performed a self-review of my own code
  • I have commented my code, particularly in hard-to-understand areas
  • I identified and added all stakeholders and component owners affected by this change as reviewers
  • I tested both happy and unhappy path of the functionality
  • I have made corresponding changes to the documentation

trunk-io bot commented Jul 7, 2024

⏱️ 8h 18m total CI duration on this PR
Job Cumulative Duration Recent Runs
test-fuzzers 2h 28m 🟩🟩🟩🟩
execution-performance / single-node-performance 1h 28m 🟩🟩🟩
forge-e2e-test / forge 44m 🟩🟩🟥
forge-compat-test / forge 42m 🟩🟩🟩
rust-images / rust-all 35m 🟩🟩🟩
forge-framework-upgrade-test / forge 32m 🟩🟩
execution-performance / test-target-determinator 23m 🟩🟩🟩
test-target-determinator 15m 🟩🟩🟥
rust-move-tests 15m 🟩
rust-move-tests 14m 🟩
rust-move-tests 14m 🟩
check 13m 🟩🟩🟩
general-lints 5m 🟩🟩🟩
check-dynamic-deps 4m 🟩🟩🟩🟩
semgrep/ci 2m 🟩🟩🟩🟩
file_change_determinator 45s 🟩🟩🟩🟩
file_change_determinator 41s 🟩🟩🟩🟩
file_change_determinator 31s 🟩🟩🟩
permission-check 13s 🟩🟩🟩
determine-docker-build-metadata 13s 🟩🟩🟩
permission-check 13s 🟩🟩🟩🟩
permission-check 11s 🟩🟩🟩🟩
permission-check 10s 🟩🟩🟩🟩
permission-check 9s 🟩🟩🟩🟩
rust-move-tests 8s

🚨 2 jobs on the last run were significantly faster/slower than expected

Job | Duration | vs 7d avg | Delta
execution-performance / test-target-determinator | 7m | 5m | +56%
execution-performance / single-node-performance | 28m | 20m | +37%


@zjma zjma marked this pull request as ready for review July 7, 2024 23:03
@zjma zjma enabled auto-merge (squash) July 7, 2024 23:03

@zjma zjma changed the title from "consensus key rotation support" to "enhance consensus key rotation support" on Jul 7, 2024

@zjma zjma requested a review from danielxiangzl August 1, 2024 21:21

@JoshLind JoshLind left a comment

Code seems reasonable to me.

Only big question is where are we planning to document this flow? It seems pretty challenging to get right without detailed instructions 😄

tokio::time::sleep(Duration::from_secs(5)).await;
(operator_addr, new_pk, pop, operator_idx)
} else {
unreachable!()

Contributor:

nit: maybe just unwrap instead of checking and then doing unreachable!()?

Contributor:

Still think the if-let is superfluous 😄

testsuite/smoke-test/src/consensus_key_rotation.rs (outdated thread; resolved)
consensus/src/consensus_observer/observer.rs (outdated thread; resolved)
consensus/safety-rules/src/persistent_safety_storage.rs (outdated thread; resolved)
consensus/safety-rules/src/persistent_safety_storage.rs (outdated thread; resolved)
}
}

pub fn overriding_identity_blob_paths_mut(&mut self) -> &mut Vec<PathBuf> {

Contributor:

nit: Maybe make this test-only? It seems like it's only used in the smoke tests? Or better yet, don't expose this here if possible?

Contributor (Author):

fixed.

NOTE that we can't reuse feature testing as crate smoke-test needs it disabled in addition to overriding_identity_blob_paths_mut.

Contributor:

Aah, in that case, don't worry about it -- you can remove the smoke-test feature. I was just wondering if it was a quick change 😄 Seems like overkill for a new feature.

@@ -123,15 +123,22 @@ impl ConfigSanitizer for SafetyRulesConfig {
pub enum InitialSafetyRulesConfig {
FromFile {
identity_blob_path: PathBuf,
#[serde(skip_serializing_if = "Vec::is_empty", default)]
overriding_identity_paths: Vec<PathBuf>,

Contributor:

Does this need to be a vector? Will 1 path not suffice (if the operator wants to rotate again, they need to move the new key to the old key location, and then add a single override). Would that work?

Contributor (Author):

In theory 2 slots (1 existing, 1 new) are enough. I just found that n slots can be supported for free by replacing Option<PathBuf> with Vec<PathBuf>...

@zjma zjma commented Aug 1, 2024

> Code seems reasonable to me.
>
> Only big question is where are we planning to document this flow? It seems pretty challenging to get right without detailed instructions 😄

@JoshLind I plan to update the docs after this is merged.

@zekun000 zekun000 left a comment

Looks reasonable; let's replace the debug logs with meaningful logs. We may want to support truncating old keys too, but that can be done later.

.expect("Failed in loading consensus key for ExecutionProxyClient.");
let signer = Arc::new(ValidatorSigner::new(self.author, consensus_key));
let consensus_sk = maybe_consensus_key
.expect("consensus key unavailable for ExecutionProxyClient");

Contributor:

given we just panic if the option doesn't exist, we should probably pass in Arc directly?

Contributor (Author):

We unwrap only if the current node is in the validator set.
For non-validators, I guess we can't unwrap outside...

) -> CliTypedResult<TransactionSummary> {
UpdateConsensusKey {
txn_options: self.transaction_options(operator_index, None),
txn_options: self.transaction_options(operator_index, gas_options),

Contributor:

how's this change related?

Contributor (Author):

needed for a smoke test to be less flaky.

✅ Forge suite realistic_env_max_load success on bf203070be3e07b0a72aa1ab31c1106dfe2ffa70

two traffics test: inner traffic : committed: 11763.51 txn/s, latency: 3383.08 ms, (p50: 3200 ms, p90: 3800 ms, p99: 11700 ms), latency samples: 4472800
two traffics test : committed: 99.99 txn/s, latency: 2854.56 ms, (p50: 2500 ms, p90: 3700 ms, p99: 8100 ms), latency samples: 1760
Latency breakdown for phase 0: ["QsBatchToPos: max: 0.258, avg: 0.225", "QsPosToProposal: max: 0.481, avg: 0.440", "ConsensusProposalToOrdered: max: 0.345, avg: 0.330", "ConsensusOrderedToCommit: max: 0.917, avg: 0.732", "ConsensusProposalToCommit: max: 1.245, avg: 1.062"]
Max non-epoch-change gap was: 0 rounds at version 0 (avg 0.00) [limit 4], 1.02s no progress at version 12041 (avg 0.23s) [limit 15].
Max epoch-change gap was: 0 rounds at version 0 (avg 0.00) [limit 4], 8.73s no progress at version 1080514 (avg 8.22s) [limit 15].
Test Ok

✅ Forge suite compat success on d1bf834728a0cf166d993f4728dfca54f3086fb0 ==> bf203070be3e07b0a72aa1ab31c1106dfe2ffa70

Compatibility test results for d1bf834728a0cf166d993f4728dfca54f3086fb0 ==> bf203070be3e07b0a72aa1ab31c1106dfe2ffa70 (PR)
1. Check liveness of validators at old version: d1bf834728a0cf166d993f4728dfca54f3086fb0
compatibility::simple-validator-upgrade::liveness-check : committed: 11601.51 txn/s, latency: 2826.38 ms, (p50: 2400 ms, p90: 4300 ms, p99: 8600 ms), latency samples: 398880
2. Upgrading first Validator to new version: bf203070be3e07b0a72aa1ab31c1106dfe2ffa70
compatibility::simple-validator-upgrade::single-validator-upgrading : committed: 6150.94 txn/s, latency: 4521.50 ms, (p50: 5000 ms, p90: 6000 ms, p99: 6300 ms), latency samples: 124840
compatibility::simple-validator-upgrade::single-validator-upgrade : committed: 6822.63 txn/s, latency: 4679.96 ms, (p50: 4800 ms, p90: 6900 ms, p99: 7100 ms), latency samples: 231640
3. Upgrading rest of first batch to new version: bf203070be3e07b0a72aa1ab31c1106dfe2ffa70
compatibility::simple-validator-upgrade::half-validator-upgrading : committed: 7096.29 txn/s, latency: 3918.49 ms, (p50: 4000 ms, p90: 5300 ms, p99: 5600 ms), latency samples: 143780
compatibility::simple-validator-upgrade::half-validator-upgrade : committed: 7213.10 txn/s, latency: 4371.04 ms, (p50: 4400 ms, p90: 6800 ms, p99: 7000 ms), latency samples: 242080
4. upgrading second batch to new version: bf203070be3e07b0a72aa1ab31c1106dfe2ffa70
compatibility::simple-validator-upgrade::rest-validator-upgrading : committed: 3831.01 txn/s, latency: 7658.78 ms, (p50: 8200 ms, p90: 14500 ms, p99: 14700 ms), latency samples: 78620
compatibility::simple-validator-upgrade::rest-validator-upgrade : committed: 9740.86 txn/s, latency: 3297.07 ms, (p50: 2900 ms, p90: 6000 ms, p99: 7700 ms), latency samples: 332520
5. check swarm health
Compatibility test for d1bf834728a0cf166d993f4728dfca54f3086fb0 ==> bf203070be3e07b0a72aa1ab31c1106dfe2ffa70 passed
Test Ok

✅ Forge suite framework_upgrade success on d1bf834728a0cf166d993f4728dfca54f3086fb0 ==> bf203070be3e07b0a72aa1ab31c1106dfe2ffa70

Compatibility test results for d1bf834728a0cf166d993f4728dfca54f3086fb0 ==> bf203070be3e07b0a72aa1ab31c1106dfe2ffa70 (PR)
Upgrade the nodes to version: bf203070be3e07b0a72aa1ab31c1106dfe2ffa70
framework_upgrade::framework-upgrade::full-framework-upgrade : committed: 1308.25 txn/s, submitted: 1311.66 txn/s, failed submission: 3.41 txn/s, expired: 3.41 txn/s, latency: 2360.60 ms, (p50: 2100 ms, p90: 3600 ms, p99: 5300 ms), latency samples: 114980
framework_upgrade::framework-upgrade::full-framework-upgrade : committed: 1239.97 txn/s, submitted: 1242.66 txn/s, failed submission: 2.69 txn/s, expired: 2.69 txn/s, latency: 2548.55 ms, (p50: 2400 ms, p90: 4200 ms, p99: 6400 ms), latency samples: 110680
5. check swarm health
Compatibility test for d1bf834728a0cf166d993f4728dfca54f3086fb0 ==> bf203070be3e07b0a72aa1ab31c1106dfe2ffa70 passed
Upgrade the remaining nodes to version: bf203070be3e07b0a72aa1ab31c1106dfe2ffa70
framework_upgrade::framework-upgrade::full-framework-upgrade : committed: 1226.78 txn/s, submitted: 1228.77 txn/s, failed submission: 1.99 txn/s, expired: 1.99 txn/s, latency: 2599.03 ms, (p50: 2400 ms, p90: 4200 ms, p99: 5900 ms), latency samples: 111060
Test Ok

@JoshLind JoshLind left a comment

Looks reasonable to me (from the node config and smoke test side). But, I defer the final stamp to @zekun000 for the consensus and validator verifier changes (some of this code is very old 😄).

}
tokio::time::sleep(Duration::from_secs(1)).await;
}
bail!("");

Contributor:

nit: maybe put a real error message here? 😄

Contributor (Author):

Oops, I forgot auto-merge is on...

}
}
}
info!("Overriding key work time: {:?}", timer.elapsed());

Contributor:

nit: do we think this is necessary to track/log?

@zjma zjma requested a review from zekun000 August 20, 2024 21:38
@zjma zjma merged commit 55ff034 into main Aug 22, 2024
46 checks passed
@zjma zjma deleted the zjma/debug0704 branch August 22, 2024 22:55
@zjma zjma mentioned this pull request Nov 26, 2024