[Consensus Observer] Add state sync fallback mode. #14955

Merged
merged 3 commits into from
Oct 16, 2024

@JoshLind JoshLind commented Oct 13, 2024

Description

This PR adds a "state sync fallback" mode to consensus observer (CO). If CO does not make progress for some time (e.g., no increase in synced version for 30 seconds), it will fall back to state sync for a while (e.g., 10 minutes). After that, CO regains control and attempts to make progress again.

The PR offers the following commits:

  1. Rename sync_to() to sync_to_target()
  2. Add the state sync fallback mode to CO. For this, we introduce two new components to help manage state sync interaction: (i) the state sync manager; and (ii) the observer fallback manager.
  3. Enable CO for all validators and VFNs (by default).
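
The fallback behavior described above can be sketched roughly as follows. This is a minimal illustration, not the code in this PR: `FallbackConfig`, `FallbackTracker`, `Mode`, and `tick` are hypothetical names, and the 30 second and 10 minute values are only the examples quoted in the description.

```rust
use std::time::{Duration, Instant};

/// Hypothetical configuration; field names are illustrative only.
#[derive(Clone, Copy, Debug)]
pub struct FallbackConfig {
    /// How long the synced version may stay unchanged before falling back.
    pub progress_threshold: Duration,
    /// How long state sync keeps control once fallback begins.
    pub fallback_duration: Duration,
}

/// Who is currently responsible for making progress.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub enum Mode {
    /// Consensus observer is in control.
    Observing,
    /// State sync is in control until the given instant.
    StateSyncFallback { until: Instant },
}

/// Hypothetical tracker that decides when to enter and leave fallback.
pub struct FallbackTracker {
    config: FallbackConfig,
    highest_synced_version: u64,
    last_progress: Instant,
    mode: Mode,
}

impl FallbackTracker {
    pub fn new(config: FallbackConfig, now: Instant) -> Self {
        Self {
            config,
            highest_synced_version: 0,
            last_progress: now,
            mode: Mode::Observing,
        }
    }

    /// Called periodically with the latest synced version and the current time.
    pub fn tick(&mut self, synced_version: u64, now: Instant) -> Mode {
        match self.mode {
            Mode::StateSyncFallback { until } => {
                if now >= until {
                    // Fallback window is over: hand control back to the
                    // observer and restart the progress timer.
                    self.mode = Mode::Observing;
                    self.highest_synced_version =
                        self.highest_synced_version.max(synced_version);
                    self.last_progress = now;
                }
            }
            Mode::Observing => {
                if synced_version > self.highest_synced_version {
                    // Progress was made: record it and reset the timer.
                    self.highest_synced_version = synced_version;
                    self.last_progress = now;
                } else if now.saturating_duration_since(self.last_progress)
                    >= self.config.progress_threshold
                {
                    // No progress for too long: fall back to state sync.
                    self.mode = Mode::StateSyncFallback {
                        until: now + self.config.fallback_duration,
                    };
                }
            }
        }
        self.mode
    }
}
```

Passing `now` in explicitly, rather than calling `Instant::now()` inside `tick`, is a choice made for this sketch so that the timing logic can be exercised without waiting in real time.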

Testing Plan

New and existing test infrastructure, including manual verification, e.g., see #14960.

trunk-io bot commented Oct 13, 2024

⏱️ 9h 59m total CI duration on this PR
Slowest 15 jobs (job, cumulative duration, recent runs):
execution-performance / single-node-performance     3h 27m   🟩🟥🟩🟩🟩 (+6 more)
execution-performance / test-target-determinator    52m      🟩🟩🟩🟩🟩 (+7 more)
test-target-determinator                            52m      🟩🟩🟩🟩🟩 (+7 more)
check                                               44m      🟩🟩🟩🟩🟩 (+7 more)
rust-cargo-deny                                     21m      🟩🟩🟩🟩🟩 (+7 more)
fetch-last-released-docker-image-tag                19m      🟩🟩🟩🟩🟩 (+7 more)
check-dynamic-deps                                  14m      🟩🟩🟩🟩🟩 (+8 more)
rust-move-tests                                     10m      🟩
rust-move-tests                                     10m      🟩
rust-move-tests                                     10m      🟩
semgrep/ci                                          10m      🟩🟩🟩🟩🟩 (+8 more)
rust-move-tests                                     10m      🟩
rust-move-tests                                     10m      🟩
rust-move-tests                                     10m      🟩
rust-move-tests                                     10m      🟩

@JoshLind JoshLind added the CICD:run-e2e-tests label (when this label is present, GitHub Actions will run all land-blocking e2e tests from the PR) Oct 13, 2024

@JoshLind JoshLind force-pushed the duration_sync_2 branch 2 times, most recently from 7175d7b to b0b8985, October 14, 2024 13:58

@JoshLind JoshLind requested review from bchocho and msmouse October 15, 2024 18:22
@@ -837,8 +949,8 @@ impl ConsensusObserver {
Some(network_message) = consensus_observer_message_receiver.next() => {

Do we reject messages during fallback mode to avoid accumulating too much data?

Yeah! We'll continue to pop them off the channel, and then drop them. This is because we unsubscribe from all peers before we fall back, so message sender verification will fail, e.g., here:

// Verify the message is from the peers we've subscribed to
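
As a rough illustration of why messages end up dropped during fallback, here is a minimal sketch. It assumes a simple set of subscribed peers; `PeerId`, `Subscriptions`, `Outcome`, and `handle_message` are hypothetical names and not the types used in the consensus observer code.

```rust
use std::collections::HashSet;

/// Hypothetical peer identifier, used only for illustration.
pub type PeerId = u64;

/// What happened to a message taken off the channel.
#[derive(Debug, PartialEq, Eq)]
pub enum Outcome {
    Processed,
    Dropped,
}

/// Hypothetical set of peers the observer is currently subscribed to.
#[derive(Default)]
pub struct Subscriptions {
    active: HashSet<PeerId>,
}

impl Subscriptions {
    pub fn subscribe(&mut self, peer: PeerId) {
        self.active.insert(peer);
    }

    /// Assumed to be called right before entering fallback mode.
    pub fn unsubscribe_all(&mut self) {
        self.active.clear();
    }

    /// Verify the message is from a peer we've subscribed to.
    pub fn verify_sender(&self, peer: PeerId) -> bool {
        self.active.contains(&peer)
    }
}

/// The message is always taken off the channel by the caller; it is only
/// processed if the sender passes verification, otherwise it is dropped.
pub fn handle_message(subs: &Subscriptions, sender: PeerId, _payload: &[u8]) -> Outcome {
    if !subs.verify_sender(sender) {
        return Outcome::Dropped;
    }
    Outcome::Processed
}
```

With no active subscriptions, every incoming message fails the sender check, so nothing accumulates even though the receive loop keeps draining the channel.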

@zekun000

also the performance looks worse than without CO?

@JoshLind

also the performance looks worse than without CO?

Hmm... the graphs look clean though 🤔 Let me rebase. Maybe a noisy run?

@zekun000

The graph looks clean, but the e2e latency looks exactly the same as with CO off. I was expecting we'd see a clear win from it, but it may be related to how the test is set up.

@JoshLind

JoshLind commented Oct 16, 2024

I was expecting we'd see a clear win from it, but it may be related to how the test is set up

Aah, I see. If you look at the high priority traffic the latency is around 1.8 seconds, but without CO, it often seems closer to 2.7? I assume the rest of the traffic is bottlenecked somewhere else?

Base automatically changed from duration_sync_1 to main October 16, 2024 01:00

@JoshLind

JoshLind commented Oct 16, 2024

@zekun000, yeah, after the rebase the results are the same. It shaves about 1 second off the high-priority traffic latency in this setup. The rest must be bottlenecked elsewhere. 🤔

@zekun000 zekun000 left a comment

how did you test the fallback mode?

@JoshLind

how did you test the fallback mode?

Aah, take a look at the land blocking run on this PR: #14960. I hacked the code so that CO only processes commit notifications periodically. So, when commits are not processed, the node falls back to this type of syncing. Seems to work okay 😄

@JoshLind JoshLind enabled auto-merge (rebase) October 16, 2024 13:21

✅ Forge suite compat success on 7eeba4cd15892717741a614add1afde004c7855f ==> d4452a44c54352523d24ed7fa5d1ef52437c1033

Compatibility test results for 7eeba4cd15892717741a614add1afde004c7855f ==> d4452a44c54352523d24ed7fa5d1ef52437c1033 (PR)
1. Check liveness of validators at old version: 7eeba4cd15892717741a614add1afde004c7855f
compatibility::simple-validator-upgrade::liveness-check : committed: 12760.10 txn/s, latency: 2580.40 ms, (p50: 2100 ms, p70: 2400, p90: 3900 ms, p99: 8500 ms), latency samples: 483460
2. Upgrading first Validator to new version: d4452a44c54352523d24ed7fa5d1ef52437c1033
compatibility::simple-validator-upgrade::single-validator-upgrading : committed: 6648.77 txn/s, latency: 4281.63 ms, (p50: 4900 ms, p70: 5200, p90: 5400 ms, p99: 5500 ms), latency samples: 119780
compatibility::simple-validator-upgrade::single-validator-upgrade : committed: 5775.99 txn/s, latency: 5267.61 ms, (p50: 5600 ms, p70: 5700, p90: 5900 ms, p99: 6700 ms), latency samples: 219320
3. Upgrading rest of first batch to new version: d4452a44c54352523d24ed7fa5d1ef52437c1033
compatibility::simple-validator-upgrade::half-validator-upgrading : committed: 6645.30 txn/s, latency: 4257.60 ms, (p50: 4800 ms, p70: 5100, p90: 5300 ms, p99: 5300 ms), latency samples: 124240
compatibility::simple-validator-upgrade::half-validator-upgrade : committed: 5796.66 txn/s, latency: 5596.21 ms, (p50: 5900 ms, p70: 6100, p90: 7400 ms, p99: 7800 ms), latency samples: 207360
4. upgrading second batch to new version: d4452a44c54352523d24ed7fa5d1ef52437c1033
compatibility::simple-validator-upgrade::rest-validator-upgrading : committed: 9322.01 txn/s, latency: 2850.31 ms, (p50: 3200 ms, p70: 3300, p90: 3500 ms, p99: 3800 ms), latency samples: 171260
compatibility::simple-validator-upgrade::rest-validator-upgrade : committed: 9434.18 txn/s, latency: 3236.63 ms, (p50: 3000 ms, p70: 3300, p90: 5700 ms, p99: 6900 ms), latency samples: 313040
5. check swarm health
Compatibility test for 7eeba4cd15892717741a614add1afde004c7855f ==> d4452a44c54352523d24ed7fa5d1ef52437c1033 passed
Test Ok

✅ Forge suite realistic_env_max_load success on d4452a44c54352523d24ed7fa5d1ef52437c1033

two traffics test: inner traffic : committed: 13847.98 txn/s, submitted: 13848.30 txn/s, expired: 0.32 txn/s, latency: 2871.16 ms, (p50: 2700 ms, p70: 2900, p90: 3000 ms, p99: 4800 ms), latency samples: 5265320
two traffics test : committed: 100.01 txn/s, latency: 2085.96 ms, (p50: 1600 ms, p70: 2400, p90: 2900 ms, p99: 6800 ms), latency samples: 1860
Latency breakdown for phase 0: ["QsBatchToPos: max: 0.235, avg: 0.221", "QsPosToProposal: max: 0.858, avg: 0.808", "ConsensusProposalToOrdered: max: 0.312, avg: 0.303", "ConsensusOrderedToCommit: max: 0.513, avg: 0.495", "ConsensusProposalToCommit: max: 0.817, avg: 0.798"]
Max non-epoch-change gap was: 0 rounds at version 0 (avg 0.00) [limit 4], 0.81s no progress at version 1868180 (avg 0.21s) [limit 15].
Max epoch-change gap was: 0 rounds at version 0 (avg 0.00) [limit 4], 8.46s no progress at version 1868178 (avg 7.97s) [limit 15].
Test Ok

✅ Forge suite framework_upgrade success on 7eeba4cd15892717741a614add1afde004c7855f ==> d4452a44c54352523d24ed7fa5d1ef52437c1033

Compatibility test results for 7eeba4cd15892717741a614add1afde004c7855f ==> d4452a44c54352523d24ed7fa5d1ef52437c1033 (PR)
Upgrade the nodes to version: d4452a44c54352523d24ed7fa5d1ef52437c1033
framework_upgrade::framework-upgrade::full-framework-upgrade : committed: 1332.66 txn/s, submitted: 1336.40 txn/s, failed submission: 3.74 txn/s, expired: 3.74 txn/s, latency: 2491.91 ms, (p50: 2100 ms, p70: 2400, p90: 4200 ms, p99: 5600 ms), latency samples: 113980
framework_upgrade::framework-upgrade::full-framework-upgrade : committed: 1187.04 txn/s, submitted: 1187.71 txn/s, failed submission: 0.67 txn/s, expired: 0.67 txn/s, latency: 2570.23 ms, (p50: 2200 ms, p70: 2700, p90: 4400 ms, p99: 5900 ms), latency samples: 105740
5. check swarm health
Compatibility test for 7eeba4cd15892717741a614add1afde004c7855f ==> d4452a44c54352523d24ed7fa5d1ef52437c1033 passed
Upgrade the remaining nodes to version: d4452a44c54352523d24ed7fa5d1ef52437c1033
framework_upgrade::framework-upgrade::full-framework-upgrade : committed: 1235.21 txn/s, submitted: 1237.69 txn/s, failed submission: 2.48 txn/s, expired: 2.48 txn/s, latency: 2480.32 ms, (p50: 2300 ms, p70: 2700, p90: 3900 ms, p99: 5400 ms), latency samples: 109680
Test Ok

@JoshLind JoshLind disabled auto-merge October 16, 2024 13:59
@JoshLind JoshLind merged commit 85f5e60 into main Oct 16, 2024
89 checks passed
@JoshLind JoshLind deleted the duration_sync_2 branch October 16, 2024 13:59