Skip to content

[pipeline] fix observer #15764

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 8 commits into from
Feb 22, 2025
Merged

[pipeline] fix observer #15764

merged 8 commits into from
Feb 22, 2025

Conversation

zekun000
Copy link
Contributor

  • tolerate the case when the parent block's future gets garbage collected (i.e. at epoch boundary)
  • move pipeline creation to be exclusive with sync (either target or fallback)

Copy link

trunk-io bot commented Jan 16, 2025

⏱️ 1h 17m total CI duration on this PR
Slowest 15 Jobs Cumulative Duration Recent Runs
execution-performance / single-node-performance 24m 🟩🟩
rust-targeted-unit-tests 21m 🟩
check-dynamic-deps 10m 🟩🟩🟩
rust-doc-tests 5m 🟩
execution-performance / test-target-determinator 5m 🟩
test-target-determinator 5m 🟩
rust-cargo-deny 4m 🟩🟩
fetch-last-released-docker-image-tag 2m 🟩
semgrep/ci 56s 🟩🟩
general-lints 52s 🟩🟩
file_change_determinator 22s 🟩🟩
file_change_determinator 9s 🟩
permission-check 4s 🟩🟩
permission-check 4s 🟩🟩
permission-check 3s 🟩

🚨 1 job on the last run was significantly faster/slower than expected

Job Duration vs 7d avg Delta
check-dynamic-deps 5m 2m +115%

settingsfeedbackdocs ⋅ learn more about trunk.io

@zekun000 zekun000 requested review from JoshLind and ibalajiarun and removed request for ibalajiarun and sasha8 January 16, 2025 23:28
Copy link
Contributor

@JoshLind JoshLind left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM. Thanks boss!

// Insert the ordered block into the pending blocks
self.ordered_block_store
.lock()
.insert_ordered_block(ordered_block.clone());

// If state sync is not syncing to a commit, finalize the ordered blocks
if !self.state_sync_manager.is_syncing_to_commit() {
if !self.state_sync_manager.is_syncing_to_commit()
&& !self.state_sync_manager.in_fallback_mode()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: this is probably not needed. When we enter fallback mode, we terminate all subscriptions, so we never process any CO network messages in fallback mode.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

hmm i added this because i saw in logs that both fallback and pipeline are running

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Weird. Do you have a log trace? When we fallback, we terminate all subscriptions, so any messages received after that should be dropped. Wondering if there's a race condition or something else 🤔

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Boss, can you revert this? It shouldn't happen, and if it does, something much bigger is wrong, so I'd rather not hide it 😄

);
return;
}
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@JoshLind there should be a return here iiuc? Otherwise, we are executing twice.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

IIUC and this orders the blocks for execution, then yeah, we should return, otherwise, we'll call finalize_order below and that will execute as well? 🤔

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we send to buffer manager regardless of new pipeline enabled or not, it still relies on buffer manager to aggregate votes/cert. buffer manager inspects the pipelined block to know whether it's enabled or not

@JoshLind JoshLind added the CICD:run-e2e-tests when this label is present github actions will run all land-blocking e2e tests from the PR label Jan 23, 2025

This comment has been minimized.

This comment has been minimized.

This comment has been minimized.

This comment has been minimized.

This comment has been minimized.

This comment has been minimized.

This comment has been minimized.

This comment has been minimized.

This comment has been minimized.

This comment has been minimized.

This comment has been minimized.

This comment has been minimized.

This comment has been minimized.

This comment has been minimized.

This comment has been minimized.

This comment has been minimized.

@zekun000 zekun000 requested a review from gregnazario as a code owner February 7, 2025 22:03

This comment has been minimized.

This comment has been minimized.

This comment has been minimized.

This comment has been minimized.

Copy link
Contributor

@JoshLind JoshLind left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Latest run looks good. 🚀 One node fell into state sync but looks unrelated (bottlenecked).

LGTM, but can we revert the finalize_ordered_block branch check? 😄

// Insert the ordered block into the pending blocks
self.ordered_block_store
.lock()
.insert_ordered_block(ordered_block.clone());

// If state sync is not syncing to a commit, finalize the ordered blocks
if !self.state_sync_manager.is_syncing_to_commit() {
if !self.state_sync_manager.is_syncing_to_commit()
&& !self.state_sync_manager.in_fallback_mode()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Boss, can you revert this? It shouldn't happen, and if it does, something much bigger is wrong, so I'd rather not hide it 😄

This comment has been minimized.

This comment has been minimized.

This comment has been minimized.

This comment has been minimized.

This comment has been minimized.

This comment has been minimized.

This comment has been minimized.

This comment has been minimized.

@zekun000 zekun000 enabled auto-merge (rebase) February 21, 2025 23:38

This comment has been minimized.

This comment has been minimized.

This comment has been minimized.

Copy link
Contributor

✅ Forge suite realistic_env_max_load success on 3462b28c1a4422728022f2cf70841f5a0a0bfd24

two traffics test: inner traffic : committed: 12138.31 txn/s, latency: 3275.38 ms, (p50: 3000 ms, p70: 3300, p90: 4200 ms, p99: 7400 ms), latency samples: 4615240
two traffics test : committed: 99.96 txn/s, latency: 2779.89 ms, (p50: 2100 ms, p70: 2700, p90: 6200 ms, p99: 7900 ms), latency samples: 1840
Latency breakdown for phase 0: ["MempoolToBlockCreation: max: 1.662, avg: 0.981", "ConsensusProposalToOrdered: max: 0.321, avg: 0.308", "ConsensusOrderedToCommit: max: 0.661, avg: 0.541", "ConsensusProposalToCommit: max: 0.964, avg: 0.849"]
Max non-epoch-change gap was: 1 rounds at version 2647895 (avg 0.00) [limit 4], 1.81s no progress at version 2647895 (avg 0.21s) [limit 15].
Max epoch-change gap was: 0 rounds at version 0 (avg 0.00) [limit 4], 1.25s no progress at version 2642169 (avg 1.25s) [limit 16].
Test Ok

Copy link
Contributor

✅ Forge suite compat success on 51e3afbaf4645c7a8dd03b94e47555c0dbed0366 ==> 3462b28c1a4422728022f2cf70841f5a0a0bfd24

Compatibility test results for 51e3afbaf4645c7a8dd03b94e47555c0dbed0366 ==> 3462b28c1a4422728022f2cf70841f5a0a0bfd24 (PR)
1. Check liveness of validators at old version: 51e3afbaf4645c7a8dd03b94e47555c0dbed0366
compatibility::simple-validator-upgrade::liveness-check : committed: 11949.41 txn/s, latency: 2770.73 ms, (p50: 2900 ms, p70: 3000, p90: 3200 ms, p99: 3500 ms), latency samples: 395740
2. Upgrading first Validator to new version: 3462b28c1a4422728022f2cf70841f5a0a0bfd24
compatibility::simple-validator-upgrade::single-validator-upgrading : committed: 3425.02 txn/s, latency: 8681.58 ms, (p50: 9300 ms, p70: 10400, p90: 11100 ms, p99: 11300 ms), latency samples: 75760
compatibility::simple-validator-upgrade::single-validator-upgrade : committed: 3368.53 txn/s, latency: 10243.71 ms, (p50: 11200 ms, p70: 11500, p90: 11700 ms, p99: 11800 ms), latency samples: 127940
3. Upgrading rest of first batch to new version: 3462b28c1a4422728022f2cf70841f5a0a0bfd24
compatibility::simple-validator-upgrade::half-validator-upgrading : committed: 3200.96 txn/s, latency: 9216.85 ms, (p50: 9800 ms, p70: 11200, p90: 12000 ms, p99: 12200 ms), latency samples: 70280
compatibility::simple-validator-upgrade::half-validator-upgrade : committed: 3200.45 txn/s, latency: 10774.52 ms, (p50: 11700 ms, p70: 12100, p90: 12400 ms, p99: 12500 ms), latency samples: 125600
4. upgrading second batch to new version: 3462b28c1a4422728022f2cf70841f5a0a0bfd24
compatibility::simple-validator-upgrade::rest-validator-upgrading : committed: 1562.00 txn/s, submitted: 1562.15 txn/s, expired: 0.15 txn/s, latency: 4946.66 ms, (p50: 5700 ms, p70: 6300, p90: 6500 ms, p99: 6600 ms), latency samples: 115429
compatibility::simple-validator-upgrade::rest-validator-upgrade : committed: 6132.82 txn/s, latency: 5566.60 ms, (p50: 6000 ms, p70: 6300, p90: 6500 ms, p99: 6800 ms), latency samples: 211160
5. check swarm health
Compatibility test for 51e3afbaf4645c7a8dd03b94e47555c0dbed0366 ==> 3462b28c1a4422728022f2cf70841f5a0a0bfd24 passed
Test Ok

Copy link
Contributor

✅ Forge suite framework_upgrade success on 51e3afbaf4645c7a8dd03b94e47555c0dbed0366 ==> 3462b28c1a4422728022f2cf70841f5a0a0bfd24

Compatibility test results for 51e3afbaf4645c7a8dd03b94e47555c0dbed0366 ==> 3462b28c1a4422728022f2cf70841f5a0a0bfd24 (PR)
Upgrade the nodes to version: 3462b28c1a4422728022f2cf70841f5a0a0bfd24
framework_upgrade::framework-upgrade::full-framework-upgrade : committed: 1432.77 txn/s, submitted: 1438.75 txn/s, failed submission: 5.98 txn/s, expired: 5.98 txn/s, latency: 2115.04 ms, (p50: 1800 ms, p70: 2100, p90: 3300 ms, p99: 5600 ms), latency samples: 129401
framework_upgrade::framework-upgrade::full-framework-upgrade : committed: 1648.96 txn/s, submitted: 1653.12 txn/s, failed submission: 3.98 txn/s, expired: 4.15 txn/s, latency: 1751.41 ms, (p50: 1500 ms, p70: 1900, p90: 2800 ms, p99: 4500 ms), latency samples: 149304
5. check swarm health
Compatibility test for 51e3afbaf4645c7a8dd03b94e47555c0dbed0366 ==> 3462b28c1a4422728022f2cf70841f5a0a0bfd24 passed
Upgrade the remaining nodes to version: 3462b28c1a4422728022f2cf70841f5a0a0bfd24
framework_upgrade::framework-upgrade::full-framework-upgrade : committed: 1676.85 txn/s, submitted: 1682.82 txn/s, failed submission: 5.96 txn/s, expired: 5.96 txn/s, latency: 1729.13 ms, (p50: 1500 ms, p70: 1800, p90: 2500 ms, p99: 4100 ms), latency samples: 151861
Test Ok

@zekun000 zekun000 merged commit b608a5d into main Feb 22, 2025
87 of 90 checks passed
@zekun000 zekun000 deleted the zekun/fix branch February 22, 2025 00:16
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
CICD:run-e2e-tests when this label is present github actions will run all land-blocking e2e tests from the PR
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants