Skip to content

ci: ensure bootstrap node is ready before starting other nodes#7321

Merged
steviez merged 1 commit intoanza-xyz:masterfrom
yihau:fix-localnet-test
Aug 5, 2025
Merged

ci: ensure bootstrap node is ready before starting other nodes#7321
steviez merged 1 commit intoanza-xyz:masterfrom
yihau:fix-localnet-test

Conversation

@yihau
Copy link
Copy Markdown
Member

@yihau yihau commented Aug 5, 2025

Problem

context: https://discord.com/channels/428295358100013066/560503042458517505/1402093272872128613

localnet test is flaky

Summary of Changes

in the startNodes function, we have a comment that makes sense, but the actual logic doesn’t quite align with it.

let’s say we have three nodes: the bootstrap node, node 1, and node 2. the current logic starts the nodes in order: first the bootstrap node (index 0), then node 1 (index 1), and so on. however, when starting node 1, the code waits for its own init-complete file instead of checking if the bootstrap node (node 0) is ready. I’ve moved the init-complete check to the top of the loop. this way, when we start node 1, we’ll correctly wait for node 0’s init-complete log (the bootstrap node) before proceeding

@yihau yihau marked this pull request as ready for review August 5, 2025 16:11
@yihau yihau requested review from steviez and willhickey August 5, 2025 16:20
Comment thread ci/localnet-sanity.sh
# 1 == bootstrap validator, wait until it boots before starting
# other validators
# wait for bootstrap validator to boot before starting other validators
if [[ "$i" -eq 1 ]]; then
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe move this whole if block above the loop?

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

would you mind explaining that in a bit more detail?

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure I follow either - the loop initializes all validators; we just need to do different stuff for validators > 0

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't want to speak for what Alex was thinking, but if he means start the bootstrap validator first, separately and then start the rest, I agree with him.

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I also don't want to bikeshed this while I'm seeing failures on 50% of my runs though. And would prefer we got something that worked in first.

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes i have meant exactly what Rory mentioned - that we set up bootstrap first to avoid specialcasing based on index. but it is good to have CI fixed thank you!

Comment thread ci/localnet-sanity.sh

# 1 == bootstrap validator, wait until it boots before starting
# other validators
# wait for bootstrap validator to boot before starting other validators
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: Maybe explicitly mention that bootstrap == index 0.

Also, this might have "worked" at one point. Passing initCompleteFile to waitForNodToInit is tricky because the value (if what integer is attached to the log file) is getting updated in the loop

Maybe we add a new variable to make the intent a little more clear. Something like:

if [[ "$i" -eq 1 ]]; then
    declare bootstrapInitCompleteFile="init-complete-node0.log"
    waitForNodeToInit "$bootstrapInitCompleteFile"
...

Comment thread ci/localnet-sanity.sh
# 1 == bootstrap validator, wait until it boots before starting
# other validators
# wait for bootstrap validator to boot before starting other validators
if [[ "$i" -eq 1 ]]; then
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure I follow either - the loop initializes all validators; we just need to do different stuff for validators > 0

@steviez steviez requested a review from roryharr August 5, 2025 17:13
@steviez
Copy link
Copy Markdown

steviez commented Aug 5, 2025

Added @roryharr for review as he was looking at this test in depth yesterday

Copy link
Copy Markdown

@steviez steviez left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Given how people have been running into this, I'm fine to merge it as-is

@steviez
Copy link
Copy Markdown

steviez commented Aug 5, 2025

I'm going to merge this as it is probably pretty late for @yihau

@steviez steviez merged commit e0b720b into anza-xyz:master Aug 5, 2025
33 checks passed
@yihau yihau deleted the fix-localnet-test branch August 6, 2025 02:00
@yihau yihau added the v2.3 label Aug 7, 2025
@mergify
Copy link
Copy Markdown

mergify Bot commented Aug 7, 2025

Backports to the beta branch are to be avoided unless absolutely necessary for fixing bugs, security issues, and perf regressions. Changes intended for backport should be structured such that a minimum effective diff can be committed separately from any refactoring, plumbing, cleanup, etc that are not strictly necessary to achieve the goal. Any of the latter should go only into master and ride the normal stabilization schedule. Exceptions include CI/metrics changes, CLI improvements and documentation updates on a case by case basis.

mergify Bot pushed a commit that referenced this pull request Aug 7, 2025
steviez pushed a commit that referenced this pull request Aug 7, 2025
…(backport of #7321) (#7382)

ci: ensure bootstrap node is ready before starting other nodes (#7321)

(cherry picked from commit e0b720b)

Co-authored-by: Yihau Chen <yihau.chen@icloud.com>
akhi3030 added a commit to anza-xyz/alpenglow that referenced this pull request Aug 21, 2025
We are seeing flaky tests that hopefully this will help resolve.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants