fix(p2p): return Head() early when enough peers confirm the same header #372

Merged
walldiss merged 5 commits into celestiaorg:main from
walldiss:fix/head-early-return
Mar 9, 2026

@walldiss
Member

@walldiss walldiss commented Mar 4, 2026

Summary

  • Head() now tracks received headers by hash in real-time during the collection loop
  • Once a header hash reaches minHeadResponses (2) confirmations, outstanding peer requests are cancelled via the shared request context and the function returns immediately
  • Previously Head() waited for ALL trusted peers to respond or time out; with many bootstrappers, a single slow peer could stretch the entire call until it ran close to startup deadlines

Context

Light nodes on mocha fail to start because exchange.Head() waits for all trusted peers (bootstrappers) to respond or time out before returning. With 7 mocha bootstrappers, if even 1 peer is slow to dial (~18s timeout), the entire Head() call takes ~18s, dangerously close to the 20s startup deadline. The 4 working peers respond in under 230ms, so the node should return as soon as it has consensus.

Closes #373
Closes https://linear.app/celestia/issue/DA-1157

@Wondertan
Member

How it works now: We request from all peers and give them 90% of the deadline to respond. Responses that arrive within that window are the ones considered when picking the bestHead.

How it works with PR: We await the first 2 responses with the same hash and return asap.

Both should work, and there is a test proving that the existing solution works as well. The difference is that the original solution intentionally tries to get as many responses as possible to maximise security.

What I think actually broke is that the 10% left for the rest of the operation was not enough for whatever else the node was doing after the Head request, so the ctx deadline was exceeded.

@walldiss
Member Author

walldiss commented Mar 5, 2026

You're right that the current approach intentionally maximizes responses within the 90% window. The problem is exactly what you identified: the remaining 10% isn't enough for what follows.

I think the early return is actually the better approach here. The security threshold (minHeadResponses) defines how many agreeing peers we need in order to trust a head, and once that's satisfied, collecting more responses doesn't meaningfully improve security. It just eats into the budget that downstream operations (GetByHeight, syncer init) need to complete within the startup deadline.

So rather than trying to tune the 90/10 split, we should let the security threshold be the only thing that governs when Head returns. Faster startup with the same security guarantee.

@Wondertan
Member

I tend to agree with you here on relying on the threshold. One thing we should probably do then is to increase the threshold and make it more dynamic based on the number of peers/responses we got.
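
One possible shape for such a dynamic threshold, purely as an illustration: the majority rule below is an assumption made for this sketch and was not proposed or agreed on in this thread, and `dynamicThreshold` is a hypothetical name. Only `minHeadResponses` (2) comes from the PR.

```go
package main

import "fmt"

// minHeadResponses is the fixed threshold from the PR, used here as a floor.
const minHeadResponses = 2

// dynamicThreshold is a hypothetical policy: require a strict majority of
// the queried peers, but never fewer than minHeadResponses.
// Note: with a single peer the result (2) can never be met, so a real
// implementation would need to handle that case separately.
func dynamicThreshold(peers int) int {
	t := peers/2 + 1
	if t < minHeadResponses {
		t = minHeadResponses
	}
	return t
}

func main() {
	for _, n := range []int{2, 3, 7, 10} {
		fmt.Printf("peers=%d threshold=%d\n", n, dynamicThreshold(n))
	}
}
```

Under this assumed rule, 7 bootstrappers would require 4 matching responses rather than 2, trading some startup latency for a stronger agreement requirement.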

Additionally, the current code keeps both code paths for fast and best heads. The fast head goes on top of bestHead, and bestHead, in fact, never gets triggered. If we go for one way only, we should update the code to do one thing only.

@walldiss walldiss force-pushed the fix/head-early-return branch from 0d1f33a to 08d2d8f Compare March 6, 2026 15:16
@codecov-commenter

Codecov Report

❌ Patch coverage is 89.65517% with 3 lines in your changes missing coverage. Please review.
✅ Project coverage is 54.90%. Comparing base (aa5c4a6) to head (bf80e89).
⚠️ Report is 9 commits behind head on main.

| Files with missing lines | Patch % | Lines |
|---|---|---|
| p2p/exchange.go | 88.00% | 2 Missing and 1 partial ⚠️ |
Additional details and impacted files
@@            Coverage Diff             @@
##             main     #372      +/-   ##
==========================================
+ Coverage   52.99%   54.90%   +1.91%     
==========================================
  Files          41       41              
  Lines        4663     4759      +96     
==========================================
+ Hits         2471     2613     +142     
+ Misses       2007     1951      -56     
- Partials      185      195      +10     

walldiss added 5 commits March 9, 2026 18:05
Head() currently waits for ALL trusted peers to respond or timeout before
returning. With many bootstrappers, if even one peer is slow to dial (~18s),
the entire Head() call takes ~18s — dangerously close to typical startup
deadlines.

This change tracks received headers by hash in real-time during the
collection loop. Once a header hash reaches minHeadResponses (2)
confirmations, the function cancels outstanding peer requests via the
shared request context and returns immediately.
@walldiss walldiss force-pushed the fix/head-early-return branch from 6edb3b6 to 3c11a8b Compare March 9, 2026 15:06
Member

@renaynay renaynay left a comment

Logic + test LGTM - might want to revisit minHeadResponses threshold @walldiss

@walldiss walldiss enabled auto-merge (squash) March 9, 2026 15:47
Member

@renaynay renaynay left a comment

Ok w me

@walldiss walldiss merged commit 9747578 into celestiaorg:main Mar 9, 2026
2 checks passed

Development

Successfully merging this pull request may close these issues.

Light nodes fail to start on mocha: Head() waits for all trusted peers instead of returning on consensus

4 participants