Skip to content

fix: healthcheck not responding#6519

Merged
LesnyRumcajs merged 1 commit intomainfrom
fix-healthchecks-slow
Feb 4, 2026
Merged

fix: healthcheck not responding#6519
LesnyRumcajs merged 1 commit intomainfrom
fix-healthchecks-slow

Conversation

@LesnyRumcajs
Copy link
Copy Markdown
Member

@LesnyRumcajs LesnyRumcajs commented Feb 3, 2026

Summary of changes

Changes introduced in this pull request:

  • this should improve the overall performance of the node given shorter critical section for widely used fields,
  • introduced small timeouts for local checks so that the healthcheck is snappy - the default for F3 is 60s and for the simple connection probably something OS-specific - too long!

Reference issue to close (if applicable)

Closes #6515

Other information and links

Change checklist

  • I have performed a self-review of my own code,
  • I have made corresponding changes to the documentation. All new code adheres to the team's documentation standards,
  • I have added tests that prove my fix is effective or that my feature works (if possible),
  • I have made sure the CHANGELOG is up-to-date. All user-facing changes should be reflected in this document.

Outside contributions

  • I have read and agree to the CONTRIBUTING document.
  • I have read and agree to the AI Policy document. I understand that failure to comply with the guidelines will lead to rejection of the pull request.

Summary by CodeRabbit

  • Bug Fixes

    • Added short timeouts to health checks to improve responsiveness and avoid hangs during service delays.
  • Refactor

    • Made synchronization status updates safer and more efficient by switching to a non-mutating update pattern that reduces contention.
    • Improved peer-failure handling with early short-circuiting for known bad peers to reduce overhead and speed recovery.

@LesnyRumcajs LesnyRumcajs requested a review from a team as a code owner February 3, 2026 17:14
@LesnyRumcajs LesnyRumcajs requested review from akaladarshi and hanabi1224 and removed request for a team February 3, 2026 17:14
@coderabbitai
Copy link
Copy Markdown
Contributor

coderabbitai Bot commented Feb 3, 2026

Walkthrough

Refactors sync-status update to an immutable read-clone-compute-write pattern, adds a 2s timeout to health probe operations to avoid stalled checks, and short-circuits peer-failure logging via a pre-check against bad_peers.

Changes

Cohort / File(s) Summary
Chain Sync Refactor
src/chain_sync/sync_status.rs, src/chain_sync/chain_follower.rs
SyncStatusReport::update changed from &mut self to &self -> Self and now returns a new report. chain_follower reads/clones the status, computes a new report, then writes it back (reduces write-lock hold and removes in-place mutation).
Health Check Timeout Protection
src/health/endpoints.rs, Cargo.toml
Adds MAX_REQ_DURATION_SECS = 2 and wraps check_rpc_server_running and check_f3_running calls with tokio::time::timeout to prevent /livez stalls. Minor Cargo change included.
Peer Manager Control Flow
src/libp2p/peer_manager.rs
log_failure now does a read-lock pre-check against bad_peers and returns early for bad peers; per-peer metrics and write-lock updates only run for non-bad peers (changes synchronization/check order).

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

Possibly related PRs

Suggested reviewers

  • akaladarshi
  • hanabi1224
  • elmattic
🚥 Pre-merge checks | ✅ 4 | ❌ 1
❌ Failed checks (1 warning)
Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 57.14% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (4 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The title 'fix: healthcheck not responding' clearly relates to the primary objective of preventing healthcheck endpoints from stalling, matching the linked issue.
Linked Issues check ✅ Passed The PR addresses key requirements from issue #6515: adds small timeouts to RPC and F3 health checks in endpoints.rs, reduces lock duration in chain_sync modules to prevent stalling, and prevents blocking during healthcheck operations.
Out of Scope Changes check ✅ Passed All changes align with the stated objectives: timeout additions to health checks, lock duration optimizations in sync modules, and peer manager improvements are all within scope of preventing healthcheck stalling.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing touches
  • 📝 Generate docstrings
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Post copyable unit tests in a comment
  • Commit unit tests in branch fix-healthchecks-slow

Comment @coderabbitai help to get the list of available commands and usage tips.

@LesnyRumcajs LesnyRumcajs changed the title fix: healthcheck not respodning fix: healthcheck not responding Feb 3, 2026
Copy link
Copy Markdown
Contributor

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 3

🤖 Fix all issues with AI agents
In `@src/chain_sync/sync_status.rs`:
- Around line 162-168: The trace log is printing the old field
self.epochs_behind instead of the freshly computed local epochs_behind; update
the log::trace! call in the SyncStatus reporting code to use the local variable
epochs_behind (computed from
network_head_epoch.saturating_sub(current_head_epoch)) so the logged "epochs
behind" reflects the newly computed value rather than the stale
self.epochs_behind.

In `@src/health/endpoints.rs`:
- Around line 174-180: The current timeout check only tests that the timeout did
not elapse, not that F3 actually returned Ok(true); change the condition that
currently uses tokio::time::timeout(...).await.is_ok() to explicitly inspect the
nested Result from tokio::time::timeout and F3IsRunning::is_f3_running() (e.g.,
match or if let to require Ok(Ok(true))) so the branch only runs when the
timeout completed and the inner call returned Ok(true); update the code
surrounding F3IsRunning::is_f3_running() and MAX_REQ_DURATION_SECS accordingly.
- Around line 144-150: The current check uses
tokio::time::timeout(...).await.is_ok(), which only verifies the timeout didn't
elapse and not whether tokio::net::TcpStream::connect actually succeeded; change
the logic around tokio::time::timeout, MAX_REQ_DURATION_SECS, and
state.config.client.rpc_address to await the timeout result and explicitly
handle its nested Result: treat Ok(Ok(_stream)) as a successful connection, and
treat Ok(Err(_err)) or Err(_elapsed) as failures (log/return the RPC-down path
accordingly), so the endpoint only reports "rpc server running" when the connect
call truly succeeded.

Comment thread src/chain_sync/sync_status.rs
Comment thread src/health/endpoints.rs
Comment thread src/health/endpoints.rs
@LesnyRumcajs LesnyRumcajs force-pushed the fix-healthchecks-slow branch 3 times, most recently from a7c7a3d to 43be80a Compare February 3, 2026 17:29
@LesnyRumcajs LesnyRumcajs force-pushed the fix-healthchecks-slow branch from 43be80a to a6bddf0 Compare February 3, 2026 17:29
@codecov
Copy link
Copy Markdown

codecov Bot commented Feb 3, 2026

Codecov Report

❌ Patch coverage is 13.95349% with 37 lines in your changes missing coverage. Please review.
✅ Project coverage is 63.31%. Comparing base (46ac35d) to head (a6bddf0).
⚠️ Report is 4 commits behind head on main.

Files with missing lines Patch % Lines
src/chain_sync/sync_status.rs 0.00% 20 Missing ⚠️
src/libp2p/peer_manager.rs 0.00% 7 Missing ⚠️
src/health/endpoints.rs 50.00% 6 Missing ⚠️
src/chain_sync/chain_follower.rs 0.00% 4 Missing ⚠️
Additional details and impacted files
Files with missing lines Coverage Δ
src/chain_sync/chain_follower.rs 39.54% <0.00%> (-0.13%) ⬇️
src/health/endpoints.rs 89.47% <50.00%> (-2.89%) ⬇️
src/libp2p/peer_manager.rs 22.92% <0.00%> (-0.12%) ⬇️
src/chain_sync/sync_status.rs 12.00% <0.00%> (-1.34%) ⬇️

... and 6 files with indirect coverage changes


Continue to review full report in Codecov by Sentry.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 46ac35d...a6bddf0. Read the comment docs.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

@LesnyRumcajs LesnyRumcajs added this pull request to the merge queue Feb 4, 2026
Merged via the queue into main with commit 5157235 Feb 4, 2026
34 of 37 checks passed
@LesnyRumcajs LesnyRumcajs deleted the fix-healthchecks-slow branch February 4, 2026 11:59
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Healthcheck stalling

2 participants