[management] Add heartbeat to Job endpoint to prevent proxy timeouts #5185

Open

maxpain wants to merge 2 commits into netbirdio:main from maxpain:fix/job-endpoint-heartbeat

Conversation


@maxpain maxpain commented Jan 26, 2026

Summary

  • Add 30-second heartbeat to the /management.ManagementService/Job streaming endpoint to keep connections alive through reverse proxies

Problem

The Job streaming endpoint blocks indefinitely when there are no pending jobs, causing reverse proxies (Traefik, Cloudflare, Nginx) to time out the connection after their idle timeout period (typically 60-120 seconds).

While the gRPC server has keepalive configured (HTTP/2 PING frames), reverse proxies measure idle time based on HTTP/2 DATA frames, not PING frames. This causes 504 Gateway Timeout errors for self-hosted deployments behind proxies.
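
For reference, this is roughly what server-side keepalive looks like in grpc-go; the values and setup here are illustrative assumptions, not NetBird's actual configuration:

package mgmt

import (
	"time"

	"google.golang.org/grpc"
	"google.golang.org/grpc/keepalive"
)

// newServer builds a gRPC server that sends HTTP/2 PING frames on idle
// connections. Proxies that count only DATA frames still see the stream
// as idle, which is the gap this PR closes.
func newServer() *grpc.Server {
	return grpc.NewServer(
		grpc.KeepaliveParams(keepalive.ServerParameters{
			Time:    30 * time.Second, // PING after 30s of inactivity (assumed value)
			Timeout: 10 * time.Second, // drop the connection if the PING goes unanswered
		}),
	)
}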

Solution

Add a periodic heartbeat (every 30 seconds) that sends an empty JobRequest to keep the stream active. The client already handles empty messages gracefully — it logs "received unknown or empty job request, skipping" and continues.

Changes

  • management/server/job/channel.go: Add EventChan() method to expose channel for select
  • management/internals/shared/grpc/server.go: Modify sendJobsLoop to use a ticker and add a sendHeartbeat method (loop shape sketched below)
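
A minimal sketch of the loop shape this describes, assuming placeholder types; the real sendJobsLoop operates on NetBird's job channel and encrypted gRPC stream:

package mgmt

import (
	"context"
	"time"
)

const heartbeatInterval = 30 * time.Second // jobStreamHeartbeatInterval in the PR

// sendLoop multiplexes three things: context cancellation, the heartbeat
// ticker, and incoming job events. send and heartbeat stand in for the
// real sendJob and sendHeartbeat methods.
func sendLoop(ctx context.Context, events <-chan string,
	send func(string) error, heartbeat func() error) error {
	ticker := time.NewTicker(heartbeatInterval)
	defer ticker.Stop()

	for {
		select {
		case <-ctx.Done():
			return ctx.Err()
		case <-ticker.C:
			if err := heartbeat(); err != nil {
				return err
			}
		case ev := <-events:
			if err := send(ev); err != nil {
				return err
			}
			// A real job send is activity too, so the next heartbeat
			// can wait a full interval.
			ticker.Reset(heartbeatInterval)
		}
	}
}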

Test plan

  • Verified build passes
  • Tested with Traefik proxy (timeout issue resolved)
  • Ran existing tests

Fixes #5184

Summary by CodeRabbit

  • New Features

    • Periodic heartbeat messaging added to keep job streams active and reduce unexpected disconnects.
    • Streaming loop upgraded for more responsive, event-driven job delivery.
    • Exposed a read-only events channel for safer integration with select-based consumers.
  • Bug Fixes

    • Improved error handling and connection lifecycle control for more reliable job delivery.


Copilot AI review requested due to automatic review settings January 26, 2026 17:12

CLAassistant commented Jan 26, 2026

CLA assistant check
All committers have signed the CLA.

Contributor

coderabbitai Bot commented Jan 26, 2026

Note

Other AI code review bot(s) detected

CodeRabbit has detected other AI code review bot(s) in this pull request and will avoid duplicating their findings in the review comments. This may lead to a less comprehensive review.

📝 Walkthrough

Adds a 30s heartbeat to the Management gRPC Job stream and exposes the job event channel for non-blocking selects; stream loop now handles heartbeats, context cancellation, and job events via a select-based loop.

Changes

  • gRPC Job stream + heartbeat (management/internals/shared/grpc/server.go): Introduces jobStreamHeartbeatInterval = 30s; replaces the blocking job-send loop with a select-based loop handling context cancellation, the heartbeat ticker, and incoming job events; adds sendHeartbeat(ctx, peerKey, srv) error to build, encrypt, and send an empty JobRequest as a heartbeat; and updates error handling, resetting the heartbeat timer after sends.
  • Job event channel accessor (management/server/job/channel.go): Adds EventChan() <-chan *Event to expose the internal events channel as a read-only channel for use in select statements (see the sketch below).
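
A minimal sketch of what such an accessor looks like; the Event type and field name here are stand-ins:

package job

// Event is a stand-in for the real job event type.
type Event struct{}

// Channel wraps the internal events channel.
type Channel struct {
	events chan *Event
}

// EventChan returns the events channel as receive-only, so callers can
// use it in a select without being able to send on or close it.
func (c *Channel) EventChan() <-chan *Event {
	return c.events
}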

Sequence Diagram(s)

sequenceDiagram
    participant Client
    participant Server
    participant EventChannel
    participant Ticker

    Ticker->>Server: tick (every 30s)
    Server->>Server: sendHeartbeat(ctx, peerKey, srv)
    Server->>Client: Send encrypted empty JobRequest (heartbeat)
    Client->>Client: receive/process heartbeat

    EventChannel->>Server: job event available
    Server->>Server: read event from EventChan()
    Server->>Client: Send encrypted Job envelope
    Client->>Client: process job

    Client->>Server: may cancel context / disconnect
    Server->>Server: detect context cancellation and return / stop stream

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

Poem

🐇 I hop a little heartbeat, steady and bright,
every thirty seconds I nudge the night.
No more timeouts that make us sigh,
a tiny pulse keeps connections nigh.
Hooray — the stream stays warm and light! 🎉

🚥 Pre-merge checks: ✅ 5 passed

  • Title check ✅ Passed: The title clearly and specifically describes the main change: adding a heartbeat mechanism to the Job endpoint to prevent proxy timeouts.
  • Description check ✅ Passed: The PR description is comprehensive, covering the problem statement, solution, changes, and test plan; however, the checklist section in the description template is not filled out.
  • Linked Issues check ✅ Passed: The code changes directly address the linked issue #5184 by implementing a 30-second heartbeat mechanism to prevent reverse-proxy timeouts, resolving 504 Gateway Timeout errors caused by proxies tracking idle time on HTTP/2 DATA frames.
  • Out of Scope Changes check ✅ Passed: All changes are directly scoped to the heartbeat implementation objective: the EventChan() method exposes the channel for select-based event handling, and sendJobsLoop/sendHeartbeat implement the heartbeat logic as designed.
  • Docstring Coverage ✅ Passed: No functions found in the changed files to evaluate docstring coverage; the check was skipped.

Contributor

Copilot AI left a comment

Pull request overview

Adds a periodic heartbeat to the /management.ManagementService/Job gRPC streaming endpoint to prevent idle reverse-proxy timeouts (e.g., Traefik/Cloudflare/Nginx) when no jobs are pending.

Changes:

  • Exposes the underlying job event channel via EventChan() for select-based consumption.
  • Updates sendJobsLoop to multiplex between job events, context cancellation, and a heartbeat ticker.
  • Introduces sendHeartbeat to send an encrypted empty JobRequest periodically.

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 4 comments.

  • management/server/job/channel.go: Adds EventChan() to allow non-blocking select over job events.
  • management/internals/shared/grpc/server.go: Implements periodic heartbeats in the Job stream loop and adds sendHeartbeat().


Contributor

@coderabbitai coderabbitai Bot left a comment

Actionable comments posted: 1

🤖 Fix all issues with AI agents
In `@management/internals/shared/grpc/server.go`:
- Around lines 392-399: the update-handling loop silences sendJob errors by returning nil. Propagate the actual error, as sendHeartbeat already does: when s.sendJob(ctx, peerKey, event, srv) returns an error, log it and return that error (not nil) so the gRPC stream surfaces failures to clients and operators. Update the `case event := <-updates.EventChan():` handling to return the error from s.sendJob while keeping the existing logging, so the behavior stays consistent with sendHeartbeat and the existing TODOs around error handling.
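
In code terms, the requested change is one line inside the select loop's event case (a fragment, not a complete function; the identifiers follow the comment above, and the log call assumes a logrus-style logger):

case event := <-updates.EventChan():
	if err := s.sendJob(ctx, peerKey, event, srv); err != nil {
		log.WithContext(ctx).Errorf("failed to send job to peer %s: %v", peerKey, err)
		return err // was `return nil`, which hid the failure from the stream
	}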

- Extract heartbeat interval to named constant (jobStreamHeartbeatInterval)
- Return error from sendJob instead of nil for proper error propagation
- Reset ticker after successful job send to avoid unnecessary heartbeats
- Handle context.Canceled and io.EOF in sendHeartbeat to preserve
  proper shutdown semantics instead of wrapping as Internal error
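
A sketch of the error classification that last bullet describes; the stream interface and message type below are stand-ins, and only the context.Canceled / io.EOF handling is taken from the commit message:

package mgmt

import (
	"context"
	"errors"
	"io"

	"google.golang.org/grpc/codes"
	"google.golang.org/grpc/status"
)

// Encrypted and jobStream are stand-ins for the generated gRPC stream
// and NetBird's encrypted message type.
type Encrypted struct{}
type jobStream interface{ Send(*Encrypted) error }

func sendHeartbeat(ctx context.Context, srv jobStream) error {
	// In the real code the payload is an encrypted, empty JobRequest.
	if err := srv.Send(&Encrypted{}); err != nil {
		// Normal shutdown paths keep their identity so the caller can
		// close the stream cleanly instead of reporting a server fault.
		if errors.Is(err, context.Canceled) || errors.Is(err, io.EOF) {
			return err
		}
		return status.Errorf(codes.Internal, "failed to send heartbeat: %v", err)
	}
	return nil
}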
Contributor

Copilot AI left a comment


Copilot encountered an error and was unable to review this pull request. You can try again by re-requesting a review.

@golem131

Confirming this bug in production with HAProxy in front of Management

Running NetBird 0.68.3 self-hosted (management + signal + relay + router peer all on 0.68.3). HAProxy 3.3 fronts the management backend over HTTP/2 with alpn h2,http/1.1
and http-reuse always. HAProxy config has timeout server 3600s on the management backend — the commonly recommended value for gRPC.

Symptom is perfectly deterministic:

  1. Router peer container starts at time T.
  2. A management.ManagementService/Sync long-lived server-streaming RPC is established through HAProxy.
  3. At exactly T + 3600s (within ~100 ms), HAProxy closes the stream with termination state sD-- (server-side inactivity timeout, data phase). HAProxy log excerpt:

[dd/Mon/yyyy:hh:mm:ss.sss] https~ netbird_mgmt/... 0/0/0/113/3600114 200 5120 - - sD-- ...
"POST /management.ManagementService/Sync HTTP/2.0"
  4. TCP connection from the router to HAProxy stays ESTABLISHED post-event (verified with ss). HAProxy just closed the HTTP/2 stream, not the TCP conn.
  5. Peer goes "offline N min ago" in dashboard and never recovers — stays offline indefinitely until netbird service restart on the router. No amount of waiting helps.

The underlying issue appears to be on the client side: the management gRPC Sync client in shared/management/client/grpc.go doesn't recreate the grpc.ClientConn when the
stream is closed by the proxy — it reuses the (now zombie) conn, and new Sync RPCs opened on it either hang or return immediately in a failure loop, so the outer
backoff.Retry never actually recovers. This pattern was fixed for the telemetry flow client in #5750 (close+re-dial on codes.Internal/RST_STREAM, drop WaitForReady,
reset backoff on success) but that fix was not ported to the management Sync client, so any reverse proxy with a finite grpc_read_timeout / timeout server / idle timeout
triggers this.
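
For illustration, the close-and-re-dial pattern from #5750 that the comment references looks roughly like this; the dial call, credentials, and backoff settings are assumptions, not NetBird's actual client code:

package client

import (
	"context"

	"github.com/cenkalti/backoff/v4"
	"google.golang.org/grpc"
	"google.golang.org/grpc/credentials/insecure"
)

// syncWithRedial dials a fresh ClientConn on every attempt, so a stream
// killed by a proxy cannot strand a zombie connection across retries.
// runSync stands in for running the long-lived Sync stream until error.
func syncWithRedial(ctx context.Context, addr string, runSync func(*grpc.ClientConn) error) error {
	op := func() error {
		conn, err := grpc.NewClient(addr, grpc.WithTransportCredentials(insecure.NewCredentials()))
		if err != nil {
			return err
		}
		// Closing the conn on any error forces the next attempt to
		// re-dial instead of reusing a dead connection.
		defer conn.Close()
		return runSync(conn)
	}
	// Exponential backoff between failed attempts; stops once op
	// succeeds or ctx is cancelled.
	return backoff.Retry(op, backoff.WithContext(backoff.NewExponentialBackOff(), ctx))
}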

Collaborator

pappz commented Apr 15, 2026

@golem131 #5750 was created for a special case: when the gRPC stream (not the connection) was broken, e.g. due to a protocol error, the original code attempted to reuse the same connection. With that patch, any stream-related error forces the connection itself to be recreated.

The long-lived reverse proxy configuration is a different topic. The preferred approach is for the protocol to manage keep-alive signals, not the application layer. Of course, if the reverse proxy is not configured for long-lived, silent connections, then it becomes a problem.

We plan to migrate the stream reconnection logic to the Signal and Management servers in the future.
