[management] Add heartbeat to Job endpoint to prevent proxy timeouts #5185
maxpain wants to merge 2 commits into netbirdio:main
Conversation
The Job streaming endpoint blocks indefinitely when there are no pending jobs, causing reverse proxies (Traefik, Cloudflare, Nginx) to time out the connection after their idle timeout period (typically 60-120 seconds). While the gRPC server has keepalive configured (HTTP/2 PING frames), reverse proxies measure idle time based on HTTP/2 DATA frames, not PING frames. This causes 504 Gateway Timeout errors for self-hosted deployments behind proxies. This fix adds a 30-second heartbeat that sends an empty JobRequest to keep the stream active. The client already handles empty messages gracefully (it logs "received unknown or empty job request, skipping"). Fixes netbirdio#5184
Note: CodeRabbit has detected other AI code review bot(s) in this pull request and will avoid duplicating their findings in the review comments. This may lead to a less comprehensive review.

📝 Walkthrough

Adds a 30s heartbeat to the Management gRPC Job stream and exposes the job event channel for non-blocking selects; the stream loop now handles heartbeats, context cancellation, and job events via a select-based loop.

Changes
Sequence Diagram(s)

```mermaid
sequenceDiagram
    participant Client
    participant Server
    participant EventChannel
    participant Ticker
    Ticker->>Server: tick (every 30s)
    Server->>Server: sendHeartbeat(ctx, peerKey, srv)
    Server->>Client: Send encrypted empty JobRequest (heartbeat)
    Client->>Client: receive/process heartbeat
    EventChannel->>Server: job event available
    Server->>Server: read event from EventChan()
    Server->>Client: Send encrypted Job envelope
    Client->>Client: process job
    Client->>Server: may cancel context / disconnect
    Server->>Server: detect context cancellation and return / stop stream
```
Estimated code review effort: 🎯 3 (Moderate) | ⏱️ ~20 minutes
🚥 Pre-merge checks: ✅ 5 passed
Pull request overview
Adds a periodic heartbeat to the /management.ManagementService/Job gRPC streaming endpoint to prevent idle reverse-proxy timeouts (e.g., Traefik/Cloudflare/Nginx) when no jobs are pending.
Changes:
- Exposes the underlying job event channel via `EventChan()` for `select`-based consumption.
- Updates `sendJobsLoop` to multiplex between job events, context cancellation, and a heartbeat ticker.
- Introduces `sendHeartbeat` to send an encrypted empty `JobRequest` periodically.
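The `EventChan()` change in the first bullet can be illustrated with a toy channel wrapper. All types here (`Event`, `Channel`) are hypothetical simplifications for the sketch, not NetBird's actual definitions:

```go
package main

import "fmt"

// Event is a stand-in for a job event payload.
type Event struct{ ID string }

// Channel wraps a buffered event channel. Before the change, consumers
// could only read via blocking helpers; exposing the raw receive side
// lets callers combine it with other cases in a select statement.
type Channel struct {
	events chan *Event
}

func NewChannel(buffer int) *Channel {
	return &Channel{events: make(chan *Event, buffer)}
}

// EventChan exposes the receive side of the event channel so the stream
// loop can select over it alongside a ticker and context cancellation.
func (c *Channel) EventChan() <-chan *Event {
	return c.events
}

// Publish enqueues a job event.
func (c *Channel) Publish(e *Event) {
	c.events <- e
}

func main() {
	ch := NewChannel(1)
	ch.Publish(&Event{ID: "job-42"})
	select {
	case ev := <-ch.EventChan():
		fmt.Println(ev.ID) // job-42
	default:
		fmt.Println("no event")
	}
}
```

Returning the receive-only type `<-chan *Event` keeps the accessor safe: callers can select and receive, but cannot send into or close the channel.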
Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 4 comments.

| File | Description |
|---|---|
| management/server/job/channel.go | Adds `EventChan()` to allow non-blocking `select` over job events. |
| management/internals/shared/grpc/server.go | Implements periodic heartbeats in the Job stream loop and adds `sendHeartbeat()`. |
Actionable comments posted: 1
🤖 Fix all issues with AI agents
In `@management/internals/shared/grpc/server.go`:
- Around lines 392-399: the update-handling loop silences `sendJob` errors by returning nil. Change it to propagate the actual error, just as `sendHeartbeat` does: when `s.sendJob(ctx, peerKey, event, srv)` returns an error, log it and return that error (not nil) so the gRPC stream surfaces failures to clients and operators. Update the `case event := <-updates.EventChan():` branch to return the error from `s.sendJob`, keep the existing logging, and keep the behavior consistent with `sendHeartbeat` and any TODOs around error handling.
- Extract heartbeat interval to a named constant (`jobStreamHeartbeatInterval`)
- Return the error from `sendJob` instead of nil for proper error propagation
- Reset the ticker after a successful job send to avoid unnecessary heartbeats
- Handle `context.Canceled` and `io.EOF` in `sendHeartbeat` to preserve proper shutdown semantics instead of wrapping them as Internal errors
Confirming this bug in production with HAProxy in front of Management. Running NetBird 0.68.3 self-hosted (management + signal + relay + router peer, all on 0.68.3). HAProxy 3.3 fronts the management backend over HTTP/2 with `alpn h2,http/1.1`. The symptom is perfectly deterministic:

```
[dd/Mon/yyyy:hh:mm:ss.sss] https~ netbird_mgmt/... 0/0/0/113/3600114 200 5120 - - sD-- ...
```

The underlying issue appears to be on the client side: the management gRPC Sync client in shared/management/client/grpc.go doesn't recreate the grpc.ClientConn when the stream breaks.
@golem131 #5750 was created for a special case: when the gRPC stream (not the connection) was broken (e.g., due to a protocol error), the original code attempted to reuse the same connection. With that patch, any stream-related error forces the connection itself to be recreated. The long-lived reverse-proxy configuration is a different topic. The preferred approach is for the protocol to manage keep-alive signals, not the application layer. Of course, if the reverse proxy is not configured for long-lived, silent connections, that becomes a problem. We plan to migrate the stream reconnection logic to the Signal and Management servers in the future.
Summary
- Adds a periodic heartbeat to the `/management.ManagementService/Job` streaming endpoint to keep connections alive through reverse proxies

Problem
The Job streaming endpoint blocks indefinitely when there are no pending jobs, causing reverse proxies (Traefik, Cloudflare, Nginx) to time out the connection after their idle timeout period (typically 60-120 seconds).
While the gRPC server has keepalive configured (HTTP/2 PING frames), reverse proxies measure idle time based on HTTP/2 DATA frames, not PING frames. This causes 504 Gateway Timeout errors for self-hosted deployments behind proxies.
Solution
Add a periodic heartbeat (every 30 seconds) that sends an empty `JobRequest` to keep the stream active. The client already handles empty messages gracefully: it logs "received unknown or empty job request, skipping" and continues.

Changes
- management/server/job/channel.go: Add `EventChan()` method to expose the channel for `select`
- management/internals/shared/grpc/server.go: Modify `sendJobsLoop` to use a ticker and add a `sendHeartbeat` method

Test plan
Fixes #5184