[management] Add heartbeat to Job endpoint to prevent proxy timeouts #5185

Open

maxpain wants to merge 2 commits into netbirdio:main from maxpain:fix/job-endpoint-heartbeat

Conversation


@maxpain maxpain commented Jan 26, 2026

Summary

  • Add 30-second heartbeat to the /management.ManagementService/Job streaming endpoint to keep connections alive through reverse proxies

Problem

The Job streaming endpoint blocks indefinitely when there are no pending jobs, causing reverse proxies (Traefik, Cloudflare, Nginx) to time out the connection after their idle timeout period (typically 60-120 seconds).

While the gRPC server has keepalive configured (HTTP/2 PING frames), reverse proxies measure idle time based on HTTP/2 DATA frames, not PING frames. This causes 504 Gateway Timeout errors for self-hosted deployments behind proxies.
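
For reference, this is roughly what server-side keepalive looks like in grpc-go; the values and setup here are illustrative assumptions, not NetBird's actual configuration:

package mgmt

import (
	"time"

	"google.golang.org/grpc"
	"google.golang.org/grpc/keepalive"
)

// newServer builds a gRPC server that sends HTTP/2 PING frames on idle
// connections. Proxies that count only DATA frames still see the stream
// as idle, which is the gap this PR closes.
func newServer() *grpc.Server {
	return grpc.NewServer(
		grpc.KeepaliveParams(keepalive.ServerParameters{
			Time:    30 * time.Second, // PING after 30s of inactivity (assumed value)
			Timeout: 10 * time.Second, // drop the connection if the PING goes unanswered
		}),
	)
}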

Solution

Add a periodic heartbeat (every 30 seconds) that sends an empty JobRequest to keep the stream active. The client already handles empty messages gracefully — it logs "received unknown or empty job request, skipping" and continues.

Changes

  • management/server/job/channel.go: Add EventChan() method to expose channel for select
  • management/internals/shared/grpc/server.go: Modify sendJobsLoop to use a ticker and add a sendHeartbeat method (loop shape sketched below)
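
A minimal sketch of the loop shape this describes, assuming placeholder types; the real sendJobsLoop operates on NetBird's job channel and encrypted gRPC stream:

package mgmt

import (
	"context"
	"time"
)

const heartbeatInterval = 30 * time.Second // jobStreamHeartbeatInterval in the PR

// sendLoop multiplexes three things: context cancellation, the heartbeat
// ticker, and incoming job events. send and heartbeat stand in for the
// real sendJob and sendHeartbeat methods.
func sendLoop(ctx context.Context, events <-chan string,
	send func(string) error, heartbeat func() error) error {
	ticker := time.NewTicker(heartbeatInterval)
	defer ticker.Stop()

	for {
		select {
		case <-ctx.Done():
			return ctx.Err()
		case <-ticker.C:
			if err := heartbeat(); err != nil {
				return err
			}
		case ev := <-events:
			if err := send(ev); err != nil {
				return err
			}
			// A real job send is activity too, so the next heartbeat
			// can wait a full interval.
			ticker.Reset(heartbeatInterval)
		}
	}
}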

Test plan

  • Verified build passes
  • Tested with Traefik proxy (timeout issue resolved)
  • Ran existing tests

Fixes #5184

Summary by CodeRabbit

  • New Features

    • Periodic heartbeat messaging added to keep job streams active and reduce unexpected disconnects.
    • Streaming loop upgraded for more responsive, event-driven job delivery.
    • Exposed a read-only events channel for safer integration with select-based consumers.
  • Bug Fixes

    • Improved error handling and connection lifecycle control for more reliable job delivery.


Copilot AI review requested due to automatic review settings January 26, 2026 17:12

CLAassistant commented Jan 26, 2026

CLA assistant check
All committers have signed the CLA.

Contributor

coderabbitai Bot commented Jan 26, 2026

Note

Other AI code review bot(s) detected

CodeRabbit has detected other AI code review bot(s) in this pull request and will avoid duplicating their findings in the review comments. This may lead to a less comprehensive review.

📝 Walkthrough

Adds a 30s heartbeat to the Management gRPC Job stream and exposes the job event channel for non-blocking selects; stream loop now handles heartbeats, context cancellation, and job events via a select-based loop.

Changes

  • gRPC Job stream + heartbeat (management/internals/shared/grpc/server.go): Introduces jobStreamHeartbeatInterval = 30s; replaces the blocking job-send loop with a select-based loop handling context cancellation, the heartbeat ticker, and incoming job events; adds sendHeartbeat(ctx, peerKey, srv) error to build, encrypt, and send an empty JobRequest as a heartbeat; and updates error handling, resetting the heartbeat timer after sends.
  • Job event channel accessor (management/server/job/channel.go): Adds EventChan() <-chan *Event to expose the internal events channel as a read-only channel for use in select statements (see the sketch below).
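
A minimal sketch of what such an accessor looks like; the Event type and field name here are stand-ins:

package job

// Event is a stand-in for the real job event type.
type Event struct{}

// Channel wraps the internal events channel.
type Channel struct {
	events chan *Event
}

// EventChan returns the events channel as receive-only, so callers can
// use it in a select without being able to send on or close it.
func (c *Channel) EventChan() <-chan *Event {
	return c.events
}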

Sequence Diagram(s)

sequenceDiagram
    participant Client
    participant Server
    participant EventChannel
    participant Ticker

    Ticker->>Server: tick (every 30s)
    Server->>Server: sendHeartbeat(ctx, peerKey, srv)
    Server->>Client: Send encrypted empty JobRequest (heartbeat)
    Client->>Client: receive/process heartbeat

    EventChannel->>Server: job event available
    Server->>Server: read event from EventChan()
    Server->>Client: Send encrypted Job envelope
    Client->>Client: process job

    Client->>Server: may cancel context / disconnect
    Server->>Server: detect context cancellation and return / stop stream

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

Poem

🐇 I hop a little heartbeat, steady and bright,
every thirty seconds I nudge the night.
No more timeouts that make us sigh,
a tiny pulse keeps connections nigh.
Hooray — the stream stays warm and light! 🎉

🚥 Pre-merge checks: ✅ 5 passed

  • Title check ✅ Passed: The title clearly and specifically describes the main change: adding a heartbeat mechanism to the Job endpoint to prevent proxy timeouts.
  • Description check ✅ Passed: The PR description is comprehensive, covering the problem statement, solution, changes, and test plan; however, the checklist section in the description template is not filled out.
  • Linked Issues check ✅ Passed: The code changes directly address the linked issue #5184 by implementing a 30-second heartbeat mechanism to prevent reverse-proxy timeouts, resolving 504 Gateway Timeout errors caused by proxies tracking idle time on HTTP/2 DATA frames.
  • Out of Scope Changes check ✅ Passed: All changes are directly scoped to the heartbeat implementation objective: the EventChan() method exposes the channel for select-based event handling, and sendJobsLoop/sendHeartbeat implement the heartbeat logic as designed.
  • Docstring Coverage ✅ Passed: No functions found in the changed files to evaluate docstring coverage; the check was skipped.

Contributor

Copilot AI left a comment

Pull request overview

Adds a periodic heartbeat to the /management.ManagementService/Job gRPC streaming endpoint to prevent idle reverse-proxy timeouts (e.g., Traefik/Cloudflare/Nginx) when no jobs are pending.

Changes:

  • Exposes the underlying job event channel via EventChan() for select-based consumption.
  • Updates sendJobsLoop to multiplex between job events, context cancellation, and a heartbeat ticker.
  • Introduces sendHeartbeat to send an encrypted empty JobRequest periodically.

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 4 comments.

  • management/server/job/channel.go: Adds EventChan() to allow non-blocking select over job events.
  • management/internals/shared/grpc/server.go: Implements periodic heartbeats in the Job stream loop and adds sendHeartbeat().


Contributor

@coderabbitai coderabbitai Bot left a comment

Actionable comments posted: 1

🤖 Fix all issues with AI agents
In `@management/internals/shared/grpc/server.go`:
- Around lines 392-399: the update-handling loop silences sendJob errors by returning nil. Propagate the actual error, as sendHeartbeat already does: when s.sendJob(ctx, peerKey, event, srv) returns an error, log it and return that error (not nil) so the gRPC stream surfaces failures to clients and operators. Update the `case event := <-updates.EventChan():` handling to return the error from s.sendJob while keeping the existing logging, so the behavior stays consistent with sendHeartbeat and the existing TODOs around error handling.
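
In code terms, the requested change is one line inside the select loop's event case (a fragment, not a complete function; the identifiers follow the comment above, and the log call assumes a logrus-style logger):

case event := <-updates.EventChan():
	if err := s.sendJob(ctx, peerKey, event, srv); err != nil {
		log.WithContext(ctx).Errorf("failed to send job to peer %s: %v", peerKey, err)
		return err // was `return nil`, which hid the failure from the stream
	}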

- Extract heartbeat interval to named constant (jobStreamHeartbeatInterval)
- Return error from sendJob instead of nil for proper error propagation
- Reset ticker after successful job send to avoid unnecessary heartbeats
- Handle context.Canceled and io.EOF in sendHeartbeat to preserve
  proper shutdown semantics instead of wrapping as Internal error
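
A sketch of the error classification that last bullet describes; the stream interface and message type below are stand-ins, and only the context.Canceled / io.EOF handling is taken from the commit message:

package mgmt

import (
	"context"
	"errors"
	"io"

	"google.golang.org/grpc/codes"
	"google.golang.org/grpc/status"
)

// Encrypted and jobStream are stand-ins for the generated gRPC stream
// and NetBird's encrypted message type.
type Encrypted struct{}
type jobStream interface{ Send(*Encrypted) error }

func sendHeartbeat(ctx context.Context, srv jobStream) error {
	// In the real code the payload is an encrypted, empty JobRequest.
	if err := srv.Send(&Encrypted{}); err != nil {
		// Normal shutdown paths keep their identity so the caller can
		// close the stream cleanly instead of reporting a server fault.
		if errors.Is(err, context.Canceled) || errors.Is(err, io.EOF) {
			return err
		}
		return status.Errorf(codes.Internal, "failed to send heartbeat: %v", err)
	}
	return nil
}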
Contributor

Copilot AI left a comment


Copilot encountered an error and was unable to review this pull request. You can try again by re-requesting a review.

@golem131

Confirming this bug in production with HAProxy in front of Management

Running NetBird 0.68.3 self-hosted (management + signal + relay + router peer all on 0.68.3). HAProxy 3.3 fronts the management backend over HTTP/2 with alpn h2,http/1.1
and http-reuse always. HAProxy config has timeout server 3600s on the management backend — the commonly recommended value for gRPC.

Symptom is perfectly deterministic:

  1. Router peer container starts at time T.
  2. A management.ManagementService/Sync long-lived server-streaming RPC is established through HAProxy.
  3. At exactly T + 3600s (within ~100 ms), HAProxy closes the stream with termination state sD-- (server-side inactivity timeout, data phase). HAProxy log excerpt:

[dd/Mon/yyyy:hh:mm:ss.sss] https~ netbird_mgmt/... 0/0/0/113/3600114 200 5120 - - sD-- ...
"POST /management.ManagementService/Sync HTTP/2.0"
  4. TCP connection from the router to HAProxy stays ESTABLISHED post-event (verified with ss). HAProxy just closed the HTTP/2 stream, not the TCP conn.
  5. Peer goes "offline N min ago" in dashboard and never recovers — stays offline indefinitely until netbird service restart on the router. No amount of waiting helps.

The underlying issue appears to be on the client side: the management gRPC Sync client in shared/management/client/grpc.go doesn't recreate the grpc.ClientConn when the
stream is closed by the proxy — it reuses the (now zombie) conn, and new Sync RPCs opened on it either hang or return immediately in a failure loop, so the outer
backoff.Retry never actually recovers. This pattern was fixed for the telemetry flow client in #5750 (close+re-dial on codes.Internal/RST_STREAM, drop WaitForReady,
reset backoff on success) but that fix was not ported to the management Sync client, so any reverse proxy with a finite grpc_read_timeout / timeout server / idle timeout
triggers this.
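
For illustration, the close-and-re-dial pattern from #5750 that the comment references looks roughly like this; the dial call, credentials, and backoff settings are assumptions, not NetBird's actual client code:

package client

import (
	"context"

	"github.com/cenkalti/backoff/v4"
	"google.golang.org/grpc"
	"google.golang.org/grpc/credentials/insecure"
)

// syncWithRedial dials a fresh ClientConn on every attempt, so a stream
// killed by a proxy cannot strand a zombie connection across retries.
// runSync stands in for running the long-lived Sync stream until error.
func syncWithRedial(ctx context.Context, addr string, runSync func(*grpc.ClientConn) error) error {
	op := func() error {
		conn, err := grpc.NewClient(addr, grpc.WithTransportCredentials(insecure.NewCredentials()))
		if err != nil {
			return err
		}
		// Closing the conn on any error forces the next attempt to
		// re-dial instead of reusing a dead connection.
		defer conn.Close()
		return runSync(conn)
	}
	// Exponential backoff between failed attempts; stops once op
	// succeeds or ctx is cancelled.
	return backoff.Retry(op, backoff.WithContext(backoff.NewExponentialBackOff(), ctx))
}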

Collaborator

pappz commented Apr 15, 2026

@golem131 #5750 was created for a special case: when the gRPC stream (not the connection) was broken, e.g. due to a protocol error, the original code attempted to reuse the same connection. With that patch, any stream-related error forces the connection itself to be recreated.

The long-lived reverse proxy configuration is a different topic. The preferred approach is for the protocol to manage keep-alive signals, not the application layer. Of course, if the reverse proxy is not configured for long-lived, silent connections, then it becomes a problem.

We plan to migrate the stream reconnection logic to the Signal and Management servers in the future.
