Add NETBIRD_SKIP_MIGRATIONS env var to gate AutoMigrate (HA / multi-master)#5925
dfarrel1 wants to merge 1 commit into netbirdio:main
Conversation
📝 Walkthrough: Adds conditional skipping of database migrations when the `NETBIRD_SKIP_MIGRATIONS` environment variable is set to `true`.
Pull request overview
This PR introduces an operator-controlled switch (NETBIRD_SKIP_MIGRATIONS=true) to skip GORM-driven schema migrations, intended to support HA / multi-master Postgres deployments where concurrent DDL on startup is unsafe.
Changes:
- Gate main management DB migrations by wiring `NETBIRD_SKIP_MIGRATIONS` into `store.NewStore(...)` at boot.
- Gate activity/events DB migrations (custom `migrate(...)` + `AutoMigrate(...)`) behind `NETBIRD_SKIP_MIGRATIONS`.
- Additional unrelated runtime behavior changes: request-context cancellation logging, removal of `WithContext(ctx)` from several DB queries, and a change to account request buffering/backpressure behavior.
Reviewed changes
Copilot reviewed 5 out of 5 changed files in this pull request and generated 9 comments.
Show a summary per file
| File | Description |
|---|---|
| management/internals/server/boot.go | Passes NETBIRD_SKIP_MIGRATIONS into store initialization (main DB migration gating). |
| management/server/activity/store/sql_store.go | Skips activity store migration + AutoMigrate when NETBIRD_SKIP_MIGRATIONS=true. |
| management/server/telemetry/http_api_metrics.go | Adds context.AfterFunc logging when request context is canceled mid-flight. |
| management/server/store/sql_store.go | Removes .WithContext(ctx) usage from multiple gorm query sites touched in the diff. |
| management/server/account_request_buffer.go | Changes request buffering to block without honoring caller context cancellation. |
Comments suppressed due to low confidence (1)
management/server/store/sql_store.go:5553
`getClusterCapability` now runs the aggregate query without `WithContext(ctx)`, so caller cancellation/timeouts won’t abort the query. If the rest of the store is intended to be request-context-aware, please restore `s.db.WithContext(ctx)` here (and in other affected queries in this PR).
```go
err := s.db.
	Model(&proxy.Proxy{}).
	Select("COUNT(CASE WHEN "+column+" IS NOT NULL THEN 1 END) > 0 AS has_capability, "+
		"COALESCE(MAX(CASE WHEN "+column+" = true THEN 1 ELSE 0 END), 0) = 1 AS any_true").
	Where("cluster_address = ? AND status = ? AND last_seen > ?",
		clusterAddr, "connected", time.Now().Add(-proxyActiveThreshold)).
	Scan(&result).Error
```
```go
handlerDone := make(chan struct{})
context.AfterFunc(ctx, func() {
	select {
	case <-handlerDone:
	default:
		log.Debugf("HTTP request context canceled mid-flight: %v %v (reqID=%s, after %v, cause: %v)",
			r.Method, r.URL.Path, reqID, time.Since(reqStart), context.Cause(ctx))
	}
})

h.ServeHTTP(w, r.WithContext(ctx))
close(handlerDone)
```
context.AfterFunc returns a stop function; since you always close handlerDone after ServeHTTP, the callback will still be scheduled and run on normal request completion (after the request context is canceled by net/http), doing a select and returning. Consider capturing the returned stop func and calling it once the handler finishes to avoid the extra callback/goroutine work and to let captured request data be GC’d earlier. Also consider defer close(handlerDone) so handlerDone is closed even if h.ServeHTTP panics, preventing the AfterFunc callback from thinking the handler is still in-flight.
Stale after rebase — this file is no longer in the PR diff. At review time the branch was behind upstream main, so changes from upstream #5879 (context cancel monitoring) showed as additions here. Current head 00f3b929 is rebased onto current main and the PR scope is only the two migration-gating files.
```diff
-	select {
-	case <-ctx.Done():
-		return nil, ctx.Err()
-	case ac.getAccountRequestCh <- req:
-	}
+	ac.getAccountRequestCh <- req

-	select {
-	case <-ctx.Done():
-		return nil, ctx.Err()
-	case result := <-req.ResultChan:
-		log.WithContext(ctx).Tracef("got account with backpressure after %s", time.Since(startTime))
-		return result.Account, result.Err
-	}
+	result := <-req.ResultChan
+	log.WithContext(ctx).Tracef("got account with backpressure after %s", time.Since(startTime))
+	return result.Account, result.Err
```
GetAccountWithBackpressure no longer respects ctx.Done() while sending the request or waiting for the result. Since getAccountRequestCh is unbuffered, this can deadlock if the processing goroutine has exited (e.g., the buffer's background ctx was canceled during shutdown) or if the request is canceled while the send is blocked; similarly, waiting on ResultChan can block forever if processing never completes. Please restore the select { case <-ctx.Done(): ... } handling for both the send and the receive (or otherwise ensure non-blocking behavior under cancellation).
Stale after rebase — account_request_buffer.go isn't in this PR. At review time the branch was behind upstream main, so unrelated upstream changes showed as additions. Current head 00f3b929 is rebased onto current main and the PR scope is only the two migration-gating files.
```go
		return nil, fmt.Errorf("events auto migrate: %w", err)
	}
} else {
	log.WithContext(ctx).Info("NETBIRD_SKIP_MIGRATIONS=true; skipping events database migration")
```
When NETBIRD_SKIP_MIGRATIONS=true, this path skips both migrate(...) and db.AutoMigrate(...), but the log message only mentions “skipping events database migration”. Consider updating the message to explicitly mention that AutoMigrate/schema updates are skipped as well, to reduce operator confusion during upgrades.
```diff
-log.WithContext(ctx).Info("NETBIRD_SKIP_MIGRATIONS=true; skipping events database migration")
+log.WithContext(ctx).Info("NETBIRD_SKIP_MIGRATIONS=true; skipping events database migration and AutoMigrate/schema updates")
```
Already applied in the current head — the log line now reads: NETBIRD_SKIP_MIGRATIONS=true and activity store engine is Postgres; skipping events database migration and AutoMigrate/schema updates. Thanks for the catch.
```diff
 if err := s.db.Model(&domain.Domain{}).Count(&total).Error; err != nil {
 	return 0, 0, err
 }
-if err := s.db.WithContext(ctx).Model(&domain.Domain{}).Where("validated = ?", true).Count(&validated).Error; err != nil {
+if err := s.db.Model(&domain.Domain{}).Where("validated = ?", true).Count(&validated).Error; err != nil {
 	return 0, 0, err
```
This change drops s.db.WithContext(ctx) for the DB queries in this method, meaning request cancellation/deadlines in ctx will no longer be propagated to the database driver for these counts. If that was unintended, please reintroduce WithContext(ctx) (or ensure the store consistently applies context to DB operations).
Stale after rebase — management/server/store/sql_store.go isn't in this PR. At review time the branch was behind upstream main, so unrelated upstream changes to that file showed as additions. Current head 00f3b929 is rebased onto current main and the PR scope is only the two migration-gating files.
```diff
 baseQuery := s.db.
 	Model(&accesslogs.AccessLogEntry{}).
 	Where(accountIDCondition, accountID)

 baseQuery = s.applyAccessLogFilters(baseQuery, filter)

 if err := baseQuery.Count(&totalCount).Error; err != nil {
 	log.WithContext(ctx).Errorf("failed to count access logs: %v", err)
 	return nil, 0, status.Errorf(status.Internal, "failed to count access logs")
 }

-query := s.db.WithContext(ctx).
+query := s.db.
 	Where(accountIDCondition, accountID)

 query = s.applyAccessLogFilters(query, filter)
```
GetAccountAccessLogs now builds/counts/executes queries without WithContext(ctx), so request cancellation/deadlines won’t interrupt potentially expensive count/search queries. If this was accidental, please restore WithContext(ctx) on baseQuery/query (and similarly in DeleteOldAccessLogs).
Stale after rebase — management/server/store/sql_store.go isn't in this PR (only management/server/activity/store/sql_store.go is). Review was against an older merge base; current head 00f3b929 scopes the PR to only the two migration-gating files.
```diff
 	result := s.db.Save(p)
 	if result.Error != nil {
 		log.WithContext(ctx).Errorf("failed to save proxy: %v", result.Error)
 		return status.Errorf(status.Internal, "failed to save proxy")
 	}
 	return nil
 }

 // UpdateProxyHeartbeat updates the last_seen timestamp for a proxy or creates a new entry if it doesn't exist
 func (s *SqlStore) UpdateProxyHeartbeat(ctx context.Context, proxyID, clusterAddress, ipAddress string) error {
 	now := time.Now()

-	result := s.db.WithContext(ctx).
+	result := s.db.
 		Model(&proxy.Proxy{}).
 		Where("id = ? AND status = ?", proxyID, "connected").
 		Update("last_seen", now)
```
These proxy persistence queries (SaveProxy / UpdateProxyHeartbeat) no longer attach ctx to the gorm DB handle. That prevents cancellation/timeouts from propagating to the DB driver and can make shutdown/slow-DB scenarios harder to control. Consider restoring s.db.WithContext(ctx) for these operations.
Stale after rebase — not in this PR's diff. Current head 00f3b929 scopes the PR to only the two migration-gating files.
```go
store, err := store.NewStore(context.Background(), s.Config.StoreConfig.Engine, s.Config.Datadir, s.Metrics(), os.Getenv("NETBIRD_SKIP_MIGRATIONS") == "true")
if err != nil {
```
The PR description says the change is limited to gating AutoMigrate at two production call sites, but this PR also includes unrelated behavioral changes in other areas (e.g., request buffering, telemetry middleware, and multiple store query changes). Please either scope the PR down to just the migration gating or update the PR description to cover these additional changes and their rationale, since they materially affect runtime behavior.
The other files noted (telemetry middleware, account_request_buffer.go, management/server/store/sql_store.go) aren't actually in this PR. They were appearing in the diff because the branch was behind upstream main at review time, so changes from upstream #5879 (context cancel monitoring) showed as additions. Current head 00f3b929 is rebased onto current main; the diff is now scoped to the two migration-gating files as described in the PR body.
```diff
 // GetProxyAccessTokenByHashedToken retrieves a proxy access token by its hashed value.
 func (s *SqlStore) GetProxyAccessTokenByHashedToken(ctx context.Context, lockStrength LockingStrength, hashedToken types.HashedProxyToken) (*types.ProxyAccessToken, error) {
-	tx := s.db.WithContext(ctx)
+	tx := s.db
```
GetProxyAccessTokenByHashedToken no longer uses WithContext(ctx) when building tx. This changes behavior by preventing ctx cancellation/deadlines from aborting the DB query (and differs from code that uses db.WithContext(ctx) in other parts of the repo). Consider restoring tx := s.db.WithContext(ctx) here.
```diff
-tx := s.db
+tx := s.db.WithContext(ctx)
```
Stale after rebase — not in this PR's diff. Current head 00f3b929 scopes the PR to only the two migration-gating files.
```diff
 	if result := s.db.Create(token); result.Error != nil {
 		return status.Errorf(status.Internal, "save proxy access token: %v", result.Error)
 	}
 	return nil
 }

 // RevokeProxyAccessToken revokes a proxy access token by its ID.
 func (s *SqlStore) RevokeProxyAccessToken(ctx context.Context, tokenID string) error {
-	result := s.db.WithContext(ctx).Model(&types.ProxyAccessToken{}).Where(idQueryCondition, tokenID).Update("revoked", true)
+	result := s.db.Model(&types.ProxyAccessToken{}).Where(idQueryCondition, tokenID).Update("revoked", true)
 	if result.Error != nil {
 		return status.Errorf(status.Internal, "revoke proxy access token: %v", result.Error)
 	}

 	if result.RowsAffected == 0 {
 		return status.Errorf(status.NotFound, "proxy access token not found")
 	}

 	return nil
 }

 // MarkProxyAccessTokenUsed updates the last used timestamp for a proxy access token.
 func (s *SqlStore) MarkProxyAccessTokenUsed(ctx context.Context, tokenID string) error {
-	result := s.db.WithContext(ctx).Model(&types.ProxyAccessToken{}).
+	result := s.db.Model(&types.ProxyAccessToken{}).
 		Where(idQueryCondition, tokenID).
 		Update("last_used", time.Now().UTC())
```
These proxy access token write operations (Create / Update) also dropped WithContext(ctx), so a canceled request can still block on DB I/O and continue mutating state after the caller is gone. Please consider reattaching ctx via s.db.WithContext(ctx) for these operations as well.
Stale after rebase — not in this PR's diff. Current head 00f3b929 scopes the PR to only the two migration-gating files.
Actionable comments posted: 4
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@management/server/account_request_buffer.go`:
- Around line 66-70: The code currently does a blocking send to
ac.getAccountRequestCh and a blocking receive from req.ResultChan which ignores
the caller's ctx; change both operations to non-blocking selects that also
listen on ctx.Done(): when enqueuing (ac.getAccountRequestCh <- req) use select
{ case ac.getAccountRequestCh <- req: ... case <-ctx.Done(): return
nil/ctx.Err() } and when waiting for result (result := <-req.ResultChan) use
select { case result := <-req.ResultChan: ... case <-ctx.Done(): return
nil/ctx.Err() } so the function returns promptly on caller cancellation; keep
the Tracef timing/logging only when a real result is received and ensure any
returned error conveys the context cancellation.
In `@management/server/activity/store/sql_store.go`:
- Around line 65-75: The code currently skips migrate() and db.AutoMigrate()
whenever NETBIRD_SKIP_MIGRATIONS=="true", which also prevents table creation for
the default SQLite activity store; change the guard so migrations are only
skipped for the shared Postgres activity store (or when the initialized engine
indicates Postgres). Concretely, use the activity store engine
(NB_ACTIVITY_EVENT_STORE_ENGINE or the variable that holds the chosen engine)
and replace the single env check with a combined condition like: if
NETBIRD_SKIP_MIGRATIONS=="true" && engineIsPostgres { skip } else { run
migrate(ctx, fieldEncrypt, db) and db.AutoMigrate(&activity.Event{},
&activity.DeletedUser{}) }. Ensure references to migrate, db.AutoMigrate,
activity.Event and activity.DeletedUser remain unchanged.
In `@management/server/store/sql_store.go`:
- Around line 1020-1023: The DB calls dropped GORM context propagation by
starting chains from raw s.db; update each affected chain (e.g., the Count calls
using s.db.Model(&domain.Domain{}), and the other locations listed) to begin
with s.db.WithContext(ctx) before adding .Model(...), .Where(...), .Count(...),
.Updates(...), etc.; search for occurrences of "s.db.Model(" and similar chains
in sql_store.go (notably around the domain.Domain{} Count sites and the ranges
called out) and prefix them with WithContext(ctx) so every chain becomes
s.db.WithContext(ctx).Model(...).Where(...).Count/... or
s.db.WithContext(ctx).Where(...).Updates(...) accordingly.
In `@management/server/telemetry/http_api_metrics.go`:
- Around line 186-197: The AfterFunc callback set up with context.AfterFunc is
never stopped, so it will be scheduled for every normal request return; capture
the stopper returned by context.AfterFunc (e.g., stop := context.AfterFunc(ctx,
...)) and invoke stop() when the handler finishes to cancel the pending callback
(call stop() just after h.ServeHTTP(w, r.WithContext(ctx)) and before/after
closing handlerDone), leaving the existing handlerDone close and logging logic
intact; reference symbols: context.AfterFunc, stop (new variable), handlerDone,
h.ServeHTTP, reqID, reqStart, ctx.
```diff
-	select {
-	case <-ctx.Done():
-		return nil, ctx.Err()
-	case ac.getAccountRequestCh <- req:
-	}
+	ac.getAccountRequestCh <- req

-	select {
-	case <-ctx.Done():
-		return nil, ctx.Err()
-	case result := <-req.ResultChan:
-		log.WithContext(ctx).Tracef("got account with backpressure after %s", time.Since(startTime))
-		return result.Account, result.Err
-	}
+	result := <-req.ResultChan
+	log.WithContext(ctx).Tracef("got account with backpressure after %s", time.Since(startTime))
+	return result.Account, result.Err
```
Restore cancellation while enqueueing and waiting.
This now ignores the caller’s HTTP/gRPC context. If the buffer loop has exited, Line 66 can block forever; if the request deadline expires while waiting, Line 68 still keeps the caller stuck past cancellation.
Proposed fix

```diff
-	ac.getAccountRequestCh <- req
-
-	result := <-req.ResultChan
-	log.WithContext(ctx).Tracef("got account with backpressure after %s", time.Since(startTime))
-	return result.Account, result.Err
+	select {
+	case ac.getAccountRequestCh <- req:
+	case <-ctx.Done():
+		return nil, ctx.Err()
+	}
+
+	select {
+	case result := <-req.ResultChan:
+		log.WithContext(ctx).Tracef("got account with backpressure after %s", time.Since(startTime))
+		return result.Account, result.Err
+	case <-ctx.Done():
+		return nil, ctx.Err()
+	}
```
```diff
 if err := s.db.Model(&domain.Domain{}).Count(&total).Error; err != nil {
 	return 0, 0, err
 }
-if err := s.db.WithContext(ctx).Model(&domain.Domain{}).Where("validated = ?", true).Count(&validated).Error; err != nil {
+if err := s.db.Model(&domain.Domain{}).Where("validated = ?", true).Count(&validated).Error; err != nil {
```
Restore GORM context propagation for these DB calls.
These methods still accept ctx, but the changed GORM roots now use raw s.db, so request cancellation/deadlines no longer reach the database driver. Please keep the existing pattern by starting each chain with s.db.WithContext(ctx) before adding clauses/filters/updates.
🔎 Verify the changed ranges no longer start GORM chains from raw s.db. Expected result after the fix: no output for these changed ranges.

```bash
#!/bin/bash
python - <<'PY'
from pathlib import Path
path = Path("management/server/store/sql_store.go")
lines = path.read_text().splitlines()
ranges = [
    (1018, 1026),
    (4443, 4459),
    (4462, 4475),
    (4478, 4483),
    (4486, 4497),
    (4500, 4513),
    (5169, 5182),
    (5185, 5197),
    (5292, 5340),
    (5343, 5354),
    (5432, 5439),
    (5442, 5468),
    (5474, 5489),
    (5536, 5557),
    (5567, 5584),
]
for start, end in ranges:
    for lineno in range(start, end + 1):
        line = lines[lineno - 1]
        if "s.db" in line and "WithContext(ctx)" not in line:
            print(f"{path}:{lineno}: {line.rstrip()}")
PY
```

♻️ Representative fix pattern

```diff
-tx := s.db
+tx := s.db.WithContext(ctx)

-result := s.db.
+result := s.db.WithContext(ctx).

-if err := s.db.Model(&domain.Domain{}).Count(&total).Error; err != nil {
+if err := s.db.WithContext(ctx).Model(&domain.Domain{}).Count(&total).Error; err != nil {
```

Also applies to: 4445-4451, 4464-4470, 4480-4480, 4488-4488, 4502-4504, 5171-5177, 5187-5193, 5297-5308, 5345-5347, 5434-5434, 5446-5465, 5478-5482, 5547-5553, 5571-5573
```go
handlerDone := make(chan struct{})
context.AfterFunc(ctx, func() {
	select {
	case <-handlerDone:
	default:
		log.Debugf("HTTP request context canceled mid-flight: %v %v (reqID=%s, after %v, cause: %v)",
			r.Method, r.URL.Path, reqID, time.Since(reqStart), context.Cause(ctx))
	}
})

h.ServeHTTP(w, r.WithContext(ctx))
close(handlerDone)
```
🧩 Analysis chain (collapsed): per the `context` package documentation, `context.AfterFunc(ctx, f)` (added in Go 1.21) arranges to call `f` in its own goroutine after `ctx` is done, and returns a `stop` function that returns `true` if it prevented `f` from running. Per the `net/http` documentation, the server cancels an incoming request's context when the handler returns (as well as on client disconnect), so without `stop()` the callback would be scheduled on every normal request completion.
Stop the cancellation callback after normal handler completion.
context.AfterFunc returns a stopper function that prevents the callback from being invoked. Since the HTTP server cancels the request context when the handler returns, this callback will be scheduled for every normal request completion. Although the callback exits immediately via the closed handlerDone channel, this still creates unnecessary per-request goroutine allocation on the hot path. Capturing and calling the stop function prevents this overhead while preserving mid-flight cancellation logging.
♻️ Proposed fix

```diff
 handlerDone := make(chan struct{})
-context.AfterFunc(ctx, func() {
+stopCancelLog := context.AfterFunc(ctx, func() {
 	select {
 	case <-handlerDone:
 	default:
 		log.Debugf("HTTP request context canceled mid-flight: %v %v (reqID=%s, after %v, cause: %v)",
 			r.Method, r.URL.Path, reqID, time.Since(reqStart), context.Cause(ctx))
 	}
 })
 h.ServeHTTP(w, r.WithContext(ctx))
 close(handlerDone)
+stopCancelLog()
```

📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
```go
handlerDone := make(chan struct{})
stopCancelLog := context.AfterFunc(ctx, func() {
	select {
	case <-handlerDone:
	default:
		log.Debugf("HTTP request context canceled mid-flight: %v %v (reqID=%s, after %v, cause: %v)",
			r.Method, r.URL.Path, reqID, time.Since(reqStart), context.Cause(ctx))
	}
})
h.ServeHTTP(w, r.WithContext(ctx))
close(handlerDone)
stopCancelLog()
```
Force-pushed from 1310407 to a724e2f
Operators running multi-master Postgres logical replication (Spock / pgactive / BDR) need to control which node runs DDL during a coordinated upgrade. Logical replication replicates DML but not DDL; uncoordinated AutoMigrate across sites races on DDL execution and can break the cluster. This change exposes the existing skipMigration bool parameter on store.NewStore via env var at the production call site (management/internals/server/boot.go), and adds equivalent gating to the activity store's NewSqlStore (which had no parameter at all). Default behavior is unchanged: env var unset -> AutoMigrate runs. Single-instance deployments are entirely unaffected. Geolocation is intentionally not gated (uses SQLite per-instance, not multi-master). Relates to: netbirdio#1584
Force-pushed from a724e2f to 00f3b92
Addressed the two actionable items from this round of review in the amended commit (force-pushed — previous unrelated commits in the diff were a branch-base issue; the branch was cut from an older main).

Log message wording (per @copilot): now reads `NETBIRD_SKIP_MIGRATIONS=true and activity store engine is Postgres; skipping events database migration and AutoMigrate/schema updates`.

SQLite safety (per @coderabbitai, good catch): the skip is now gated on

```go
skipMigrations := os.Getenv("NETBIRD_SKIP_MIGRATIONS") == "true" &&
	os.Getenv(storeEngineEnv) == string(types.PostgresStoreEngine)
```

Other comments in this round concern files that are no longer in the PR diff after the rebase.
The branch-base explanation also makes sense — rebasing onto current main resolved the unrelated changes in the diff.



What
Adds support for an opt-in env var `NETBIRD_SKIP_MIGRATIONS=true` that suppresses `gorm.AutoMigrate()` calls at the two production call sites:

- `management/internals/server/boot.go` — main store (`netbird` DB)
- `management/server/activity/store/sql_store.go` — activity / events store (`netbird_events` DB)

Default behavior is unchanged: env var unset → AutoMigrate runs. Single-instance deployments are entirely unaffected.
The internal `skipMigration bool` plumbing already exists in `store.NewStore` and `store.NewSqlStore` — every production caller hardcodes `false`. This PR exposes the existing hook via env var and adds equivalent gating to the activity store, which had no `skipMigration` parameter at all.

Why
Operators running multi-master Postgres logical replication (Spock, pgactive, BDR) need to control which node runs DDL during a coordinated upgrade. Logical replication replicates DML but not DDL — if every node runs `AutoMigrate` simultaneously on startup of a new version, races on DDL execution and partial-schema replication conflicts can break the cluster.

The standard pattern (documented by pgEdge: "Managing DDL Migrations in a Multi-master Database") is to gate ORM auto-migrations behind an env var and run them from one node only during an upgrade window.
This is opt-in. Single-instance Netbird users do not encounter the env var and see no behavior change. Only operators who explicitly choose multi-master (a topology Netbird does not officially support — see #1584) will set it.
Affected paths
- `management/internals/server/boot.go` (1 line + 1 import)
- `management/server/activity/store/sql_store.go` (5 lines, 1 log message — `os` already imported)

Geolocation / GeoIP

`management/server/geolocation/database.go`: `importCsvToSqlite` also calls `AutoMigrate`, but uses SQLite per-instance for GeoIP CSV import — not shared across sites — so it does not need gating for multi-master.

Tests
Existing tests pass `skipMigration: false` explicitly in calls to `NewStore` / `NewSqlStore`. The new env-var read in `boot.go` is at the production call site only (not test code), so test behavior is preserved. The activity-store gate is internal to `NewSqlStore`; tests that need to run against a fresh DB simply leave `NETBIRD_SKIP_MIGRATIONS` unset (default).

Risk
These risks already exist in any multi-master Postgres deployment. This PR gives operators a tool to manage them; it does not create them.
Out of scope
Relates to