Skip to content

Add NETBIRD_SKIP_MIGRATIONS env var to gate AutoMigrate (HA / multi-master)#5925

Open
dfarrel1 wants to merge 1 commit intonetbirdio:mainfrom
dfarrel1:skip-migrations-env-var
Open

Add NETBIRD_SKIP_MIGRATIONS env var to gate AutoMigrate (HA / multi-master)#5925
dfarrel1 wants to merge 1 commit intonetbirdio:mainfrom
dfarrel1:skip-migrations-env-var

Conversation

@dfarrel1
Copy link
Copy Markdown

@dfarrel1 dfarrel1 commented Apr 19, 2026

What

Adds support for an opt-in env var NETBIRD_SKIP_MIGRATIONS=true that suppresses gorm.AutoMigrate() calls at the two production call sites:

  • management/internals/server/boot.go — main store (netbird DB)
  • management/server/activity/store/sql_store.go — activity / events store (netbird_events DB)

Default behavior is unchanged: env var unset → AutoMigrate runs. Single-instance deployments are entirely unaffected.

The internal skipMigration bool plumbing already exists in store.NewStore and store.NewSqlStore — every production caller hardcodes false. This PR exposes the existing hook via env var and adds equivalent gating to the activity store, which had no skipMigration parameter at all.

Why

Operators running multi-master Postgres logical replication (Spock, pgactive, BDR) need to control which node runs DDL during a coordinated upgrade. Logical replication replicates DML but not DDL — if every node runs AutoMigrate simultaneously on startup of a new version, races on DDL execution and partial-schema replication conflicts can break the cluster.

The standard pattern (documented by pgEdge: "Managing DDL Migrations in a Multi-master Database") is to gate ORM auto-migrations behind an env var and run them from one node only during an upgrade window.

This is opt-in. Single-instance Netbird users do not encounter the env var and see no behavior change. Only operators who explicitly choose multi-master (a topology Netbird does not officially support — see #1584) will set it.

Affected paths

  • management/internals/server/boot.go (1 line + 1 import)
  • management/server/activity/store/sql_store.go (5 lines, 1 log message — os already imported)

Geolocation / GeoIP

management/server/geolocation/database.go:importCsvToSqlite also calls AutoMigrate, but uses SQLite per-instance for GeoIP CSV import — not shared across sites — so it does not need gating for multi-master.

Tests

Existing tests pass skipMigration: false explicitly in calls to NewStore / NewSqlStore. The new env-var read in boot.go is at the production call site only (not test code), so test behavior is preserved. The activity-store gate is internal to NewSqlStore; tests that need to run against a fresh DB simply leave NETBIRD_SKIP_MIGRATIONS unset (default).

Risk

  • Operator misuse on single-instance: setting the env var on a single-node deployment skips schema upgrades; queries against new columns fail. Mitigation: the activity-store path logs an info message at startup when the flag is active; documentation should call out that this var is only for multi-master operators.
  • Forgotten flag: operator sets it for an upgrade, forgets to unset. Mitigation: the recommended deployment pattern is a wrapper entrypoint that sets the env var per-startup based on a marker file (so persistent env-var state is not the norm). Out of scope for this PR.

These risks already exist in any multi-master Postgres deployment. This PR gives operators a tool to manage them; it does not create them.

Out of scope

  • Documenting multi-master HA. That conversation belongs in Netbird Management HA #1584.
  • A higher-level "upgrade orchestration" feature inside the management server. The env var keeps orchestration responsibility with the operator's tooling (Ansible / k8s / etc.).

Relates to


Summary by CodeRabbit

  • New Features
    • Added support for the NETBIRD_SKIP_MIGRATIONS environment variable: when set to "true", startup will skip database schema migrations (applies to activity store with Postgres engine).
    • Emits an informational log when migrations are intentionally skipped.

Copilot AI review requested due to automatic review settings April 19, 2026 16:03
@CLAassistant
Copy link
Copy Markdown

CLAassistant commented Apr 19, 2026

CLA assistant check
All committers have signed the CLA.

@coderabbitai
Copy link
Copy Markdown
Contributor

coderabbitai Bot commented Apr 19, 2026

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 2b1c84ab-2d0f-41a0-891a-b0588fff7ae6

📥 Commits

Reviewing files that changed from the base of the PR and between a724e2f and 00f3b92.

📒 Files selected for processing (2)
  • management/internals/server/boot.go
  • management/server/activity/store/sql_store.go
🚧 Files skipped from review as they are similar to previous changes (2)
  • management/server/activity/store/sql_store.go
  • management/internals/server/boot.go

📝 Walkthrough

Walkthrough

Adds conditional skipping of database migrations when the NETBIRD_SKIP_MIGRATIONS environment variable is set to "true". The server boot path and the activity SQL store both check the variable and avoid running migration/AutoMigrate steps in that case.

Changes

Cohort / File(s) Summary
Server boot / Store init
management/internals/server/boot.go
BaseServer.Store() now reads NETBIRD_SKIP_MIGRATIONS (via os.Getenv) and passes a boolean to store.NewStore(...) indicating whether to skip migrations. Added os import.
Activity store migrations
management/server/activity/store/sql_store.go
NewSqlStore checks NETBIRD_SKIP_MIGRATIONS and, when it equals "true" and engine is PostgresStoreEngine, skips migrate(ctx, ...) and db.AutoMigrate(...), emitting an info log instead of running migrations.

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~10 minutes

Poem

🐰 I sniff the env and twitch my nose,

If "true" it says, no migration goes.
Tables nap quiet, no schema alarm,
I hop away happy, keeping the charm. 🥕

🚥 Pre-merge checks | ✅ 3
✅ Passed checks (3 passed)
Check name Status Explanation
Title check ✅ Passed The title accurately summarizes the main change: adding an environment variable to gate AutoMigrate calls for multi-master deployments, which is the core objective of the PR.
Description check ✅ Passed The description is comprehensive and well-structured, covering what was changed, why it was needed, affected paths, tests, risks, and scope clarifications.
Docstring Coverage ✅ Passed Docstring coverage is 100.00% which is sufficient. The required threshold is 80.00%.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR introduces an operator-controlled switch (NETBIRD_SKIP_MIGRATIONS=true) to skip GORM-driven schema migrations, intended to support HA / multi-master Postgres deployments where concurrent DDL on startup is unsafe.

Changes:

  • Gate main management DB migrations by wiring NETBIRD_SKIP_MIGRATIONS into store.NewStore(...) at boot.
  • Gate activity/events DB migrations (custom migrate(...) + AutoMigrate(...)) behind NETBIRD_SKIP_MIGRATIONS.
  • Additional unrelated runtime behavior changes: request-context cancellation logging, removal of WithContext(ctx) from several DB queries, and a change to account request buffering/backpressure behavior.

Reviewed changes

Copilot reviewed 5 out of 5 changed files in this pull request and generated 9 comments.

Show a summary per file
File Description
management/internals/server/boot.go Passes NETBIRD_SKIP_MIGRATIONS into store initialization (main DB migration gating).
management/server/activity/store/sql_store.go Skips activity store migration + AutoMigrate when NETBIRD_SKIP_MIGRATIONS=true.
management/server/telemetry/http_api_metrics.go Adds context.AfterFunc logging when request context is canceled mid-flight.
management/server/store/sql_store.go Removes .WithContext(ctx) usage from multiple gorm query sites touched in the diff.
management/server/account_request_buffer.go Changes request buffering to block without honoring caller context cancellation.
Comments suppressed due to low confidence (1)

management/server/store/sql_store.go:5553

  • getClusterCapability now runs the aggregate query without WithContext(ctx), so caller cancellation/timeouts won’t abort the query. If the rest of the store is intended to be request-context-aware, please restore s.db.WithContext(ctx) here (and in other affected queries in this PR).
	err := s.db.
		Model(&proxy.Proxy{}).
		Select("COUNT(CASE WHEN "+column+" IS NOT NULL THEN 1 END) > 0 AS has_capability, "+
			"COALESCE(MAX(CASE WHEN "+column+" = true THEN 1 ELSE 0 END), 0) = 1 AS any_true").
		Where("cluster_address = ? AND status = ? AND last_seen > ?",
			clusterAddr, "connected", time.Now().Add(-proxyActiveThreshold)).
		Scan(&result).Error

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment on lines +186 to +197
handlerDone := make(chan struct{})
context.AfterFunc(ctx, func() {
select {
case <-handlerDone:
default:
log.Debugf("HTTP request context canceled mid-flight: %v %v (reqID=%s, after %v, cause: %v)",
r.Method, r.URL.Path, reqID, time.Since(reqStart), context.Cause(ctx))
}
})

h.ServeHTTP(w, r.WithContext(ctx))
close(handlerDone)
Copy link

Copilot AI Apr 19, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

context.AfterFunc returns a stop function; since you always close handlerDone after ServeHTTP, the callback will still be scheduled and run on normal request completion (after the request context is canceled by net/http), doing a select and returning. Consider capturing the returned stop func and calling it once the handler finishes to avoid the extra callback/goroutine work and to let captured request data be GC’d earlier. Also consider defer close(handlerDone) so handlerDone is closed even if h.ServeHTTP panics, preventing the AfterFunc callback from thinking the handler is still in-flight.

Copilot uses AI. Check for mistakes.
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Stale after rebase — this file is no longer in the PR diff. At review time the branch was behind upstream main, so changes from upstream #5879 (context cancel monitoring) showed as additions here. Current head 00f3b929 is rebased onto current main and the PR scope is only the two migration-gating files.

Comment on lines +66 to +70
ac.getAccountRequestCh <- req

select {
case <-ctx.Done():
return nil, ctx.Err()
case ac.getAccountRequestCh <- req:
}

select {
case <-ctx.Done():
return nil, ctx.Err()
case result := <-req.ResultChan:
log.WithContext(ctx).Tracef("got account with backpressure after %s", time.Since(startTime))
return result.Account, result.Err
}
result := <-req.ResultChan
log.WithContext(ctx).Tracef("got account with backpressure after %s", time.Since(startTime))
return result.Account, result.Err
Copy link

Copilot AI Apr 19, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

GetAccountWithBackpressure no longer respects ctx.Done() while sending the request or waiting for the result. Since getAccountRequestCh is unbuffered, this can deadlock if the processing goroutine has exited (e.g., the buffer's background ctx was canceled during shutdown) or if the request is canceled while the send is blocked; similarly, waiting on ResultChan can block forever if processing never completes. Please restore the select { case <-ctx.Done(): ... } handling for both the send and the receive (or otherwise ensure non-blocking behavior under cancellation).

Copilot uses AI. Check for mistakes.
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Stale after rebase — account_request_buffer.go isn't in this PR. At review time the branch was behind upstream main, so unrelated upstream changes showed as additions. Current head 00f3b929 is rebased onto current main and the PR scope is only the two migration-gating files.

return nil, fmt.Errorf("events auto migrate: %w", err)
}
} else {
log.WithContext(ctx).Info("NETBIRD_SKIP_MIGRATIONS=true; skipping events database migration")
Copy link

Copilot AI Apr 19, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

When NETBIRD_SKIP_MIGRATIONS=true, this path skips both migrate(...) and db.AutoMigrate(...), but the log message only mentions “skipping events database migration”. Consider updating the message to explicitly mention that AutoMigrate/schema updates are skipped as well, to reduce operator confusion during upgrades.

Suggested change
log.WithContext(ctx).Info("NETBIRD_SKIP_MIGRATIONS=true; skipping events database migration")
log.WithContext(ctx).Info("NETBIRD_SKIP_MIGRATIONS=true; skipping events database migration and AutoMigrate/schema updates")

Copilot uses AI. Check for mistakes.
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Already applied in the current head — the log line now reads: NETBIRD_SKIP_MIGRATIONS=true and activity store engine is Postgres; skipping events database migration and AutoMigrate/schema updates. Thanks for the catch.

Comment on lines +1020 to 1024
if err := s.db.Model(&domain.Domain{}).Count(&total).Error; err != nil {
return 0, 0, err
}
if err := s.db.WithContext(ctx).Model(&domain.Domain{}).Where("validated = ?", true).Count(&validated).Error; err != nil {
if err := s.db.Model(&domain.Domain{}).Where("validated = ?", true).Count(&validated).Error; err != nil {
return 0, 0, err
Copy link

Copilot AI Apr 19, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This change drops s.db.WithContext(ctx) for the DB queries in this method, meaning request cancellation/deadlines in ctx will no longer be propagated to the database driver for these counts. If that was unintended, please reintroduce WithContext(ctx) (or ensure the store consistently applies context to DB operations).

Copilot uses AI. Check for mistakes.
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Stale after rebase — management/server/store/sql_store.go isn't in this PR. At review time the branch was behind upstream main, so unrelated upstream changes to that file showed as additions. Current head 00f3b929 is rebased onto current main and the PR scope is only the two migration-gating files.

Comment on lines +5297 to 5311
baseQuery := s.db.
Model(&accesslogs.AccessLogEntry{}).
Where(accountIDCondition, accountID)

baseQuery = s.applyAccessLogFilters(baseQuery, filter)

if err := baseQuery.Count(&totalCount).Error; err != nil {
log.WithContext(ctx).Errorf("failed to count access logs: %v", err)
return nil, 0, status.Errorf(status.Internal, "failed to count access logs")
}

query := s.db.WithContext(ctx).
query := s.db.
Where(accountIDCondition, accountID)

query = s.applyAccessLogFilters(query, filter)
Copy link

Copilot AI Apr 19, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

GetAccountAccessLogs now builds/counts/executes queries without WithContext(ctx), so request cancellation/deadlines won’t interrupt potentially expensive count/search queries. If this was accidental, please restore WithContext(ctx) on baseQuery/query (and similarly in DeleteOldAccessLogs).

Copilot uses AI. Check for mistakes.
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Stale after rebase — management/server/store/sql_store.go isn't in this PR (only management/server/activity/store/sql_store.go is). Review was against an older merge base; current head 00f3b929 scopes the PR to only the two migration-gating files.

Comment on lines +5434 to 5449
result := s.db.Save(p)
if result.Error != nil {
log.WithContext(ctx).Errorf("failed to save proxy: %v", result.Error)
return status.Errorf(status.Internal, "failed to save proxy")
}
return nil
}

// UpdateProxyHeartbeat updates the last_seen timestamp for a proxy or creates a new entry if it doesn't exist
func (s *SqlStore) UpdateProxyHeartbeat(ctx context.Context, proxyID, clusterAddress, ipAddress string) error {
now := time.Now()

result := s.db.WithContext(ctx).
result := s.db.
Model(&proxy.Proxy{}).
Where("id = ? AND status = ?", proxyID, "connected").
Update("last_seen", now)
Copy link

Copilot AI Apr 19, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These proxy persistence queries (SaveProxy / UpdateProxyHeartbeat) no longer attach ctx to the gorm DB handle. That prevents cancellation/timeouts from propagating to the DB driver and can make shutdown/slow-DB scenarios harder to control. Consider restoring s.db.WithContext(ctx) for these operations.

Copilot uses AI. Check for mistakes.
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Stale after rebase — not in this PR's diff. Current head 00f3b929 scopes the PR to only the two migration-gating files.

Comment on lines +64 to 65
store, err := store.NewStore(context.Background(), s.Config.StoreConfig.Engine, s.Config.Datadir, s.Metrics(), os.Getenv("NETBIRD_SKIP_MIGRATIONS") == "true")
if err != nil {
Copy link

Copilot AI Apr 19, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The PR description says the change is limited to gating AutoMigrate at two production call sites, but this PR also includes unrelated behavioral changes in other areas (e.g., request buffering, telemetry middleware, and multiple store query changes). Please either scope the PR down to just the migration gating or update the PR description to cover these additional changes and their rationale, since they materially affect runtime behavior.

Copilot uses AI. Check for mistakes.
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The other files noted (telemetry middleware, account_request_buffer.go, management/server/store/sql_store.go) aren't actually in this PR. They were appearing in the diff because the branch was behind upstream main at review time, so changes from upstream #5879 (context cancel monitoring) showed as additions. Current head 00f3b929 is rebased onto current main; the diff is now scoped to the two migration-gating files as described in the PR body.

// GetProxyAccessTokenByHashedToken retrieves a proxy access token by its hashed value.
func (s *SqlStore) GetProxyAccessTokenByHashedToken(ctx context.Context, lockStrength LockingStrength, hashedToken types.HashedProxyToken) (*types.ProxyAccessToken, error) {
tx := s.db.WithContext(ctx)
tx := s.db
Copy link

Copilot AI Apr 19, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

GetProxyAccessTokenByHashedToken no longer uses WithContext(ctx) when building tx. This changes behavior by preventing ctx cancellation/deadlines from aborting the DB query (and differs from code that uses db.WithContext(ctx) in other parts of the repo). Consider restoring tx := s.db.WithContext(ctx) here.

Suggested change
tx := s.db
tx := s.db.WithContext(ctx)

Copilot uses AI. Check for mistakes.
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Stale after rebase — not in this PR's diff. Current head 00f3b929 scopes the PR to only the two migration-gating files.

Comment on lines +4480 to 4504
if result := s.db.Create(token); result.Error != nil {
return status.Errorf(status.Internal, "save proxy access token: %v", result.Error)
}
return nil
}

// RevokeProxyAccessToken revokes a proxy access token by its ID.
func (s *SqlStore) RevokeProxyAccessToken(ctx context.Context, tokenID string) error {
result := s.db.WithContext(ctx).Model(&types.ProxyAccessToken{}).Where(idQueryCondition, tokenID).Update("revoked", true)
result := s.db.Model(&types.ProxyAccessToken{}).Where(idQueryCondition, tokenID).Update("revoked", true)
if result.Error != nil {
return status.Errorf(status.Internal, "revoke proxy access token: %v", result.Error)
}

if result.RowsAffected == 0 {
return status.Errorf(status.NotFound, "proxy access token not found")
}

return nil
}

// MarkProxyAccessTokenUsed updates the last used timestamp for a proxy access token.
func (s *SqlStore) MarkProxyAccessTokenUsed(ctx context.Context, tokenID string) error {
result := s.db.WithContext(ctx).Model(&types.ProxyAccessToken{}).
result := s.db.Model(&types.ProxyAccessToken{}).
Where(idQueryCondition, tokenID).
Update("last_used", time.Now().UTC())
Copy link

Copilot AI Apr 19, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These proxy access token write operations (Create / Update) also dropped WithContext(ctx), so a canceled request can still block on DB I/O and continue mutating state after the caller is gone. Please consider reattaching ctx via s.db.WithContext(ctx) for these operations as well.

Copilot uses AI. Check for mistakes.
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Stale after rebase — not in this PR's diff. Current head 00f3b929 scopes the PR to only the two migration-gating files.

Copy link
Copy Markdown
Contributor

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 4

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@management/server/account_request_buffer.go`:
- Around line 66-70: The code currently does a blocking send to
ac.getAccountRequestCh and a blocking receive from req.ResultChan which ignores
the caller's ctx; change both operations to non-blocking selects that also
listen on ctx.Done(): when enqueuing (ac.getAccountRequestCh <- req) use select
{ case ac.getAccountRequestCh <- req: ... case <-ctx.Done(): return
nil/ctx.Err() } and when waiting for result (result := <-req.ResultChan) use
select { case result := <-req.ResultChan: ... case <-ctx.Done(): return
nil/ctx.Err() } so the function returns promptly on caller cancellation; keep
the Tracef timing/logging only when a real result is received and ensure any
returned error conveys the context cancellation.

In `@management/server/activity/store/sql_store.go`:
- Around line 65-75: The code currently skips migrate() and db.AutoMigrate()
whenever NETBIRD_SKIP_MIGRATIONS=="true", which also prevents table creation for
the default SQLite activity store; change the guard so migrations are only
skipped for the shared Postgres activity store (or when the initialized engine
indicates Postgres). Concretely, use the activity store engine
(NB_ACTIVITY_EVENT_STORE_ENGINE or the variable that holds the chosen engine)
and replace the single env check with a combined condition like: if
NETBIRD_SKIP_MIGRATIONS=="true" && engineIsPostgres { skip } else { run
migrate(ctx, fieldEncrypt, db) and db.AutoMigrate(&activity.Event{},
&activity.DeletedUser{}) }. Ensure references to migrate, db.AutoMigrate,
activity.Event and activity.DeletedUser remain unchanged.

In `@management/server/store/sql_store.go`:
- Around line 1020-1023: The DB calls dropped GORM context propagation by
starting chains from raw s.db; update each affected chain (e.g., the Count calls
using s.db.Model(&domain.Domain{}), and the other locations listed) to begin
with s.db.WithContext(ctx) before adding .Model(...), .Where(...), .Count(...),
.Updates(...), etc.; search for occurrences of "s.db.Model(" and similar chains
in sql_store.go (notably around the domain.Domain{} Count sites and the ranges
called out) and prefix them with WithContext(ctx) so every chain becomes
s.db.WithContext(ctx).Model(...).Where(...).Count/... or
s.db.WithContext(ctx).Where(...).Updates(...) accordingly.

In `@management/server/telemetry/http_api_metrics.go`:
- Around line 186-197: The AfterFunc callback set up with context.AfterFunc is
never stopped, so it will be scheduled for every normal request return; capture
the stopper returned by context.AfterFunc (e.g., stop := context.AfterFunc(ctx,
...)) and invoke stop() when the handler finishes to cancel the pending callback
(call stop() just after h.ServeHTTP(w, r.WithContext(ctx)) and before/after
closing handlerDone), leaving the existing handlerDone close and logging logic
intact; reference symbols: context.AfterFunc, stop (new variable), handlerDone,
h.ServeHTTP, reqID, reqStart, ctx.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 858aa99d-c2d6-4bf7-a97f-d74358dccb4c

📥 Commits

Reviewing files that changed from the base of the PR and between 8ae8f20 and 1310407.

📒 Files selected for processing (5)
  • management/internals/server/boot.go
  • management/server/account_request_buffer.go
  • management/server/activity/store/sql_store.go
  • management/server/store/sql_store.go
  • management/server/telemetry/http_api_metrics.go

Comment on lines +66 to +70
ac.getAccountRequestCh <- req

select {
case <-ctx.Done():
return nil, ctx.Err()
case ac.getAccountRequestCh <- req:
}

select {
case <-ctx.Done():
return nil, ctx.Err()
case result := <-req.ResultChan:
log.WithContext(ctx).Tracef("got account with backpressure after %s", time.Since(startTime))
return result.Account, result.Err
}
result := <-req.ResultChan
log.WithContext(ctx).Tracef("got account with backpressure after %s", time.Since(startTime))
return result.Account, result.Err
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major

Restore cancellation while enqueueing and waiting.

This now ignores the caller’s HTTP/gRPC context. If the buffer loop has exited, Line 66 can block forever; if the request deadline expires while waiting, Line 68 still keeps the caller stuck past cancellation.

Proposed fix
-	ac.getAccountRequestCh <- req
-
-	result := <-req.ResultChan
-	log.WithContext(ctx).Tracef("got account with backpressure after %s", time.Since(startTime))
-	return result.Account, result.Err
+	select {
+	case ac.getAccountRequestCh <- req:
+	case <-ctx.Done():
+		return nil, ctx.Err()
+	}
+
+	select {
+	case result := <-req.ResultChan:
+		log.WithContext(ctx).Tracef("got account with backpressure after %s", time.Since(startTime))
+		return result.Account, result.Err
+	case <-ctx.Done():
+		return nil, ctx.Err()
+	}
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@management/server/account_request_buffer.go` around lines 66 - 70, The code
currently does a blocking send to ac.getAccountRequestCh and a blocking receive
from req.ResultChan which ignores the caller's ctx; change both operations to
non-blocking selects that also listen on ctx.Done(): when enqueuing
(ac.getAccountRequestCh <- req) use select { case ac.getAccountRequestCh <- req:
... case <-ctx.Done(): return nil/ctx.Err() } and when waiting for result
(result := <-req.ResultChan) use select { case result := <-req.ResultChan: ...
case <-ctx.Done(): return nil/ctx.Err() } so the function returns promptly on
caller cancellation; keep the Tracef timing/logging only when a real result is
received and ensure any returned error conveys the context cancellation.

Comment thread management/server/activity/store/sql_store.go Outdated
Comment on lines +1020 to +1023
if err := s.db.Model(&domain.Domain{}).Count(&total).Error; err != nil {
return 0, 0, err
}
if err := s.db.WithContext(ctx).Model(&domain.Domain{}).Where("validated = ?", true).Count(&validated).Error; err != nil {
if err := s.db.Model(&domain.Domain{}).Where("validated = ?", true).Count(&validated).Error; err != nil {
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major

Restore GORM context propagation for these DB calls.

These methods still accept ctx, but the changed GORM roots now use raw s.db, so request cancellation/deadlines no longer reach the database driver. Please keep the existing pattern by starting each chain with s.db.WithContext(ctx) before adding clauses/filters/updates.

🔎 Verify the changed ranges no longer start GORM chains from raw s.db

Expected result after the fix: no output for these changed ranges.

#!/bin/bash
python - <<'PY'
from pathlib import Path

path = Path("management/server/store/sql_store.go")
lines = path.read_text().splitlines()

ranges = [
    (1018, 1026),
    (4443, 4459),
    (4462, 4475),
    (4478, 4483),
    (4486, 4497),
    (4500, 4513),
    (5169, 5182),
    (5185, 5197),
    (5292, 5340),
    (5343, 5354),
    (5432, 5439),
    (5442, 5468),
    (5474, 5489),
    (5536, 5557),
    (5567, 5584),
]

for start, end in ranges:
    for lineno in range(start, end + 1):
        line = lines[lineno - 1]
        if "s.db" in line and "WithContext(ctx)" not in line:
            print(f"{path}:{lineno}: {line.rstrip()}")
PY
♻️ Representative fix pattern
- tx := s.db
+ tx := s.db.WithContext(ctx)

- result := s.db.
+ result := s.db.WithContext(ctx).

- if err := s.db.Model(&domain.Domain{}).Count(&total).Error; err != nil {
+ if err := s.db.WithContext(ctx).Model(&domain.Domain{}).Count(&total).Error; err != nil {

Also applies to: 4445-4451, 4464-4470, 4480-4480, 4488-4488, 4502-4504, 5171-5177, 5187-5193, 5297-5308, 5345-5347, 5434-5434, 5446-5465, 5478-5482, 5547-5553, 5571-5573

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@management/server/store/sql_store.go` around lines 1020 - 1023, The DB calls
dropped GORM context propagation by starting chains from raw s.db; update each
affected chain (e.g., the Count calls using s.db.Model(&domain.Domain{}), and
the other locations listed) to begin with s.db.WithContext(ctx) before adding
.Model(...), .Where(...), .Count(...), .Updates(...), etc.; search for
occurrences of "s.db.Model(" and similar chains in sql_store.go (notably around
the domain.Domain{} Count sites and the ranges called out) and prefix them with
WithContext(ctx) so every chain becomes
s.db.WithContext(ctx).Model(...).Where(...).Count/... or
s.db.WithContext(ctx).Where(...).Updates(...) accordingly.

Comment on lines +186 to +197
handlerDone := make(chan struct{})
context.AfterFunc(ctx, func() {
select {
case <-handlerDone:
default:
log.Debugf("HTTP request context canceled mid-flight: %v %v (reqID=%s, after %v, cause: %v)",
r.Method, r.URL.Path, reqID, time.Since(reqStart), context.Cause(ctx))
}
})

h.ServeHTTP(w, r.WithContext(ctx))
close(handlerDone)
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🌐 Web query:

Go context.AfterFunc documentation returned stop function behavior and whether the callback runs in its own goroutine after ctx is done

💡 Result:

The context.AfterFunc(ctx Context, f func) (stop func bool) function, added in Go 1.21.0, arranges to call f in its own goroutine after ctx is done (canceled or timed out). If ctx is already done when AfterFunc is called, it calls f immediately in its own goroutine. The returned stop function stops the association of ctx with f. Calling stop returns true if it stopped f from being run. It returns false if either: - the context is done and f has already been started in its own goroutine, or - f was already stopped previously. The stop function does not wait for f to complete before returning. Multiple calls to AfterFunc on the same context operate independently. If the context implements an AfterFunc(func) func bool method, it is used instead of the default behavior.

Citations:


🏁 Script executed:

# Check if the file exists and examine the specific lines
if [ -f "management/server/telemetry/http_api_metrics.go" ]; then
  echo "=== File exists, checking lines 186-197 ==="
  sed -n '186,197p' management/server/telemetry/http_api_metrics.go
  echo ""
  echo "=== Context: lines 180-205 ==="
  sed -n '180,205p' management/server/telemetry/http_api_metrics.go
else
  echo "File not found. Searching for similar files..."
  find . -name "*http_api*" -o -name "*metrics*" | head -20
fi

Repository: netbirdio/netbird

Length of output: 1262


🏁 Script executed:

# Also search for context.AfterFunc usage in the codebase
echo "=== Searching for context.AfterFunc usage ==="
rg "context\.AfterFunc" --type go -B2 -A5

Repository: netbirdio/netbird

Length of output: 1195


🏁 Script executed:

# Get the function signature and broader context
echo "=== Function containing this code ==="
sed -n '160,210p' management/server/telemetry/http_api_metrics.go | head -60

Repository: netbirdio/netbird

Length of output: 1600


🏁 Script executed:

# Find the function name by searching backwards from line 186
rg -B 50 "context\.AfterFunc" management/server/telemetry/http_api_metrics.go | grep "func " | tail -1

Repository: netbirdio/netbird

Length of output: 124


🏁 Script executed:

# Check if this middleware is used in request handling
grep -n "HTTPMiddleware\|http_api_metrics" management/server/telemetry/http_api_metrics.go | head -20

Repository: netbirdio/netbird

Length of output: 440


🌐 Web query:

Go HTTP server request context cancellation behavior when handler returns

💡 Result:

In Go's net/http package, the request context obtained via http.Request.Context is canceled under three conditions for incoming server requests: when the client's connection closes, when the request is canceled (e.g., via HTTP/2 RST_STREAM), or when the handler's ServeHTTP method returns. This cancellation after the handler returns serves to terminate any background goroutines or lingering operations (like context.AfterFunc callbacks or database connections) that use the request context, preventing resource leaks. Handlers should not rely on the context remaining valid after returning, and any post-handler access to r.Context will observe it as canceled. This behavior is unchanged as of Go 1.25 (latest available documentation).

Citations:


Stop the cancellation callback after normal handler completion.

context.AfterFunc returns a stopper function that prevents the callback from being invoked. Since the HTTP server cancels the request context when the handler returns, this callback will be scheduled for every normal request completion. Although the callback exits immediately via the closed handlerDone channel, this still creates unnecessary per-request goroutine allocation on the hot path. Capturing and calling the stop function prevents this overhead while preserving mid-flight cancellation logging.

♻️ Proposed fix
 	handlerDone := make(chan struct{})
-	context.AfterFunc(ctx, func() {
+	stopCancelLog := context.AfterFunc(ctx, func() {
 		select {
 		case <-handlerDone:
 		default:
 			log.Debugf("HTTP request context canceled mid-flight: %v %v (reqID=%s, after %v, cause: %v)",
 				r.Method, r.URL.Path, reqID, time.Since(reqStart), context.Cause(ctx))
 		}
 	})
 
 	h.ServeHTTP(w, r.WithContext(ctx))
 	close(handlerDone)
+	stopCancelLog()
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
handlerDone := make(chan struct{})
context.AfterFunc(ctx, func() {
select {
case <-handlerDone:
default:
log.Debugf("HTTP request context canceled mid-flight: %v %v (reqID=%s, after %v, cause: %v)",
r.Method, r.URL.Path, reqID, time.Since(reqStart), context.Cause(ctx))
}
})
h.ServeHTTP(w, r.WithContext(ctx))
close(handlerDone)
handlerDone := make(chan struct{})
stopCancelLog := context.AfterFunc(ctx, func() {
select {
case <-handlerDone:
default:
log.Debugf("HTTP request context canceled mid-flight: %v %v (reqID=%s, after %v, cause: %v)",
r.Method, r.URL.Path, reqID, time.Since(reqStart), context.Cause(ctx))
}
})
h.ServeHTTP(w, r.WithContext(ctx))
close(handlerDone)
stopCancelLog()
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@management/server/telemetry/http_api_metrics.go` around lines 186 - 197, The
AfterFunc callback set up with context.AfterFunc is never stopped, so it will be
scheduled for every normal request return; capture the stopper returned by
context.AfterFunc (e.g., stop := context.AfterFunc(ctx, ...)) and invoke stop()
when the handler finishes to cancel the pending callback (call stop() just after
h.ServeHTTP(w, r.WithContext(ctx)) and before/after closing handlerDone),
leaving the existing handlerDone close and logging logic intact; reference
symbols: context.AfterFunc, stop (new variable), handlerDone, h.ServeHTTP,
reqID, reqStart, ctx.

@dfarrel1 dfarrel1 force-pushed the skip-migrations-env-var branch from 1310407 to a724e2f Compare April 19, 2026 16:19
Operators running multi-master Postgres logical replication (Spock /
pgactive / BDR) need to control which node runs DDL during a coordinated
upgrade. Logical replication replicates DML but not DDL; uncoordinated
AutoMigrate across sites races on DDL execution and can break the cluster.

This change exposes the existing skipMigration bool parameter on
store.NewStore via env var at the production call site
(management/internals/server/boot.go), and adds equivalent gating to
the activity store's NewSqlStore (which had no parameter at all).

Default behavior is unchanged: env var unset -> AutoMigrate runs.
Single-instance deployments are entirely unaffected.

Geolocation is intentionally not gated (uses SQLite per-instance, not
multi-master).

Relates to: netbirdio#1584
@dfarrel1 dfarrel1 force-pushed the skip-migrations-env-var branch from a724e2f to 00f3b92 Compare April 19, 2026 16:37
@sonarqubecloud
Copy link
Copy Markdown

@dfarrel1
Copy link
Copy Markdown
Author

Addressed the two actionable items from this round of review in the amended commit (force-pushed — previous unrelated commits in the diff were a branch-base issue; the branch was cut from v0.68.3, which after the tag had its commits rebased on main under different SHAs, so GitHub was showing them as "new"; rebasing onto current main removed them).

Log message wording (per @copilot): now reads "NETBIRD_SKIP_MIGRATIONS=true and activity store engine is Postgres; skipping events database migration and AutoMigrate/schema updates" — explicitly mentions both migrate(...) and AutoMigrate(...) being suppressed.

SQLite safety (per @coderabbitai, good catch): the skip is now gated on NB_ACTIVITY_EVENT_STORE_ENGINE == postgres in addition to NETBIRD_SKIP_MIGRATIONS=true. A per-node SQLite activity store (the default) still creates its schema on fresh nodes, which is what you'd want even in an HA deployment where the main store is Postgres multi-master but events are kept per-site:

skipMigrations := os.Getenv("NETBIRD_SKIP_MIGRATIONS") == "true" &&
    os.Getenv(storeEngineEnv) == string(types.PostgresStoreEngine)

Other comments on account_request_buffer.go, sql_store.go, http_api_metrics.go, and telemetry/http_api_metrics.go were on upstream commits that were incorrectly included in the diff due to the branch-base issue — not changes this PR is proposing. They should no longer appear after the force-push.

@coderabbitai
Copy link
Copy Markdown
Contributor

coderabbitai Bot commented Apr 19, 2026

Tip

For best results, initiate chat on the files or code changes.

@dfarrel1 Thanks for the thorough follow-up! Both fixes look correct:

  • Log message: The explicit mention of both migrate(...) and AutoMigrate(...) being skipped makes the operator-facing log much clearer about what is and isn't running.

  • SQLite safety gate: The compound condition is exactly right. Gating on storeEngineEnv == postgres ensures that a per-node SQLite activity store (the default, and the common case in mixed HA topologies) is never accidentally left schema-less on a fresh node — even if an operator sets NETBIRD_SKIP_MIGRATIONS=true globally. The behaviour is now:

    NETBIRD_SKIP_MIGRATIONS NB_ACTIVITY_EVENT_STORE_ENGINE Activity store migrations run?
    true postgres ❌ skipped
    true anything else (SQLite default) ✅ runs
    unset / false any ✅ runs

    That's the right matrix for the HA use case this PR targets.

The branch-base explanation also makes sense — rebasing onto current main after a tag rebase is a known GitHub diff-noise scenario; glad those unrelated files are out of the picture now.


🧠 Learnings used
Learnt from: lixmal
Repo: netbirdio/netbird PR: 5530
File: shared/management/http/api/openapi.yml:3018-3025
Timestamp: 2026-03-07T13:27:59.705Z
Learning: For PR netbirdio/netbird#5530, keep the flat ServiceRequest schema in shared/management/http/api/openapi.yml for this PR; plan a follow-up to split HTTP vs L4 request shapes. Server-side already validates per-mode constraints; OpenAPI cleanup is deferred.

Learnt from: pappz
Repo: netbirdio/netbird PR: 5512
File: client/internal/metrics/push.go:167-181
Timestamp: 2026-03-13T09:34:36.929Z
Learning: In netbirdio/netbird PR `#5512`, the `NB_METRICS_INTERVAL` env var (`p.envInterval`) in `client/internal/metrics/push.go` does NOT bypass remote config version gating (`isVersionInRange`) or the `RefreshIfNeeded` nil check. The earlier documentation describing this override behavior was deprecated. Do not flag the `resolve()` function for failing to skip version/config checks when `envInterval > 0` — the current behavior is intentional.

Learnt from: lixmal
Repo: netbirdio/netbird PR: 5530
File: management/server/store/store.go:447-452
Timestamp: 2026-03-07T14:48:07.384Z
Learning: In netbirdio/netbird (management/server/store/store.go), `idx_services_domain` was intentionally dropped in PR `#5530` to support L4 services sharing the cluster domain. HTTP domain uniqueness is enforced at the application level via `checkDomainAvailable()` running inside a transaction; a partial/filtered unique index scoped to HTTP services is planned as a future follow-up for PostgreSQL.

Learnt from: lixmal
Repo: netbirdio/netbird PR: 5530
File: management/server/store/sql_store.go:5161-5188
Timestamp: 2026-03-07T13:27:08.681Z
Learning: In netbirdio/netbird (PR `#5530`), store methods GetServicesByClusterAndPort and GetServicesByCluster in management/server/store/sql_store.go are intentionally lightweight for L4 port-collision checks: they should not Preload("Targets") or decrypt sensitive fields; selecting only id, account_id, proxy_cluster, mode, listen_port is acceptable to minimize overhead.

Learnt from: lixmal
Repo: netbirdio/netbird PR: 5587
File: shared/management/http/api/openapi.yml:3151-3171
Timestamp: 2026-03-13T06:22:59.436Z
Learning: Repo netbirdio/netbird PR `#5587`: For reverse proxy header authentication, HeaderAuthConfig.value is required in requests and cleared in responses. On update, Manager.preserveExistingAuthSecrets() calls preserveHeaderAuthHashes() before validateHeaderAuthValues() (management/internals/modules/reverseproxy/service/manager/manager.go Lines ~555–597), so sending value: "" preserves the existing hashed secret; providing a non-empty value rotates it.

Learnt from: lixmal
Repo: netbirdio/netbird PR: 5722
File: go.mod:45-46
Timestamp: 2026-03-30T08:59:37.316Z
Learning: In netbirdio/netbird (go.mod), github.com/crowdsecurity/crowdsec is intentionally pinned to v1.7.7-rc1 (a release candidate) because it contains security fixes reported by Snyk that are absent from the latest stable release v1.7.6. Do not flag this RC pin as a stability concern during code review.

Learnt from: lixmal
Repo: netbirdio/netbird PR: 5698
File: management/server/account.go:2549-2595
Timestamp: 2026-04-08T06:14:34.119Z
Learning: In `management/server/account.go` (netbirdio/netbird, PR `#5698`), `updatePeerIPv6InTransaction` intentionally does NOT apply the same host-part filtering that `AllocateRandomPeerIPv6()` uses (e.g., avoiding the all-zero host / subnet-router anycast address per RFC 4291 §2.6.1). IPv6 has no broadcast address, and subnet-router anycast is irrelevant for overlay addresses. The allocator avoids these addresses by convention only; an admin explicitly setting a specific peer IPv6 via the API is permitted to use any valid address within the account's v6 network range. Do not flag the absence of allocator-style host-part validation in `updatePeerIPv6InTransaction` as a bug.

Learnt from: lixmal
Repo: netbirdio/netbird PR: 5826
File: client/internal/portforward/manager.go:35-38
Timestamp: 2026-04-08T14:55:48.251Z
Learning: In `client/internal/portforward/manager.go` (netbirdio/netbird), persisting permanent-lease (ttl=0/UPnP error 725 fallback) port mapping state for crash-recovery cleanup is intentionally deferred to a follow-up PR. The blocker is that `State.Cleanup` requires NAT gateway re-discovery via `nat.DiscoverGateway`, which can block startup for ~10 seconds when no gateway is present, affecting all clients. The TODO comment at line ~35 documents this constraint. Do not flag the missing state persistence for permanent leases as a blocking issue in this PR.

Learnt from: lixmal
Repo: netbirdio/netbird PR: 5891
File: util/capture/filter.go:164-183
Timestamp: 2026-04-15T10:26:03.137Z
Learning: In `util/capture/filter.go` (netbirdio/netbird, PR `#5891`), `parseIPv4Info` does not guard L4 port extraction on fragment offset == 0. Non-initial IPv4 fragments will have their payload bytes misread as src/dst ports, which can cause spurious matches for port-based filters. The fix (checking `binary.BigEndian.Uint16(data[6:8]) & 0x1fff == 0` before reading ports) was deferred to a future PR. Do not treat the missing fragment-offset guard as a blocking issue for PR `#5891`.

Learnt from: pappz
Repo: netbirdio/netbird PR: 5888
File: client/android/client.go:200-220
Timestamp: 2026-04-14T17:54:56.869Z
Learning: In netbirdio/netbird, `addStateFile()` in `client/internal/debug/debug.go` calls `NewServiceManager("").GetStatePath()` which resolves to `/var/lib/netbird/state.json` (the Linux default). On Android, this path does not exist; the real state file path is provided by `PlatformFiles.StateFilePath()`. The fix is to add a `StatePath` field to `GeneratorDependencies` and thread it through `addStateFile()` with a fallback to the `ServiceManager` lookup when empty. This is a pre-existing issue in the upstream debug.go (not introduced by PR `#5888`) and is deferred to a follow-up PR.

Learnt from: lixmal
Repo: netbirdio/netbird PR: 5530
File: management/server/store/sql_store.go:5002-5006
Timestamp: 2026-03-07T14:50:25.550Z
Learning: In netbirdio/netbird PR `#5530`, reverse-proxy Service.Mode is created with gorm:"default:'http'" and migration backfills existing rows, so Mode should never be NULL after migration. Therefore, GetHTTPServiceByDomain intentionally checks for mode = '' OR mode = 'http' and does not need a mode IS NULL clause.

If you found this review helpful, would you consider giving us a shout-out on X?

Thank you for using CodeRabbit!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants