[management] Fix/delete groups without lock#5012
Conversation
📝 WalkthroughWalkthroughAdds deferred duration logging to gRPC handlers, reduces lock strength and early-returns in group deletion when nothing to delete, and introduces per-store transaction timeouts with deadline-aware logging and cancellation in the SQL store. Changes
Sequence Diagram(s)sequenceDiagram
autonumber
participant Client
participant GRPC as gRPC Handler (Sync/Login)
participant Store as SqlStore.ExecuteInTransaction
participant DB as Database (tx)
Client->>GRPC: Send Sync/Login request
GRPC->>GRPC: record reqStart (deferred duration log)
GRPC->>Store: ExecuteInTransaction(ctx, fn)
note right of Store `#f0f7ff`: create timeoutCtx (s.transactionTimeout)
Store->>DB: BEGIN (with timeoutCtx)
DB-->>Store: tx
Store->>DB: run operations using tx
alt completes before timeout
DB-->>Store: operation success
Store->>DB: COMMIT
DB-->>Store: commit success
Store-->>GRPC: return success
else deadline exceeded
DB-->>Store: context deadline exceeded / error
Store->>Store: log warning + stack trace (elapsed)
Store-->>GRPC: return error
end
GRPC->>GRPC: emit deferred duration log
GRPC-->>Client: respond
Estimated code review effort🎯 3 (Moderate) | ⏱️ ~25 minutes Possibly related PRs
Poem
Pre-merge checks and finishing touches❌ Failed checks (2 warnings)
✅ Passed checks (1 passed)
✨ Finishing touches
📜 Recent review detailsConfiguration used: defaults Review profile: CHILL Plan: Pro 📒 Files selected for processing (2)
🚧 Files skipped from review as they are similar to previous changes (2)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (24)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
There was a problem hiding this comment.
Actionable comments posted: 1
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
management/server/group.go (1)
428-443: Potential TOCTOU race condition with None locking.Lowering the locking strength from Update to None during group validation introduces a time-of-check to time-of-use window. Between validation (line 436) and deletion (line 449), another transaction could link the group to a resource, policy, or other entity. The validation would pass, but the group should no longer be deletable.
Consider whether this race condition is acceptable in your system, or if you should restore at least Read locking during validation.
📜 Review details
Configuration used: defaults
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (3)
management/internals/shared/grpc/server.gomanagement/server/group.gomanagement/server/store/sql_store.go
🧰 Additional context used
🧬 Code graph analysis (3)
management/internals/shared/grpc/server.go (1)
monotime/time.go (1)
Since(33-35)
management/server/group.go (1)
management/server/store/store.go (1)
LockingStrengthNone(47-47)
management/server/store/sql_store.go (1)
management/server/telemetry/store_metrics.go (1)
StoreMetrics(11-18)
🪛 GitHub Check: Management / Benchmark (API) (amd64, postgres)
management/server/store/sql_store.go
[failure] 40-40:
"github.com/netbirdio/netbird/shared/sshauth" imported and not used
🪛 GitHub Check: Management / Benchmark (API) (amd64, sqlite)
management/server/store/sql_store.go
[failure] 40-40:
"github.com/netbirdio/netbird/shared/sshauth" imported and not used
🪛 GitHub Check: Management / Integration (amd64, postgres)
management/server/store/sql_store.go
[failure] 40-40:
"github.com/netbirdio/netbird/shared/sshauth" imported and not used
🪛 GitHub Check: Management / Integration (amd64, sqlite)
management/server/store/sql_store.go
[failure] 40-40:
"github.com/netbirdio/netbird/shared/sshauth" imported and not used
🪛 GitHub Check: Management / Unit (amd64, sqlite)
management/server/store/sql_store.go
[failure] 40-40:
"github.com/netbirdio/netbird/shared/sshauth" imported and not used
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (22)
- GitHub Check: Client / Unit (386)
- GitHub Check: Management / Unit (amd64, postgres)
- GitHub Check: Management / Benchmark (amd64, postgres)
- GitHub Check: Management / Unit (amd64, mysql)
- GitHub Check: Relay / Unit (386)
- GitHub Check: Management / Benchmark (amd64, sqlite)
- GitHub Check: Client / Unit (amd64)
- GitHub Check: Relay / Unit (amd64, -race)
- GitHub Check: Client (Docker) / Unit
- GitHub Check: Darwin
- GitHub Check: Linux
- GitHub Check: Windows
- GitHub Check: release
- GitHub Check: Android / Build
- GitHub Check: FreeBSD Port / Build & Test
- GitHub Check: release_ui_darwin
- GitHub Check: Client / Unit
- GitHub Check: release_ui
- GitHub Check: iOS / Build
- GitHub Check: Client / Unit
- GitHub Check: Client / Unit
- GitHub Check: JS / Lint
🔇 Additional comments (6)
management/internals/shared/grpc/server.go (2)
230-232: LGTM! Good observability improvement.The deferred duration logging for the Sync operation enhances observability without affecting the existing control flow.
568-573: LGTM! Consistent timing log.The Login duration logging is consistent with the Sync operation timing pattern and provides valuable observability.
management/server/group.go (1)
445-447: Good optimization with early return guard.The early return when no groups are deletable prevents unnecessary transaction work and avoids incrementing the network serial when no changes occur. This is a clean optimization.
management/server/store/sql_store.go (3)
2901-2902: Using context.Background() ignores caller's context cancellation.The transaction timeout context is created from
context.Background()instead of the passedctxparameter. This means if the caller's context is cancelled, the transaction will continue running for up to 5 minutes, ignoring the cancellation signal.This may be intentional to prevent transaction interruption, but it's unusual behavior. If this is by design, consider documenting why the caller's context is ignored. Otherwise, consider creating the timeout from the caller's context:
timeoutCtx, cancel := context.WithTimeout(ctx, 5*time.Minute).
2936-2944: Good timeout-aware error handling with stack traces.The timeout error handling after the operation is well-implemented. The defensive check for both the operation error and the timeout context error ensures timeout scenarios are properly logged with stack traces for debugging.
2954-2967: Correct nil return on successful commit.The explicit
return nilon Line 2967 correctly ensures that a successful transaction returns no error, rather than potentially propagating a stale error value. The timeout-aware error handling on commit is also well-implemented.
There was a problem hiding this comment.
Actionable comments posted: 2
📜 Review details
Configuration used: defaults
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (1)
management/server/store/sql_store.go
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (27)
- GitHub Check: Management / Integration (amd64, postgres)
- GitHub Check: Management / Benchmark (API) (amd64, postgres)
- GitHub Check: Management / Benchmark (API) (amd64, sqlite)
- GitHub Check: Management / Integration (amd64, sqlite)
- GitHub Check: Management / Benchmark (amd64, postgres)
- GitHub Check: Management / Unit (amd64, postgres)
- GitHub Check: Client (Docker) / Unit
- GitHub Check: Relay / Unit (amd64, -race)
- GitHub Check: Management / Unit (amd64, sqlite)
- GitHub Check: Management / Unit (amd64, mysql)
- GitHub Check: Signal / Unit (amd64)
- GitHub Check: Client / Unit (amd64)
- GitHub Check: Client / Unit (386)
- GitHub Check: Management / Benchmark (amd64, sqlite)
- GitHub Check: Relay / Unit (386)
- GitHub Check: release_ui
- GitHub Check: release
- GitHub Check: FreeBSD Port / Build & Test
- GitHub Check: release_ui_darwin
- GitHub Check: Darwin
- GitHub Check: Windows
- GitHub Check: Linux
- GitHub Check: iOS / Build
- GitHub Check: Android / Build
- GitHub Check: Client / Unit
- GitHub Check: Client / Unit
- GitHub Check: JS / Lint
There was a problem hiding this comment.
Actionable comments posted: 0
♻️ Duplicate comments (2)
management/server/store/sql_store.go (2)
2910-2915: Critical:ExecuteInTransactionshould derive its timeout from the passedctx, notcontext.Background().
ExecuteInTransactioncurrently does:timeoutCtx, cancel := context.WithTimeout(context.Background(), s.transactionTimeout)This breaks the context chain and ignores caller cancellation, deadlines, and values (tracing, request IDs, etc.). It also means that if the caller’s
ctxis already cancelled or has a shorter deadline, that is not respected.You should derive the timeout context from
ctxand also handle the “no/invalid timeout” case explicitly. For example:- timeoutCtx, cancel := context.WithTimeout(context.Background(), s.transactionTimeout) + var ( + timeoutCtx context.Context + cancel context.CancelFunc + ) + if s.transactionTimeout <= 0 { + // No store-level timeout; still honor caller cancellation. + timeoutCtx, cancel = context.WithCancel(ctx) + } else { + timeoutCtx, cancel = context.WithTimeout(ctx, s.transactionTimeout) + } defer cancel()This preserves caller semantics while still enforcing the store‑level timeout.
Go stdlib `context.WithTimeout` semantics when the parent context is already cancelled or has an earlier deadline, and recommended practice for composing timeouts with a parent context.
2949-2951: Fix timeout log messages to reflect the configured duration instead of hardcoded “5 minute timeout”.Both timeout warning logs still say
"transaction exceeded 5 minute timeout"even though the actual timeout comes froms.transactionTimeoutand can be overridden viaNB_STORE_TRANSACTION_TIMEOUT. This can be misleading when operators tune the timeout.Suggested change:
- if errors.Is(err, context.DeadlineExceeded) || errors.Is(timeoutCtx.Err(), context.DeadlineExceeded) { - log.WithContext(ctx).Warnf("transaction exceeded 5 minute timeout after %v, stack: %s", time.Since(startTime), debug.Stack()) - } + if errors.Is(err, context.DeadlineExceeded) || errors.Is(timeoutCtx.Err(), context.DeadlineExceeded) { + log.WithContext(ctx).Warnf("transaction exceeded %v timeout after %v, stack: %s", s.transactionTimeout, time.Since(startTime), debug.Stack()) + }and similarly for the commit‑timeout branch:
- log.WithContext(ctx).Warnf("transaction commit exceeded 5 minute timeout after %v, stack: %s", time.Since(startTime), debug.Stack()) + log.WithContext(ctx).Warnf("transaction commit exceeded %v timeout after %v, stack: %s", s.transactionTimeout, time.Since(startTime), debug.Stack())Confirm that `time.Duration` formatting (e.g., `%v` on `s.transactionTimeout`) produces human-readable output suitable for logs in Go.Also applies to: 2963-2969
🧹 Nitpick comments (2)
management/server/store/sql_store.go (1)
58-68: Per‑store transaction timeout wiring looks good; consider guarding non‑positive values.The new
transactionTimeoutfield is correctly initialized inNewSqlStore(including theskipMigrationpath) and logged, so allSqlStoreinstances share a consistent timeout.One small hardening you might consider: if
NB_STORE_TRANSACTION_TIMEOUTparses to<= 0, either:
- fall back to the default
5 * time.Minute, or- treat it explicitly as “no timeout” and document/log that behavior.
That avoids accidental
0s/ negative misconfig causing transactions to immediately time out.Also applies to: 89-96, 112-115, 133-133
management/server/store/sql_store_test.go (1)
8-8: Test is solid for SQLite; consider minor robustness for multi-driver future-proofing.The new
TestSqlStore_ExecuteInTransaction_Timeoutcorrectly verifies that:
NB_STORE_TRANSACTION_TIMEOUTis parsed and applied (1 second),- a transaction exceeding that timeout fails with an error.
The test uses
errors.Is(err, context.DeadlineExceeded)which is the correct pattern. However, across different SQL drivers (Postgres, MySQL, SQLite), the exact error returned fromCommit()on context timeout can vary: Postgres may surface DB-specific errors, MySQL behavior is inconsistent across drivers, and only SQLite with proper context support reliably returnsDeadlineExceeded. Since this test currently targets SQLite only, the assertion works as-is, but if you plan to extend this to other engines, consider broadening the check:- require.Error(t, err) - assert.True(t, errors.Is(err, context.DeadlineExceeded), "expected context.DeadlineExceeded error, got: %v", err) + require.Error(t, err) + assert.True(t, errors.Is(err, context.DeadlineExceeded) || errors.Is(err, context.Canceled), + "expected context deadline or cancel error, got: %v", err)This makes the test more resilient if run against multiple database engines in the future.
📜 Review details
Configuration used: defaults
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (2)
management/server/store/sql_store.gomanagement/server/store/sql_store_test.go
🧰 Additional context used
🧬 Code graph analysis (2)
management/server/store/sql_store.go (1)
management/server/telemetry/store_metrics.go (1)
StoreMetrics(11-18)
management/server/store/sql_store_test.go (2)
management/server/store/store.go (1)
Store(50-208)management/server/store/sql_store.go (1)
SqlStore(59-68)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (28)
- GitHub Check: Client / Unit (amd64)
- GitHub Check: Management / Benchmark (API) (amd64, sqlite)
- GitHub Check: Management / Integration (amd64, postgres)
- GitHub Check: Signal / Unit (386)
- GitHub Check: Management / Integration (amd64, sqlite)
- GitHub Check: Management / Benchmark (API) (amd64, postgres)
- GitHub Check: Management / Unit (amd64, sqlite)
- GitHub Check: Management / Unit (amd64, postgres)
- GitHub Check: Client / Unit (386)
- GitHub Check: Management / Benchmark (amd64, sqlite)
- GitHub Check: Management / Unit (amd64, mysql)
- GitHub Check: Relay / Unit (386)
- GitHub Check: Management / Benchmark (amd64, postgres)
- GitHub Check: Relay / Unit (amd64, -race)
- GitHub Check: Client (Docker) / Unit
- GitHub Check: Client / Unit
- GitHub Check: Client / Unit
- GitHub Check: release
- GitHub Check: FreeBSD Port / Build & Test
- GitHub Check: release_ui_darwin
- GitHub Check: release_ui
- GitHub Check: Linux
- GitHub Check: Darwin
- GitHub Check: Windows
- GitHub Check: Client / Unit
- GitHub Check: iOS / Build
- GitHub Check: Android / Build
- GitHub Check: JS / Lint
There was a problem hiding this comment.
Actionable comments posted: 0
🧹 Nitpick comments (1)
management/server/store/sql_store.go (1)
89-95: Consider adding timeout validation and parse error logging.The current implementation silently falls back to the default timeout if parsing fails or if an invalid timeout is provided. Consider these optional improvements for better robustness:
- Log a warning when parsing fails so operators are aware their configuration is ignored
- Validate that the parsed timeout is positive (reject zero or negative values)
🔎 Suggested enhancement
transactionTimeout := 5 * time.Minute if v := os.Getenv("NB_STORE_TRANSACTION_TIMEOUT"); v != "" { if parsed, err := time.ParseDuration(v); err == nil { + if parsed <= 0 { + log.WithContext(ctx).Warnf("Invalid NB_STORE_TRANSACTION_TIMEOUT %v (must be positive), using default %v", parsed, transactionTimeout) + } else { transactionTimeout = parsed + } + } else { + log.WithContext(ctx).Warnf("Failed to parse NB_STORE_TRANSACTION_TIMEOUT=%q: %v, using default %v", v, err, transactionTimeout) } } log.WithContext(ctx).Infof("Setting transaction timeout to %v", transactionTimeout)
📜 Review details
Configuration used: defaults
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (2)
management/server/store/sql_store.gomanagement/server/store/sql_store_test.go
🚧 Files skipped from review as they are similar to previous changes (1)
- management/server/store/sql_store_test.go
🧰 Additional context used
🧬 Code graph analysis (1)
management/server/store/sql_store.go (2)
monotime/time.go (2)
Now(28-31)Since(33-35)management/server/telemetry/store_metrics.go (1)
StoreMetrics(11-18)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (28)
- GitHub Check: Management / Benchmark (API) (amd64, postgres)
- GitHub Check: Management / Integration (amd64, postgres)
- GitHub Check: Management / Benchmark (API) (amd64, sqlite)
- GitHub Check: Management / Integration (amd64, sqlite)
- GitHub Check: Signal / Unit (386)
- GitHub Check: Relay / Unit (386)
- GitHub Check: Management / Benchmark (amd64, sqlite)
- GitHub Check: Management / Unit (amd64, sqlite)
- GitHub Check: Management / Unit (amd64, mysql)
- GitHub Check: Relay / Unit (amd64, -race)
- GitHub Check: Management / Unit (amd64, postgres)
- GitHub Check: Management / Benchmark (amd64, postgres)
- GitHub Check: Client / Unit (amd64)
- GitHub Check: Signal / Unit (amd64)
- GitHub Check: Client / Unit (386)
- GitHub Check: Client (Docker) / Unit
- GitHub Check: Client / Unit
- GitHub Check: release_ui_darwin
- GitHub Check: release
- GitHub Check: release_ui
- GitHub Check: FreeBSD Port / Build & Test
- GitHub Check: Linux
- GitHub Check: Windows
- GitHub Check: Darwin
- GitHub Check: iOS / Build
- GitHub Check: Android / Build
- GitHub Check: JS / Lint
- GitHub Check: Client / Unit
🔇 Additional comments (1)
management/server/store/sql_store.go (1)
2910-2976: Transaction timeout implementation looks good.The changes correctly implement configurable transaction timeouts with proper timeout detection and logging. The timeout value is now correctly referenced in log messages (addressing the past review comment).
Note: The use of
context.Background()at line 2910 (rather than deriving from the passedctxparameter) is a known limitation that was previously discussed and will be addressed in a future improvement per maintainer feedback.
|



Describe your changes
Issue ticket number and link
Stack
Checklist
Documentation
Select exactly one:
Docs PR URL (required if "docs added" is checked)
Paste the PR link from https://github.com/netbirdio/docs here:
https://github.com/netbirdio/docs/pull/__
Summary by CodeRabbit
Bug Fixes
Performance
Tests
Other
✏️ Tip: You can customize this high-level summary in your review settings.