[management] Replace in-memory expose tracker with SQL-backed operations#5494
[management] Replace in-memory expose tracker with SQL-backed operations#5494
Conversation
The expose tracker used sync.Map for in-memory TTL tracking of active expose sessions, which broke in HA deployments (renew landing on a different instance) and lost all sessions on restart. Replace with SQL-backed operations that reuse the existing meta_last_renewed_at column: - Add store methods: RenewEphemeralService, GetExpiredEphemeralServices, CountEphemeralServicesByPeer, EphemeralServiceExists - Move duplicate/limit checks inside a transaction with row-level locking (SELECT ... FOR UPDATE) to prevent concurrent bypass - Reaper re-checks expiry under row lock to avoid deleting a just-renewed service and prevent duplicate event emission in HA - Add composite index on (source, source_peer) for efficient queries - Batch-limit and column-select the reaper query to avoid DB/GC spikes - Filter out malformed rows with empty source_peer
|
Note Reviews pausedIt looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the Use the following commands to manage reviews:
Use the checkboxes below for quick actions:
📝 WalkthroughWalkthroughReplaces in-memory per-peer ephemeral-service tracking with a DB-backed exposeReaper. Adds store methods to renew, query, count, and check ephemeral services; refactors Manager to persist/delete/renew ephemeral services transactionally; reaper batches expired services and deletes them via the Manager/store. Tests converted to integration-style assertions. Changes
Sequence Diagram(s)mermaid Reaper->>Store: GetExpiredEphemeralServices(ttl, limit) Estimated code review effort🎯 4 (Complex) | ⏱️ ~45 minutes Possibly related PRs
Suggested reviewers
Poem
🚥 Pre-merge checks | ✅ 2 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (2 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches
🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
There was a problem hiding this comment.
Actionable comments posted: 2
🧹 Nitpick comments (1)
management/internals/modules/reverseproxy/service/manager/expose_tracker_test.go (1)
67-103: Rename this test to match what it actually verifies.
TestConcurrentReapAndRenewcurrently performs concurrent reap + count, not renew.✏️ Suggested rename
-func TestConcurrentReapAndRenew(t *testing.T) { +func TestConcurrentReapAndCount(t *testing.T) {🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@management/internals/modules/reverseproxy/service/manager/expose_tracker_test.go` around lines 67 - 103, The test function TestConcurrentReapAndRenew is misnamed because it runs a concurrent reap and a CountEphemeralServicesByPeer call (not a renew); rename the test to something accurate (e.g., TestConcurrentReapAndCount or TestConcurrentReapAndCountEphemeralServicesByPeer) by updating the function declaration, keeping the body intact; this affects the function named TestConcurrentReapAndRenew and references to mgr.exposeReaper.reapExpiredExposes and mgr.store.CountEphemeralServicesByPeer so reviewers can locate the logic.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@management/internals/modules/reverseproxy/service/manager/expose_tracker.go`:
- Around line 49-56: The code calls status.FromError(err) and dereferences
s.ErrorType without checking s for nil; update the deleteExpiredPeerService
error handling in expose_tracker.go (the block where err :=
r.manager.deleteExpiredPeerService(...) and s, _ := status.FromError(err) are
used) to first check err==nil, then if err!=nil call status.FromError and guard
s!=nil before accessing s.ErrorType; treat non-status errors (s==nil) as generic
errors (log or handle appropriately) and keep the existing NotFound branch when
s!=nil and s.ErrorType == status.NotFound so the reaper goroutine cannot panic
on non-status errors.
In `@management/internals/modules/reverseproxy/service/manager/manager.go`:
- Around line 222-237: In persistNewEphemeralService: the per-peer expose limit
can be bypassed due to zero-row gap races because EphemeralServiceExists and
CountEphemeralServicesByPeer don't lock when no rows exist; to fix, inside the
transaction call transaction.GetPeerByID(ctx, store.LockingStrengthUpdate,
accountID, peerID) before invoking
EphemeralServiceExists/CountEphemeralServicesByPeer to acquire a FOR UPDATE lock
on the peer row and serialize concurrent create attempts for that peer, and
handle/return any error from GetPeerByID (e.g., not found or DB error) before
proceeding with the existing checks.
---
Nitpick comments:
In
`@management/internals/modules/reverseproxy/service/manager/expose_tracker_test.go`:
- Around line 67-103: The test function TestConcurrentReapAndRenew is misnamed
because it runs a concurrent reap and a CountEphemeralServicesByPeer call (not a
renew); rename the test to something accurate (e.g., TestConcurrentReapAndCount
or TestConcurrentReapAndCountEphemeralServicesByPeer) by updating the function
declaration, keeping the body intact; this affects the function named
TestConcurrentReapAndRenew and references to mgr.exposeReaper.reapExpiredExposes
and mgr.store.CountEphemeralServicesByPeer so reviewers can locate the logic.
ℹ️ Review info
Configuration used: defaults
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (8)
management/internals/modules/reverseproxy/service/manager/expose_tracker.gomanagement/internals/modules/reverseproxy/service/manager/expose_tracker_test.gomanagement/internals/modules/reverseproxy/service/manager/manager.gomanagement/internals/modules/reverseproxy/service/manager/manager_test.gomanagement/internals/modules/reverseproxy/service/service.gomanagement/server/store/sql_store.gomanagement/server/store/store.gomanagement/server/store/store_mock.go
…bypass - Serialize concurrent service creation for the same peer using row-level locking on the peer table. - Adjust error handling in expose tracker to handle deletion results more robustly.
There was a problem hiding this comment.
Actionable comments posted: 1
🧹 Nitpick comments (1)
management/internals/modules/reverseproxy/service/manager/expose_tracker.go (1)
46-59: Consider short-circuiting the reap loop when context is canceled.If shutdown happens mid-batch, Line 49 will execute with a canceled context and may emit noisy errors for the rest of the batch. A fast
ctx.Err()check inside the loop makes teardown quieter.Suggested tweak
for _, svc := range expired { + if ctx.Err() != nil { + return + } + log.Infof("reaping expired expose session for peer %s, domain %s", svc.SourcePeer, svc.Domain) err := r.manager.deleteExpiredPeerService(ctx, svc.AccountID, svc.SourcePeer, svc.ID)🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@management/internals/modules/reverseproxy/service/manager/expose_tracker.go` around lines 46 - 59, The loop in expose_tracker.go should short-circuit when the context is canceled to avoid noisy errors during shutdown: inside the for _, svc := range expired loop, check ctx.Err() (or ctx.Done()) before calling r.manager.deleteExpiredPeerService and break/return if it is non-nil, so that deleteExpiredPeerService is not invoked with a canceled context; update the block around the call to r.manager.deleteExpiredPeerService(ctx, svc.AccountID, svc.SourcePeer, svc.ID) accordingly to stop processing remaining svc entries when ctx is canceled.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@management/internals/modules/reverseproxy/service/manager/manager.go`:
- Around line 885-912: The second TTL check is unreliable for inferring
deletion; modify the transaction block around transaction.DeleteService to set a
local boolean (e.g., deleted := false then deleted = true only when
transaction.DeleteService succeeds) and return that state from the transaction
context to the outer scope, then guard the post-transaction side effects (calls
to m.store.GetPeerByID, addPeerInfoToEventMeta, m.accountManager.StoreEvent,
m.proxyController.SendServiceUpdateToCluster, and
m.accountManager.UpdateAccountPeers) behind that deleted flag instead of
re-checking svc.Meta.LastRenewedAt against exposeTTL; ensure the flag is only
set on successful deletion and used to decide whether to emit events/send
cluster updates.
---
Nitpick comments:
In `@management/internals/modules/reverseproxy/service/manager/expose_tracker.go`:
- Around line 46-59: The loop in expose_tracker.go should short-circuit when the
context is canceled to avoid noisy errors during shutdown: inside the for _, svc
:= range expired loop, check ctx.Err() (or ctx.Done()) before calling
r.manager.deleteExpiredPeerService and break/return if it is non-nil, so that
deleteExpiredPeerService is not invoked with a canceled context; update the
block around the call to r.manager.deleteExpiredPeerService(ctx, svc.AccountID,
svc.SourcePeer, svc.ID) accordingly to stop processing remaining svc entries
when ctx is canceled.
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: 44dd1e57-1e36-4943-bd2b-99513b560c60
📒 Files selected for processing (2)
management/internals/modules/reverseproxy/service/manager/expose_tracker.gomanagement/internals/modules/reverseproxy/service/manager/manager.go
There was a problem hiding this comment.
🧹 Nitpick comments (1)
management/internals/modules/reverseproxy/service/manager/manager.go (1)
723-729: Consider removing the extra peer read in the create flow.
CreateServiceFromPeerreads the peer at Line 718, andpersistNewEphemeralServicereads/locks it again at Line 228. You can likely keep only one read (post-persist for event metadata) to reduce DB round-trips.🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@management/internals/modules/reverseproxy/service/manager/manager.go` around lines 723 - 729, CreateServiceFromPeer currently reads/locks the peer before persisting but persistNewEphemeralService reads/locks it again; remove the pre-persist peer read to eliminate the extra DB round-trip by only setting svc.SourcePeer = peerID and letting persistNewEphemeralService perform the peer load/lock and return any needed peer metadata/event info. Concretely, stop calling the peer-read in CreateServiceFromPeer (leave svc.SourcePeer = peerID and LastRenewedAt set), update persistNewEphemeralService to accept the peerID (or keep its signature but perform the single read internally) and return the loaded peer or event fields so CreateServiceFromPeer can use them for subsequent event/logging.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Nitpick comments:
In `@management/internals/modules/reverseproxy/service/manager/manager.go`:
- Around line 723-729: CreateServiceFromPeer currently reads/locks the peer
before persisting but persistNewEphemeralService reads/locks it again; remove
the pre-persist peer read to eliminate the extra DB round-trip by only setting
svc.SourcePeer = peerID and letting persistNewEphemeralService perform the peer
load/lock and return any needed peer metadata/event info. Concretely, stop
calling the peer-read in CreateServiceFromPeer (leave svc.SourcePeer = peerID
and LastRenewedAt set), update persistNewEphemeralService to accept the peerID
(or keep its signature but perform the single read internally) and return the
loaded peer or event fields so CreateServiceFromPeer can use them for subsequent
event/logging.
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: 2a522487-413e-48af-8f95-571818425f8d
📒 Files selected for processing (1)
management/internals/modules/reverseproxy/service/manager/manager.go
There was a problem hiding this comment.
Actionable comments posted: 1
🧹 Nitpick comments (1)
management/internals/modules/reverseproxy/service/manager/expose_tracker.go (1)
52-53: Avoid success-sounding info log before deletion outcome is known.Line 52 logs as if reaping is happening, but deletion can legitimately no-op after lock re-check (recent renew). Consider debug wording like “attempting reap”.
📝 Suggested tweak
- log.Infof("reaping expired expose session for peer %s, domain %s", svc.SourcePeer, svc.Domain) + log.Debugf("attempting to reap expired expose candidate for peer %s, domain %s", svc.SourcePeer, svc.Domain)🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@management/internals/modules/reverseproxy/service/manager/expose_tracker.go` around lines 52 - 53, The info log "reaping expired expose session for peer %s, domain %s" should not assert success before deletion; change this log (at the point where svc is checked and before deletion/lock re-check) to a debug/trace-level message like "attempting to reap expired expose session for peer %s, domain %s" and only emit an info-level message after you actually remove the session (or emit a debug/no-op message if the lock re-check prevents deletion); update the log call that references svc.SourcePeer and svc.Domain accordingly so the pre-delete message doesn't sound like a completed action.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@management/internals/modules/reverseproxy/service/manager/expose_tracker.go`:
- Around line 27-29: Replace the blocking time.Sleep call so startup jitter
respects context cancellation: after computing rn := rand.IntN(10), create a
timer for time.Duration(rn)*time.Second and use a select that waits on either
the timer.C or ctx.Done(); on ctx.Done() stop the timer and return/abort the
startup path to avoid delaying shutdown. Target the rn := rand.IntN(10) and
time.Sleep(...) site and ensure you stop the timer to avoid leaks.
---
Nitpick comments:
In `@management/internals/modules/reverseproxy/service/manager/expose_tracker.go`:
- Around line 52-53: The info log "reaping expired expose session for peer %s,
domain %s" should not assert success before deletion; change this log (at the
point where svc is checked and before deletion/lock re-check) to a
debug/trace-level message like "attempting to reap expired expose session for
peer %s, domain %s" and only emit an info-level message after you actually
remove the session (or emit a debug/no-op message if the lock re-check prevents
deletion); update the log call that references svc.SourcePeer and svc.Domain
accordingly so the pre-delete message doesn't sound like a completed action.
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: 1098c29a-0148-4529-9f95-99bbc30a796c
📒 Files selected for processing (1)
management/internals/modules/reverseproxy/service/manager/expose_tracker.go
|
There was a problem hiding this comment.
🧹 Nitpick comments (1)
management/internals/modules/reverseproxy/service/manager/manager.go (1)
57-59: Consider a nil-guard before starting the reaper.If
Manageris ever created withoutNewManager(e.g., struct literal in tests), this call can panic. A tiny guard makes startup more robust.Suggested hardening
func (m *Manager) StartExposeReaper(ctx context.Context) { + if m.exposeReaper == nil { + log.WithContext(ctx).Warn("expose reaper is not initialized") + return + } m.exposeReaper.StartExposeReaper(ctx) }🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@management/internals/modules/reverseproxy/service/manager/manager.go` around lines 57 - 59, Add a nil-guard in Manager.StartExposeReaper so it does not panic when Manager or its exposeReaper field is nil: check that m != nil and m.exposeReaper != nil before calling m.exposeReaper.StartExposeReaper(ctx), and return early if either is nil; update any callers/tests if they rely on the panic behavior. This change targets the Manager.StartExposeReaper method and the exposeReaper field usage.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Nitpick comments:
In `@management/internals/modules/reverseproxy/service/manager/manager.go`:
- Around line 57-59: Add a nil-guard in Manager.StartExposeReaper so it does not
panic when Manager or its exposeReaper field is nil: check that m != nil and
m.exposeReaper != nil before calling m.exposeReaper.StartExposeReaper(ctx), and
return early if either is nil; update any callers/tests if they rely on the
panic behavior. This change targets the Manager.StartExposeReaper method and the
exposeReaper field usage.
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: 452adb64-7694-45ce-b9e7-9fc595f881ab
📒 Files selected for processing (1)
management/internals/modules/reverseproxy/service/manager/manager.go



Describe your changes
The expose tracker used sync.Map for in-memory TTL tracking of active expose sessions, which broke and lost all sessions on restart.
Replace with SQL-backed operations that reuse the existing meta_last_renewed_at column:
Issue ticket number and link
Stack
Checklist
Documentation
Select exactly one:
Docs PR URL (required if "docs added" is checked)
Paste the PR link from https://github.com/netbirdio/docs here:
https://github.com/netbirdio/docs/pull/__
Summary by CodeRabbit
New Features
Bug Fixes
Tests