[management] Refactor expose feature: move business logic from gRPC to manager#5435
Conversation
Consolidate all expose business logic (validation, permission checks, TTL tracking, reaping) into the manager layer, making the gRPC layer a pure transport adapter that only handles proto conversion and authentication.
- Add ExposeServiceRequest/ExposeServiceResponse domain types with validation in the reverseproxy package
- Move expose tracker (TTL tracking, reaping, per-peer limits) from gRPC server into manager/expose_tracker.go
- Internalize tracking in CreateServiceFromPeer, RenewServiceFromPeer, and new StopServiceFromPeer so callers don't manage tracker state
- Untrack ephemeral services in DeleteService/DeleteAllServices to keep the tracker in sync when services are deleted via API
- Simplify gRPC expose handlers to parse, auth, convert, delegate
- Remove tracker methods from the Manager interface (internal detail)
📝 Walkthrough
Moves peer expose handling into an in-manager exposeTracker with TTL/reaper, replaces legacy peer APIs with request/response types (ExposeServiceRequest/ExposeServiceResponse), adds Renew/Stop/StartExposeReaper operations, updates callers/mocks/tests, and increases a client keep-alive interval from 10s to 30s.
Sequence Diagram(s)

```mermaid
sequenceDiagram
    participant Client
    participant GRPC_Server as "gRPC Server"
    participant Manager as "ReverseProxyManager"
    participant Tracker as "exposeTracker"
    participant Store
    Client->>GRPC_Server: CreateExpose(ExposeServiceRequest)
    GRPC_Server->>Manager: CreateServiceFromPeer(ctx, accountID, peerID, req)
    Manager->>Tracker: TrackExposeIfAllowed(peerID, domain, accountID)
    alt allowed
        Manager->>Store: Create ephemeral Service record
        Manager->>Tracker: register trackedExpose (lastRenewed)
        Manager-->>GRPC_Server: ExposeServiceResponse(domain, serviceName, serviceURL)
        GRPC_Server-->>Client: success
    else denied
        Manager-->>GRPC_Server: error
        GRPC_Server-->>Client: error
    end
    loop every 30s
        Tracker->>Tracker: reapExpiredExposes()
        Tracker->>Manager: deleteServiceFromPeer(accountID, peerID, domain, expired=true)
        Manager->>Store: delete Service
    end
```
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~45 minutes
🚥 Pre-merge checks: ✅ 2 passed | ❌ 1 failed (1 warning)
Actionable comments posted: 3
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
management/internals/modules/reverseproxy/manager/manager.go (1)
679-729: ⚠️ Potential issue | 🔴 Critical: Race condition exists: concurrent requests can exceed the peer expose limit.
`CheckPeerExposeLimitWithLock` acquires `exposeCreateMu` only for the check itself (lines 98-100 in expose_tracker.go), then releases it immediately. `TrackExpose` (line 704) runs outside this lock after service persistence completes. Two concurrent requests can both pass the limit check and subsequently both track exposes, exceeding `maxExposesPerPeer`.
The `alreadyTracked` check at line 724 detects duplicate (peerID, domain) pairs but does not prevent limit overflow. Consolidate the check and track into a single atomic method that holds the lock throughout, or re-check the limit under the same lock before tracking.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@management/internals/modules/reverseproxy/manager/manager.go` around lines 679 - 729, The current flow checks the peer expose limit with exposeTracker.CheckPeerExposeLimitWithLock and then later calls exposeTracker.TrackExpose after persisting the service, allowing two concurrent requests to both pass the check and exceed the limit; fix this by consolidating the check+track into a single atomic operation under the same lock (or by re-checking the limit while holding the lock immediately before tracking). Add a new method on exposeTracker (e.g., CheckAndTrackExposeOrError(peerID, domain, accountID) implemented using exposeCreateMu) that verifies limit and existing domain and, if OK, records the expose atomically; call this new method in manager.go instead of separate CheckPeerExposeLimitWithLock and TrackExpose (and on alreadyTracked error keep the existing deleteServiceFromPeer + AlreadyExists handling). Ensure expose_tracker.go is updated to use the same exposeCreateMu and return clear status for "already tracked" vs "limit exceeded."
🧹 Nitpick comments (1)
management/internals/server/boot.go (1)
155-159: Use a lifecycle context for StartExposeReaper.
Starting the reaper with `context.Background()` means it won't stop on server shutdown or in tests. Consider passing a server lifecycle context (or storing a cancel func) so the reaper can be stopped cleanly.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@management/internals/server/boot.go` around lines 155 - 159, StartExposeReaper is being launched with context.Background(), so it won't stop on server shutdown; change it to use the server lifecycle context or a cancellable context tied to the server so the reaper can be stopped. Locate ReverseProxyManager() and where srv.SetReverseProxyManager(reverseProxyMgr) is called and replace the context.Background() passed to reverseProxyMgr.StartExposeReaper with the server's lifecycle context (or create a context.WithCancel stored on the server and call cancel on shutdown) so the reaper is cancelled during server shutdown/tests.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@management/internals/modules/reverseproxy/manager/expose_tracker.go`:
- Around line 95-101: CheckPeerExposeLimitWithLock only locks while counting,
allowing a race where multiple goroutines pass the check and then call
TrackExpose concurrently; replace this with an atomic try-and-track operation:
add a new method (e.g., TryTrackPeerExpose or TrackExposeIfAllowed) that
acquires exposeCreateMu, calls CountPeerExposes(peerID) and if count <
maxExposesPerPeer calls existing TrackExpose logic (or increments the tracked
count) while still holding the lock, then returns success/failure. Update
callers (e.g., the TrackExpose call in manager.go) to use the new
TryTrackPeerExpose so the check and track happen under the same exposeCreateMu
lock.
- Around line 120-133: reapExpiredExposes can delete an expose that a concurrent
RenewTrackedExpose just refreshed; fix by adding an expiring guard on
trackedExpose (e.g., an atomic/boolean field named expiring) and use it as a CAS
under the same lock: in reapExpiredExposes, when you detect expiry acquire
expose.mu, check expiry, set expiring=true (or CAS true) before unlocking and
then proceed to LoadAndDelete and deletion; in RenewTrackedExpose, acquire
expose.mu or use CAS to refuse/observe renewals when expiring is true (return
false/error), so a renewal cannot succeed once the reaper has marked the expose
as expiring; update references to trackedExpose, reapExpiredExposes and
RenewTrackedExpose accordingly.
In `@management/internals/modules/reverseproxy/reverseproxy.go`:
- Around line 480-521: ExposeServiceRequest.Validate must guard against a nil
receiver and validate the port upper bound; update the Validate method to first
return an error if the receiver is nil (e.g., "request cannot be nil"), then
change the port check to require 1 <= Port <= 65535 (replace the r.Port <= 0
check with a range check and an appropriate error message); keep the existing
protocol, pin, user group and NamePrefix checks unchanged in Validate to
preserve current validations.
---
Outside diff comments:
In `@management/internals/modules/reverseproxy/manager/manager.go`:
- Around line 679-729: The current flow checks the peer expose limit with
exposeTracker.CheckPeerExposeLimitWithLock and then later calls
exposeTracker.TrackExpose after persisting the service, allowing two concurrent
requests to both pass the check and exceed the limit; fix this by consolidating
the check+track into a single atomic operation under the same lock (or by
re-checking the limit while holding the lock immediately before tracking). Add a
new method on exposeTracker (e.g., CheckAndTrackExposeOrError(peerID, domain,
accountID) implemented using exposeCreateMu) that verifies limit and existing
domain and, if OK, records the expose atomically; call this new method in
manager.go instead of separate CheckPeerExposeLimitWithLock and TrackExpose (and
on alreadyTracked error keep the existing deleteServiceFromPeer + AlreadyExists
handling). Ensure expose_tracker.go is updated to use the same exposeCreateMu
and return clear status for "already tracked" vs "limit exceeded."
---
Nitpick comments:
In `@management/internals/server/boot.go`:
- Around line 155-159: StartExposeReaper is being launched with
context.Background(), so it won't stop on server shutdown; change it to use the
server lifecycle context or a cancellable context tied to the server so the
reaper can be stopped. Locate ReverseProxyManager() and where
srv.SetReverseProxyManager(reverseProxyMgr) is called and replace the
context.Background() passed to reverseProxyMgr.StartExposeReaper with the
server's lifecycle context (or create a context.WithCancel stored on the server
and call cancel on shutdown) so the reaper is cancelled during server
shutdown/tests.
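The nil-receiver guard and 1-65535 port range check requested for `ExposeServiceRequest.Validate` can be sketched like this. The struct here is a one-field stand-in, not the real request type; the protocol, pin, user-group, and NamePrefix checks mentioned above are elided.

```go
package main

import "fmt"

// ExposeServiceRequest is a simplified stand-in for the real type.
type ExposeServiceRequest struct {
	Port int
}

// Validate guards against a nil receiver first, then enforces the
// full valid TCP/UDP port range rather than only Port > 0.
func (r *ExposeServiceRequest) Validate() error {
	if r == nil {
		return fmt.Errorf("request cannot be nil")
	}
	if r.Port < 1 || r.Port > 65535 {
		return fmt.Errorf("invalid port %d: must be in range 1-65535", r.Port)
	}
	return nil
}

func main() {
	var nilReq *ExposeServiceRequest
	fmt.Println(nilReq.Validate())                               // nil receiver is safe to call
	fmt.Println((&ExposeServiceRequest{Port: 70000}).Validate()) // above 65535 now rejected
	fmt.Println((&ExposeServiceRequest{Port: 8080}).Validate())
}
```

Calling a pointer-receiver method on a nil pointer is legal in Go as long as the method checks the receiver before dereferencing it, which is why the nil guard must come first.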
ℹ️ Review info
Configuration used: defaults
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (17)
- client/internal/expose/manager.go
- management/internals/modules/reverseproxy/interface.go
- management/internals/modules/reverseproxy/interface_mock.go
- management/internals/modules/reverseproxy/manager/expose_tracker.go
- management/internals/modules/reverseproxy/manager/expose_tracker_test.go
- management/internals/modules/reverseproxy/manager/manager.go
- management/internals/modules/reverseproxy/manager/manager_test.go
- management/internals/modules/reverseproxy/reverseproxy.go
- management/internals/modules/reverseproxy/reverseproxy_test.go
- management/internals/server/boot.go
- management/internals/shared/grpc/expose_service.go
- management/internals/shared/grpc/expose_service_test.go
- management/internals/shared/grpc/proxy_group_access_test.go
- management/internals/shared/grpc/server.go
- management/internals/shared/grpc/validate_session_test.go
- management/server/http/handlers/proxy/auth_callback_integration_test.go
- proxy/management_integration_test.go
💤 Files with no reviewable changes (2)
- management/internals/shared/grpc/server.go
- management/internals/shared/grpc/expose_service_test.go
… tests
- Replace `TrackExpose` with `TrackExposeIfAllowed` for atomic limit checks and tracking.
- Add `expiring` flag to detect reaping in progress during renewal.
- Update `RenewTrackedExpose` to reject expiring entries.
- Refactor reaping to set `expiring` before deletion.
- Validate ports to be within range (1-65535) or fail gracefully.
- Enhance test coverage for limits, renewal, reaping, and validation edge cases.
Actionable comments posted: 1
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
management/internals/modules/reverseproxy/manager/manager.go (1)
687-693: ⚠️ Potential issue | 🟠 Major: Guard against nil clusterDeriver when Domain is omitted.
If `clusterDeriver` is nil and `req.Domain` is empty, `buildRandomDomain` will panic when calling `GetClusterDomains()`. Add a guard to return a clear error when no cluster domains are configured.
🛡️ Suggested guard

```diff
 if service.Domain == "" {
+	if m.clusterDeriver == nil {
+		return nil, status.Errorf(status.PreconditionFailed, "no cluster domains configured for peer expose")
+	}
 	domain, err := m.buildRandomDomain(service.Name)
 	if err != nil {
 		return nil, fmt.Errorf("build random domain for service %s: %w", service.Name, err)
 	}
 	service.Domain = domain
 }
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@management/internals/modules/reverseproxy/manager/manager.go` around lines 687 - 693, When omitting Domain (service.Domain / req.Domain) and before calling buildRandomDomain, add a guard that checks whether clusterDeriver is nil (or inside buildRandomDomain check its receiver) and return a clear error like "no cluster domains configured" instead of allowing a panic from GetClusterDomains(); update the code paths that call buildRandomDomain (e.g., in the block that sets service.Domain) to propagate this error so callers get a readable failure rather than a nil-pointer panic (refer to symbols: clusterDeriver, buildRandomDomain, GetClusterDomains, service.Domain).
🧹 Nitpick comments (2)
management/internals/modules/reverseproxy/reverseproxy.go (1)
528-548: Validate the custom domain suffix before concatenation.
When `Domain` is provided, `ToService` builds `serviceName + "." + r.Domain`; malformed suffixes (leading dot/invalid DNS) will generate an invalid FQDN and URL. Consider rejecting invalid suffixes in `Validate()`.
♻️ Suggested validation

```diff
 func (r *ExposeServiceRequest) Validate() error {
 @@
 	if r.NamePrefix != "" && !validNamePrefix.MatchString(r.NamePrefix) {
 		return fmt.Errorf("invalid name prefix %q: must be lowercase alphanumeric with optional hyphens, 1-32 characters", r.NamePrefix)
 	}
+	if r.Domain != "" && !net.IsDomainName(r.Domain) {
+		return fmt.Errorf("invalid domain %q", r.Domain)
+	}
 	return nil
 }
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@management/internals/modules/reverseproxy/reverseproxy.go` around lines 528 - 548, The ToService method concatenates serviceName + "." + r.Domain without validating r.Domain, which can produce invalid FQDNs; update the ExposeServiceRequest.Validate() method to reject or normalize bad domain suffixes (e.g., disallow leading dots, empty labels, spaces, or characters not allowed in DNS labels and ensure each label length and overall length are valid) and return an error for invalid values; reference the ToService method and the ExposeServiceRequest.Validate() function so callers only reach ToService with a vetted r.Domain (or trim a single leading dot in Validate() if normalization is preferred).
management/internals/modules/reverseproxy/manager/expose_tracker.go (1)
131-146: Consider retrying reaper deletes on transient failures.
The tracking entry is removed before `deleteServiceFromPeer` completes. If deletion fails, the expired service becomes untracked and won't be retried. Consider deleting only after a successful cleanup (or reinserting on error).
♻️ Possible ordering tweak

```diff
 if expired {
-	t.activeExposes.Delete(key)
-	log.Infof("reaping expired expose session for peer %s, domain %s", expose.peerID, expose.domain)
-	if err := t.manager.deleteServiceFromPeer(context.Background(), expose.accountID, expose.peerID, expose.domain, true); err != nil {
-		log.Errorf("failed to delete expired peer-exposed service for domain %s: %v", expose.domain, err)
-	}
+	if err := t.manager.deleteServiceFromPeer(context.Background(), expose.accountID, expose.peerID, expose.domain, true); err != nil {
+		log.Errorf("failed to delete expired peer-exposed service for domain %s: %v", expose.domain, err)
+		return true
+	}
+	t.activeExposes.Delete(key)
+	log.Infof("reaping expired expose session for peer %s, domain %s", expose.peerID, expose.domain)
 }
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@management/internals/modules/reverseproxy/manager/expose_tracker.go` around lines 131 - 146, reapExpiredExposes currently removes the trackedExpose from t.activeExposes before calling t.manager.deleteServiceFromPeer, which means a transient delete error leaves the expose untracked; change the logic in reapExpiredExposes so that you call deleteServiceFromPeer first and only call t.activeExposes.Delete(key) after a successful delete, or if delete fails reinsert or leave the trackedExpose in t.activeExposes (and clear expiring flag) so it can be retried later; adjust use of trackedExpose.expiring and locking around expose.mu to ensure consistency when retrying deletes and avoid races.
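The delete-before-untrack ordering described above can be demonstrated in isolation. The `tracker` type and `deleteService` below are hypothetical stand-ins for the real tracker and `deleteServiceFromPeer`; the point is that a failed delete leaves the entry in the map so the next reap cycle retries it.

```go
package main

import (
	"errors"
	"fmt"
	"sync"
)

type tracker struct {
	activeExposes sync.Map // key -> domain
	failDelete    bool     // simulates a transient store error
}

func (t *tracker) deleteService(domain string) error {
	if t.failDelete {
		return errors.New("transient store error")
	}
	return nil
}

// reap attempts the backend delete first and removes the tracking
// entry only on success, so transient failures are retried later.
func (t *tracker) reap(key, domain string) {
	if err := t.deleteService(domain); err != nil {
		fmt.Printf("delete failed for %s, will retry: %v\n", domain, err)
		return // entry stays tracked for the next cycle
	}
	t.activeExposes.Delete(key)
}

func (t *tracker) tracked(key string) bool {
	_, ok := t.activeExposes.Load(key)
	return ok
}

func main() {
	t := &tracker{failDelete: true}
	t.activeExposes.Store("peer1/app", "app.example.com")
	t.reap("peer1/app", "app.example.com")
	fmt.Println("still tracked after failed delete:", t.tracked("peer1/app"))
	t.failDelete = false
	t.reap("peer1/app", "app.example.com")
	fmt.Println("tracked after successful delete:", t.tracked("peer1/app"))
}
```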
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@management/internals/modules/reverseproxy/manager/manager.go`:
- Around line 785-794: StopServiceFromPeer currently calls
m.exposeTracker.StopTrackedExpose(peerID, domain) before calling
m.deleteServiceFromPeer, which causes loss of tracking if deletion fails; change
the flow to only untrack after deleteServiceFromPeer succeeds, or if you must
untrack first then restore tracking on error by re-adding the track entry.
Concretely, update StopServiceFromPeer so that it calls
m.deleteServiceFromPeer(ctx, accountID, peerID, domain, false) first and only
then calls m.exposeTracker.StopTrackedExpose(peerID, domain) on success, or if
maintaining the current order, catch errors from deleteServiceFromPeer and call
the appropriate exposeTracker method to restore tracking (the inverse of
StopTrackedExpose) using the same peerID and domain before returning the error.
---
Outside diff comments:
In `@management/internals/modules/reverseproxy/manager/manager.go`:
- Around line 687-693: When omitting Domain (service.Domain / req.Domain) and
before calling buildRandomDomain, add a guard that checks whether clusterDeriver
is nil (or inside buildRandomDomain check its receiver) and return a clear error
like "no cluster domains configured" instead of allowing a panic from
GetClusterDomains(); update the code paths that call buildRandomDomain (e.g., in
the block that sets service.Domain) to propagate this error so callers get a
readable failure rather than a nil-pointer panic (refer to symbols:
clusterDeriver, buildRandomDomain, GetClusterDomains, service.Domain).
---
Nitpick comments:
In `@management/internals/modules/reverseproxy/manager/expose_tracker.go`:
- Around line 131-146: reapExpiredExposes currently removes the trackedExpose
from t.activeExposes before calling t.manager.deleteServiceFromPeer, which means
a transient delete error leaves the expose untracked; change the logic in
reapExpiredExposes so that you call deleteServiceFromPeer first and only call
t.activeExposes.Delete(key) after a successful delete, or if delete fails
reinsert or leave the trackedExpose in t.activeExposes (and clear expiring flag)
so it can be retried later; adjust use of trackedExpose.expiring and locking
around expose.mu to ensure consistency when retrying deletes and avoid races.
In `@management/internals/modules/reverseproxy/reverseproxy.go`:
- Around line 528-548: The ToService method concatenates serviceName + "." +
r.Domain without validating r.Domain, which can produce invalid FQDNs; update
the ExposeServiceRequest.Validate() method to reject or normalize bad domain
suffixes (e.g., disallow leading dots, empty labels, spaces, or characters not
allowed in DNS labels and ensure each label length and overall length are valid)
and return an error for invalid values; reference the ToService method and the
ExposeServiceRequest.Validate() function so callers only reach ToService with a
vetted r.Domain (or trim a single leading dot in Validate() if normalization is
preferred).
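On the suggested domain-suffix validation: the standard library's `net` package does not export an `IsDomainName` helper (it is unexported), so a manual label check is one way to implement it. This is a simplified sketch of basic DNS shape rules, not the project's actual validation policy.

```go
package main

import (
	"fmt"
	"strings"
)

// validDomainSuffix checks basic DNS shape: non-empty labels of
// letters/digits/hyphens, no leading or trailing hyphen per label,
// labels at most 63 chars, whole name at most 253 chars.
func validDomainSuffix(domain string) bool {
	if domain == "" || len(domain) > 253 {
		return false
	}
	for _, label := range strings.Split(domain, ".") {
		if label == "" || len(label) > 63 {
			return false // catches leading dots and empty labels
		}
		if label[0] == '-' || label[len(label)-1] == '-' {
			return false
		}
		for _, c := range label {
			isAlnum := (c >= 'a' && c <= 'z') || (c >= 'A' && c <= 'Z') || (c >= '0' && c <= '9')
			if !isAlnum && c != '-' {
				return false // spaces and other characters rejected
			}
		}
	}
	return true
}

func main() {
	for _, d := range []string{"example.com", ".example.com", "bad domain", "ok-1.internal"} {
		fmt.Printf("%q -> %v\n", d, validDomainSuffix(d))
	}
}
```

Running a check like this in `Validate()` ensures `ToService` only ever concatenates a vetted suffix.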
ℹ️ Review info
Configuration used: defaults
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (5)
- management/internals/modules/reverseproxy/manager/expose_tracker.go
- management/internals/modules/reverseproxy/manager/expose_tracker_test.go
- management/internals/modules/reverseproxy/manager/manager.go
- management/internals/modules/reverseproxy/manager/manager_test.go
- management/internals/modules/reverseproxy/reverseproxy.go
Actionable comments posted: 2
🧹 Nitpick comments (4)
management/internals/modules/reverseproxy/manager/manager.go (3)
751-754: Return `nil` instead of `[]string{}` alongside an error.
Per Go conventions, the non-error return value should be `nil` when returning an error; a non-nil empty slice alongside an error is misleading to callers.
♻️ Proposed fix

```diff
 if len(groupNames) == 0 {
-	return []string{}, fmt.Errorf("no group names provided")
+	return nil, fmt.Errorf("no group names provided")
 }
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@management/internals/modules/reverseproxy/manager/manager.go` around lines 751 - 754, In getGroupIDsFromNames (method on managerImpl) the function currently returns an empty non-nil slice ([]string{}) when it also returns an error; change this to return nil for the []string return value whenever an error is returned (for example in the early check for len(groupNames) == 0 and any other error paths) so callers receive a nil slice alongside the error per Go conventions.
716-732: Service is persisted before the tracker limit check; failure triggers an unnecessary write+delete cycle.
`persistNewService` (line 716) commits the service to the DB before `TrackExposeIfAllowed` (line 720). When the per-peer limit is exceeded or a duplicate is found, the service is immediately deleted via `deleteServiceFromPeer`. A misbehaving or retrying client can cause repeated write→delete cycles at the DB level.
The current ordering is architecturally necessary (domain uniqueness is enforced inside the transaction), but consider adding a pre-flight limit check (`CountPeerExposes`/`MaxExposesPerPeer`) before `persistNewService` as an early-exit optimisation that avoids the DB round-trip for the common "already at limit" case. The authoritative atomic check in `TrackExposeIfAllowed` would still be the source of truth.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@management/internals/modules/reverseproxy/manager/manager.go` around lines 716 - 732, persistNewService is called before TrackExposeIfAllowed, causing unnecessary DB write+delete when a peer is at its expose limit; add a pre-flight check using CountPeerExposes (or equivalent) against MaxExposesPerPeer before calling persistNewService and return the PreconditionFailed error early if the count >= max to avoid the round-trip; keep the existing TrackExposeIfAllowed call after persisting (it remains the authoritative atomic check) and retain the deleteServiceFromPeer cleanup for the rare race/duplicate case.
670-749: Ephemeral services in the DB are not recovered into the tracker after a server restart.
The `exposeTracker` is entirely in-memory. On restart it is empty. Any ephemeral services that were tracked before the restart remain in the DB but are invisible to the reaper. Consequences:
- Reap gap: stale ephemeral services are never cleaned up.
- Client reconnect conflict: if a peer reconnects and requests the same explicit domain, `checkDomainAvailable` returns `AlreadyExists`. For random domains a new domain is assigned and the old one silently leaks.
Consider either:
- A startup sweep that loads all `SourceEphemeral` services into the tracker (restoring their TTL countdown), or
- A startup sweep that bulk-deletes all `SourceEphemeral` services (clean-slate approach).
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@management/internals/modules/reverseproxy/manager/manager.go` around lines 670 - 749, Startup must repopulate the in-memory exposeTracker for ephemeral DB rows: add a startup sweep in the manager initialization (e.g., NewManager/Start method on managerImpl) that queries the store for all services with Source == reverseproxy.SourceEphemeral and for each service either re-track it via m.exposeTracker.TrackExposeIfAllowed(service.SourcePeer, service.Domain, service.AccountID) (restore TTL/counts) or, if you prefer the clean-slate approach, call m.deleteServiceFromPeer(ctx, service.AccountID, service.SourcePeer, service.Domain, false) to remove them; ensure the sweep uses m.store read methods to list ephemeral services, handles errors per-service (log and continue), and runs before accepting new CreateServiceFromPeer requests so CreateServiceFromPeer (and checkDomainAvailable/TrackExposeIfAllowed usage) sees a consistent tracker state.
management/internals/modules/reverseproxy/manager/expose_tracker.go (1)
131-150: Thread the reaper's context into `reapExpiredExposes` instead of using `context.Background()`.
`reapExpiredExposes` is invoked from the goroutine that respects the parent context, yet the deletion call inside it uses a detached `context.Background()`. If the manager shuts down, in-flight deletions won't propagate cancellation. Passing the context as a parameter is a small but cleaner improvement.
♻️ Proposed refactor

```diff
-func (t *exposeTracker) reapExpiredExposes() {
+func (t *exposeTracker) reapExpiredExposes(ctx context.Context) {
 	t.activeExposes.Range(func(key, val any) bool {
 		...
-		if err := t.manager.deleteServiceFromPeer(context.Background(), ...); err != nil {
+		if err := t.manager.deleteServiceFromPeer(ctx, ...); err != nil {
```

And in `StartExposeReaper`:

```diff
 case <-ticker.C:
-	t.reapExpiredExposes()
+	t.reapExpiredExposes(ctx)
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@management/internals/modules/reverseproxy/manager/expose_tracker.go` around lines 131 - 150, The reaper uses context.Background() inside reapExpiredExposes which prevents cancellation propagation; change reapExpiredExposes to accept a context (e.g., func (t *exposeTracker) reapExpiredExposes(ctx context.Context)) and update the caller (StartExposeReaper) to pass the parent/context it already respects, then use that ctx when calling t.manager.deleteServiceFromPeer(ctx, expose.accountID, expose.peerID, expose.domain, true) so in-flight deletions are cancelled when the manager shuts down.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@management/internals/modules/reverseproxy/manager/expose_tracker.go`:
- Around line 141-148: The reaper currently logs every error from
t.manager.deleteServiceFromPeer as an error even when the service was already
removed concurrently; update the expired-branch error handling to treat NotFound
as a no-op (or log at Debug) instead of Errorf: after calling
t.manager.deleteServiceFromPeer, if err != nil check errors.Is(err,
<store/manager NotFound sentinel>) (or use any existing helper like
t.manager.IsNotFound(err)); if it's a NotFound return log.Debugf("already
deleted ...") or skip logging, else keep the existing log.Errorf and
t.activeExposes.Delete(key) behavior for successful deletes. Ensure you
reference t.manager.deleteServiceFromPeer, the NotFound sentinel/helper, and
t.activeExposes.Delete when making the change.
In `@management/internals/modules/reverseproxy/manager/manager.go`:
- Around line 789-800: In StopServiceFromPeer, change the behavior when
exposeTracker.StopTrackedExpose(peerID, domain) returns false after a successful
deleteServiceFromPeer: instead of returning status.NotFound, log a warning
(including peerID and domain) that the tracker entry was missing after deletion
and then return nil; keep the existing error path when deleteServiceFromPeer
fails. This modifies managerImpl.StopServiceFromPeer to treat StopTrackedExpose
returning false as a non-fatal, unexpected condition (log context) rather than
an error, preserving correct semantics if the unreachable case ever occurs.
---
Nitpick comments:
In `@management/internals/modules/reverseproxy/manager/expose_tracker.go`:
- Around line 131-150: The reaper uses context.Background() inside
reapExpiredExposes which prevents cancellation propagation; change
reapExpiredExposes to accept a context (e.g., func (t *exposeTracker)
reapExpiredExposes(ctx context.Context)) and update the caller
(StartExposeReaper) to pass the parent/context it already respects, then use
that ctx when calling t.manager.deleteServiceFromPeer(ctx, expose.accountID,
expose.peerID, expose.domain, true) so in-flight deletions are cancelled when
the manager shuts down.
In `@management/internals/modules/reverseproxy/manager/manager.go`:
- Around line 751-754: In getGroupIDsFromNames (method on managerImpl) the
function currently returns an empty non-nil slice ([]string{}) when it also
returns an error; change this to return nil for the []string return value
whenever an error is returned (for example in the early check for
len(groupNames) == 0 and any other error paths) so callers receive a nil slice
alongside the error per Go conventions.
- Around line 716-732: persistNewService is called before TrackExposeIfAllowed,
causing unnecessary DB write+delete when a peer is at its expose limit; add a
pre-flight check using CountPeerExposes (or equivalent) against
MaxExposesPerPeer before calling persistNewService and return the
PreconditionFailed error early if the count >= max to avoid the round-trip; keep
the existing TrackExposeIfAllowed call after persisting (it remains the
authoritative atomic check) and retain the deleteServiceFromPeer cleanup for the
rare race/duplicate case.
- Around line 670-749: Startup must repopulate the in-memory exposeTracker for
ephemeral DB rows: add a startup sweep in the manager initialization (e.g.,
NewManager/Start method on managerImpl) that queries the store for all services
with Source == reverseproxy.SourceEphemeral and for each service either re-track
it via m.exposeTracker.TrackExposeIfAllowed(service.SourcePeer, service.Domain,
service.AccountID) (restore TTL/counts) or, if you prefer the clean-slate
approach, call m.deleteServiceFromPeer(ctx, service.AccountID,
service.SourcePeer, service.Domain, false) to remove them; ensure the sweep uses
m.store read methods to list ephemeral services, handles errors per-service (log
and continue), and runs before accepting new CreateServiceFromPeer requests so
CreateServiceFromPeer (and checkDomainAvailable/TrackExposeIfAllowed usage) sees
a consistent tracker state.
ℹ️ Review info
Configuration used: defaults
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (2)
- management/internals/modules/reverseproxy/manager/expose_tracker.go
- management/internals/modules/reverseproxy/manager/manager.go
- Improve error handling in expose session reaping logic.
- Log specific cases like already deleted services as debug.
- Ensure unhandled errors are logged as errors.
Actionable comments posted: 2
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@management/internals/modules/reverseproxy/manager/expose_tracker.go`:
- Around line 148-158: The call to status.FromError is unguarded and s may be
nil, causing a panic when accessing s.ErrorType; update the error handling in
the block around t.manager.deleteServiceFromPeer(...) so you first check err: if
err == nil delete the key via t.activeExposes.Delete(key), otherwise call s,ok
:= status.FromError(err) and only inspect s.ErrorType when ok is true (handle
status.NotFound there); for non-status errors fall through to the default
logging branch that logs the error for expose.domain. Ensure you reference
t.manager.deleteServiceFromPeer, status.FromError, s (or s,ok),
t.activeExposes.Delete and expose.domain when making the change.
In `@management/internals/modules/reverseproxy/manager/manager.go`:
- Around line 720-731: The code masks failures from m.deleteServiceFromPeer when
TrackExposeIfAllowed indicates a duplicate or limit exceed; if
deleteServiceFromPeer fails the service is left persisted but the function
returns AlreadyExists/PreconditionFailed. Update the branches that call
m.deleteServiceFromPeer (the block using m.exposeTracker.TrackExposeIfAllowed,
peerID, service.Domain, accountID, maxExposesPerPeer) to check the delete error
and return it (wrap as a meaningful gRPC status like Internal with context)
instead of swallowing it—only return AlreadyExists/PreconditionFailed when the
cleanup succeeded.
err := t.manager.deleteServiceFromPeer(context.Background(), expose.accountID, expose.peerID, expose.domain, true)

s, _ := status.FromError(err)

switch {
case err == nil:
	t.activeExposes.Delete(key)
case s.ErrorType == status.NotFound:
	log.Debugf("service %s was already deleted", expose.domain)
default:
	log.Errorf("failed to delete expired peer-exposed service for domain %s: %v", expose.domain, err)
Guard status.FromError to avoid nil deref panic.
Line 150 assigns s without checking ok. If deleteServiceFromPeer returns an error that is not a status error, s is nil and accessing s.ErrorType will panic, crashing the reaper goroutine.
🛠️ Suggested fix
- s, _ := status.FromError(err)
+ s, ok := status.FromError(err)
switch {
case err == nil:
t.activeExposes.Delete(key)
- case s.ErrorType == status.NotFound:
+ case ok && s != nil && s.ErrorType == status.NotFound:
log.Debugf("service %s was already deleted", expose.domain)
default:
log.Errorf("failed to delete expired peer-exposed service for domain %s: %v", expose.domain, err)
}

📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
err := t.manager.deleteServiceFromPeer(context.Background(), expose.accountID, expose.peerID, expose.domain, true)
s, ok := status.FromError(err)
switch {
case err == nil:
	t.activeExposes.Delete(key)
case ok && s != nil && s.ErrorType == status.NotFound:
	log.Debugf("service %s was already deleted", expose.domain)
default:
	log.Errorf("failed to delete expired peer-exposed service for domain %s: %v", expose.domain, err)
}
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@management/internals/modules/reverseproxy/manager/expose_tracker.go` around
lines 148 - 158, The call to status.FromError is unguarded and s may be nil,
causing a panic when accessing s.ErrorType; update the error handling in the
block around t.manager.deleteServiceFromPeer(...) so you first check err: if err
== nil delete the key via t.activeExposes.Delete(key), otherwise call s,ok :=
status.FromError(err) and only inspect s.ErrorType when ok is true (handle
status.NotFound there); for non-status errors fall through to the default
logging branch that logs the error for expose.domain. Ensure you reference
t.manager.deleteServiceFromPeer, status.FromError, s (or s,ok),
t.activeExposes.Delete and expose.domain when making the change.
Will fix in a new PR.
@mlsmaycon, understood! Would you like me to create a GitHub issue to track this nil pointer dereference fix for the follow-up PR? I can include the context and suggested fix to make it easier to address later.
alreadyTracked, allowed := m.exposeTracker.TrackExposeIfAllowed(peerID, service.Domain, accountID)
if alreadyTracked {
	if err := m.deleteServiceFromPeer(ctx, accountID, peerID, service.Domain, false); err != nil {
		log.WithContext(ctx).Debugf("failed to delete duplicate expose service for domain %s: %v", service.Domain, err)
	}
	return nil, status.Errorf(status.AlreadyExists, "peer already has an active expose session for this domain")
}
if !allowed {
	if err := m.deleteServiceFromPeer(ctx, accountID, peerID, service.Domain, false); err != nil {
		log.WithContext(ctx).Debugf("failed to delete service after limit exceeded for domain %s: %v", service.Domain, err)
	}
	return nil, status.Errorf(status.PreconditionFailed, "peer has reached the maximum number of active expose sessions (%d)", maxExposesPerPeer)
Don’t mask cleanup failures after duplicate/limit rejection.
If deleteServiceFromPeer fails, the newly persisted service remains but the caller receives AlreadyExists/PreconditionFailed, leaving an orphaned, untracked service. Surface the cleanup failure (and avoid claiming the service doesn’t exist).
🛠️ Suggested fix
if alreadyTracked {
if err := m.deleteServiceFromPeer(ctx, accountID, peerID, service.Domain, false); err != nil {
- log.WithContext(ctx).Debugf("failed to delete duplicate expose service for domain %s: %v", service.Domain, err)
+ log.WithContext(ctx).Errorf("failed to delete duplicate expose service for domain %s: %v", service.Domain, err)
+ return nil, status.Errorf(status.Internal, "cleanup failed for domain %s: %v", service.Domain, err)
}
return nil, status.Errorf(status.AlreadyExists, "peer already has an active expose session for this domain")
}
if !allowed {
if err := m.deleteServiceFromPeer(ctx, accountID, peerID, service.Domain, false); err != nil {
- log.WithContext(ctx).Debugf("failed to delete service after limit exceeded for domain %s: %v", service.Domain, err)
+ log.WithContext(ctx).Errorf("failed to delete service after limit exceeded for domain %s: %v", service.Domain, err)
+ return nil, status.Errorf(status.Internal, "cleanup failed for domain %s: %v", service.Domain, err)
}
return nil, status.Errorf(status.PreconditionFailed, "peer has reached the maximum number of active expose sessions (%d)", maxExposesPerPeer)
}

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@management/internals/modules/reverseproxy/manager/manager.go` around lines
720 - 731, The code masks failures from m.deleteServiceFromPeer when
TrackExposeIfAllowed indicates a duplicate or limit exceed; if
deleteServiceFromPeer fails the service is left persisted but the function
returns AlreadyExists/PreconditionFailed. Update the branches that call
m.deleteServiceFromPeer (the block using m.exposeTracker.TrackExposeIfAllowed,
peerID, service.Domain, accountID, maxExposesPerPeer) to check the delete error
and return it (wrap as a meaningful gRPC status like Internal with context)
instead of swallowing it—only return AlreadyExists/PreconditionFailed when the
cleanup succeeded.
Will refactor this in a new PR.
@mlsmaycon, understood! Would you like me to create a GitHub issue to track this refactoring work for the error handling in the expose service cleanup flow?
🧠 Learnings used
Learnt from: lixmal
Repo: netbirdio/netbird PR: 5202
File: client/internal/portforward/manager.go:219-222
Timestamp: 2026-01-30T14:43:13.106Z
Learning: In client/internal/portforward/manager.go, the state persistence pattern is designed for retry-on-failure: if DeletePortMapping fails in Stop(), the state is intentionally left intact (via early return) so that cleanupResidual() can attempt deletion again on the next startup. This is the intended behavior, not a bug.



Describe your changes
Consolidate all expose business logic (validation, permission checks, TTL tracking, reaping) into the manager layer, making the gRPC layer a pure transport adapter that only handles proto conversion and authentication.
Issue ticket number and link
Stack
Checklist
Documentation
Select exactly one:
Docs PR URL (required if "docs added" is checked)
Paste the PR link from https://github.com/netbirdio/docs here:
https://github.com/netbirdio/docs/pull/__
Summary by CodeRabbit
New Features
Tests
Chores