
[management] Refactor expose feature: move business logic from gRPC to manager #5435

Merged
mlsmaycon merged 5 commits into main from refactor/move-expose-renew-to-manager
Feb 24, 2026
Conversation

@mlsmaycon (Collaborator) commented Feb 24, 2026

Describe your changes

Consolidate all expose business logic (validation, permission checks, TTL tracking, reaping) into the manager layer, making the gRPC layer a pure transport adapter that only handles proto conversion and authentication.

  • Add ExposeServiceRequest/ExposeServiceResponse domain types with validation in the reverseproxy package
  • Move expose tracker (TTL tracking, reaping, per-peer limits) from gRPC server into manager/expose_tracker.go
  • Internalize tracking in CreateServiceFromPeer, RenewServiceFromPeer, and new StopServiceFromPeer so callers don't manage tracker state
  • Untrack ephemeral services in DeleteService/DeleteAllServices to keep tracker in sync when services are deleted via API
  • Simplify gRPC expose handlers to parse, auth, convert, delegate
  • Remove tracker methods from Manager interface (internal detail)

Issue ticket number and link

Stack

Checklist

  • Is it a bug fix
  • Is it a typo/documentation fix
  • Is it a feature enhancement
  • Is it a refactor
  • Created tests that fail without the change (if possible)

By submitting this pull request, you confirm that you have read and agree to the terms of the Contributor License Agreement.

Documentation

Select exactly one:

  • I added/updated documentation for this change
  • Documentation is not needed for this change (explain why)

Docs PR URL (required if "docs added" is checked)

Paste the PR link from https://github.com/netbirdio/docs here:

https://github.com/netbirdio/docs/pull/__

Summary by CodeRabbit

  • New Features

    • Peer-exposed services can be renewed or stopped on demand.
    • Background cleanup (reaper) automatically removes expired exposes.
    • Expose API now accepts validated structured requests and returns service name/URL/domain.
  • Tests

    • Expanded coverage for expose lifecycle, limits, expiry and concurrency.
  • Chores

    • Keep-alive interval extended from 10s to 30s.

coderabbitai bot (Contributor) commented Feb 24, 2026

📝 Walkthrough

Moves peer expose handling into an in-manager exposeTracker with TTL/reaper, replaces legacy peer APIs with request/response types (ExposeServiceRequest/ExposeServiceResponse), adds Renew/Stop/StartExposeReaper operations, updates callers/mocks/tests, and increases a client keep-alive interval from 10s to 30s.

Changes

| Cohort / File(s) | Summary |
| --- | --- |
| **Client Keep-Alive Configuration**<br>`client/internal/expose/manager.go` | `Manager.KeepAlive` ticker interval changed from 10s to 30s. |
| **Reverse Proxy Interface & Types**<br>`management/internals/modules/reverseproxy/interface.go`, `management/internals/modules/reverseproxy/reverseproxy.go` | Replaced old peer methods (removed `ValidateExposePermission`, `DeleteServiceFromPeer`, `ExpireServiceFromPeer`). `CreateServiceFromPeer` now takes `ExposeServiceRequest` and returns `ExposeServiceResponse`. Added `RenewServiceFromPeer`, `StopServiceFromPeer`, `StartExposeReaper`, and new request/response types plus validation/`ToService` helpers. |
| **Expose Tracker Component**<br>`management/internals/modules/reverseproxy/manager/expose_tracker.go`, `management/internals/modules/reverseproxy/manager/expose_tracker_test.go` | New internal `exposeTracker` using `sync.Map` and a per-expose mutex; TTL 90s, reap interval 30s, max 10 exposes per peer; provides Track/Untrack/Count/Renew/Stop and a background reaper, with unit tests. |
| **Reverse Proxy Manager Implementation**<br>`management/internals/modules/reverseproxy/manager/manager.go`, `management/internals/modules/reverseproxy/manager/manager_test.go` | Integrated `exposeTracker` into `managerImpl`; refactored `CreateServiceFromPeer` to validate the request, enforce per-peer limits, create ephemeral services, and return `ExposeServiceResponse`; added Renew/Stop and domain-based delete/lookup helpers; renamed `ValidateExposePermission` to `validateExposePermission`; updated tests. |
| **Mocks & Test Stubs**<br>`management/internals/modules/reverseproxy/interface_mock.go`, `management/internals/shared/grpc/proxy_group_access_test.go`, `management/internals/shared/grpc/validate_session_test.go`, `management/server/http/handlers/proxy/auth_callback_integration_test.go`, `proxy/management_integration_test.go` | Updated mocks and test stubs to the new signatures (`ExposeServiceRequest`/`ExposeServiceResponse`); removed `ValidateExposePermission`; renamed Delete to Renew and Expire to Stop; added `StartExposeReaper` no-ops. |
| **gRPC Expose Service Server**<br>`management/internals/shared/grpc/expose_service.go`, `management/internals/shared/grpc/expose_service_test.go`, `management/internals/shared/grpc/server.go` | Removed server-local expose state and reaping logic; `CreateExpose` now builds `ExposeServiceRequest` and delegates to the manager; introduced unified error mapping; deleted old expose tests and removed `Server.activeExposes`/`exposeCreateMu`. |
| **Server Boot Initialization**<br>`management/internals/server/boot.go` | Guarded the `StartExposeReaper` call with a nil-check on the reverse proxy manager before starting the reaper. |
| **Integration & Handler Tests**<br>`management/server/http/handlers/...`, `proxy/management_integration_test.go` | Adjusted integration/handler tests and test managers to the new request/response APIs, renamed/added methods, and `StartExposeReaper` stubs. |

Sequence Diagram(s)

```mermaid
sequenceDiagram
    participant Client
    participant GRPC_Server as "gRPC Server"
    participant Manager as "ReverseProxyManager"
    participant Tracker as "exposeTracker"
    participant Store

    Client->>GRPC_Server: CreateExpose(ExposeServiceRequest)
    GRPC_Server->>Manager: CreateServiceFromPeer(ctx, accountID, peerID, req)
    Manager->>Tracker: TrackExposeIfAllowed(peerID, domain, accountID)
    alt allowed
        Manager->>Store: Create ephemeral Service record
        Manager->>Tracker: register trackedExpose (lastRenewed)
        Manager-->>GRPC_Server: ExposeServiceResponse(domain, serviceName, serviceURL)
        GRPC_Server-->>Client: success
    else denied
        Manager-->>GRPC_Server: error
        GRPC_Server-->>Client: error
    end

    loop every 30s
        Tracker->>Tracker: reapExpiredExposes()
        Tracker->>Manager: deleteServiceFromPeer(accountID, peerID, domain, expired=true)
        Manager->>Store: delete Service
    end
```

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

Possibly related PRs

Suggested reviewers

  • pascal-fischer

Poem

🐰 Hopping through code with a twitch of my nose,
I track little exposes where wild TTL goes.
Requests now tidy, responses all neat,
The reaper hums softly — expires take a seat.
Carrots for tests, and logs that sing sweet.

🚥 Pre-merge checks (✅ 2 passed | ❌ 1 warning)

❌ Failed checks (1 warning)

| Check name | Status | Explanation | Resolution |
| --- | --- | --- | --- |
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 13.33%, below the required threshold of 80.00%. | Write docstrings for the functions that are missing them to satisfy the coverage threshold. |

✅ Passed checks (2 passed)

| Check name | Status | Explanation |
| --- | --- | --- |
| Title check | ✅ Passed | The title clearly describes the main refactoring effort, moving expose business logic from gRPC to the manager layer, which is the primary objective across all changed files. |
| Description check | ✅ Passed | The description gives comprehensive detail on the changes and objectives, completes the checklist with the refactor item marked, and explains why documentation is not needed. |


coderabbitai bot left a comment

Actionable comments posted: 3

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
management/internals/modules/reverseproxy/manager/manager.go (1)

679-729: ⚠️ Potential issue | 🔴 Critical

Race condition exists: concurrent requests can exceed peer expose limit.

CheckPeerExposeLimitWithLock acquires exposeCreateMu only for the check itself (lines 98-100 in expose_tracker.go) and releases it immediately. TrackExpose (line 704) runs outside this lock, after service persistence completes. Two concurrent requests can therefore both pass the limit check and then both track exposes, exceeding maxExposesPerPeer.

The alreadyTracked check at line 724 detects duplicate (peerID, domain) pairs but does not prevent limit overflow. Consolidate the check and track into a single atomic method that holds the lock throughout, or re-check the limit under the same lock before tracking.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@management/internals/modules/reverseproxy/manager/manager.go` around lines
679 - 729, The current flow checks the peer expose limit with
exposeTracker.CheckPeerExposeLimitWithLock and then later calls
exposeTracker.TrackExpose after persisting the service, allowing two concurrent
requests to both pass the check and exceed the limit; fix this by consolidating
the check+track into a single atomic operation under the same lock (or by
re-checking the limit while holding the lock immediately before tracking). Add a
new method on exposeTracker (e.g., CheckAndTrackExposeOrError(peerID, domain,
accountID) implemented using exposeCreateMu) that verifies limit and existing
domain and, if OK, records the expose atomically; call this new method in
manager.go instead of separate CheckPeerExposeLimitWithLock and TrackExpose (and
on alreadyTracked error keep the existing deleteServiceFromPeer + AlreadyExists
handling). Ensure expose_tracker.go is updated to use the same exposeCreateMu
and return clear status for "already tracked" vs "limit exceeded."
🧹 Nitpick comments (1)
management/internals/server/boot.go (1)

155-159: Use a lifecycle context for StartExposeReaper

Starting the reaper with context.Background() means it won’t stop on server shutdown or in tests. Consider passing a server lifecycle context (or storing a cancel) so the reaper can be stopped cleanly.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@management/internals/server/boot.go` around lines 155 - 159,
StartExposeReaper is being launched with context.Background(), so it won't stop
on server shutdown; change it to use the server lifecycle context or a
cancellable context tied to the server so the reaper can be stopped. Locate
ReverseProxyManager() and where srv.SetReverseProxyManager(reverseProxyMgr) is
called and replace the context.Background() passed to
reverseProxyMgr.StartExposeReaper with the server's lifecycle context (or create
a context.WithCancel stored on the server and call cancel on shutdown) so the
reaper is cancelled during server shutdown/tests.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@management/internals/modules/reverseproxy/manager/expose_tracker.go`:
- Around line 95-101: CheckPeerExposeLimitWithLock only locks while counting,
allowing a race where multiple goroutines pass the check and then call
TrackExpose concurrently; replace this with an atomic try-and-track operation:
add a new method (e.g., TryTrackPeerExpose or TrackExposeIfAllowed) that
acquires exposeCreateMu, calls CountPeerExposes(peerID) and if count <
maxExposesPerPeer calls existing TrackExpose logic (or increments the tracked
count) while still holding the lock, then returns success/failure. Update
callers (e.g., the TrackExpose call in manager.go) to use the new
TryTrackPeerExpose so the check and track happen under the same exposeCreateMu
lock.
- Around line 120-133: reapExpiredExposes can delete an expose that a concurrent
RenewTrackedExpose just refreshed; fix by adding an expiring guard on
trackedExpose (e.g., an atomic/boolean field named expiring) and use it as a CAS
under the same lock: in reapExpiredExposes, when you detect expiry acquire
expose.mu, check expiry, set expiring=true (or CAS true) before unlocking and
then proceed to LoadAndDelete and deletion; in RenewTrackedExpose, acquire
expose.mu or use CAS to refuse/observe renewals when expiring is true (return
false/error), so a renewal cannot succeed once the reaper has marked the expose
as expiring; update references to trackedExpose, reapExpiredExposes and
RenewTrackedExpose accordingly.

In `@management/internals/modules/reverseproxy/reverseproxy.go`:
- Around line 480-521: ExposeServiceRequest.Validate must guard against a nil
receiver and validate the port upper bound; update the Validate method to first
return an error if the receiver is nil (e.g., "request cannot be nil"), then
change the port check to require 1 <= Port <= 65535 (replace the r.Port <= 0
check with a range check and an appropriate error message); keep the existing
protocol, pin, user group and NamePrefix checks unchanged in Validate to
preserve current validations.
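The two fixes above (nil-receiver guard, full 1-65535 port range) can be sketched together as below. The request struct is trimmed to the fields these checks touch, and the NamePrefix pattern is an assumed stand-in modeled on the error message quoted in this review, not the actual NetBird regex.

```go
package main

import (
	"fmt"
	"regexp"
)

type ExposeServiceRequest struct {
	NamePrefix string
	Port       int
}

// Assumed pattern: lowercase alphanumeric with optional hyphens, 1-32 chars,
// no leading/trailing hyphen.
var validNamePrefix = regexp.MustCompile(`^[a-z0-9](?:[a-z0-9-]{0,30}[a-z0-9])?$`)

func (r *ExposeServiceRequest) Validate() error {
	if r == nil { // guard before touching any field
		return fmt.Errorf("request cannot be nil")
	}
	if r.Port < 1 || r.Port > 65535 { // rejects 0, negatives, and values above 65535
		return fmt.Errorf("invalid port %d: must be between 1 and 65535", r.Port)
	}
	if r.NamePrefix != "" && !validNamePrefix.MatchString(r.NamePrefix) {
		return fmt.Errorf("invalid name prefix %q", r.NamePrefix)
	}
	return nil
}

func main() {
	var nilReq *ExposeServiceRequest
	fmt.Println(nilReq.Validate() != nil) // true: nil receiver returns an error, not a panic
	fmt.Println((&ExposeServiceRequest{NamePrefix: "web", Port: 8080}).Validate() == nil) // true
	fmt.Println((&ExposeServiceRequest{Port: 70000}).Validate() != nil) // true: out of range
}
```

Calling a pointer-receiver method on a nil `*ExposeServiceRequest` is legal in Go as long as the method checks for nil before dereferencing, which is what makes this guard work.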


ℹ️ Review info

Configuration used: defaults

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between f8c0321 and 006664d.

📒 Files selected for processing (17)
  • client/internal/expose/manager.go
  • management/internals/modules/reverseproxy/interface.go
  • management/internals/modules/reverseproxy/interface_mock.go
  • management/internals/modules/reverseproxy/manager/expose_tracker.go
  • management/internals/modules/reverseproxy/manager/expose_tracker_test.go
  • management/internals/modules/reverseproxy/manager/manager.go
  • management/internals/modules/reverseproxy/manager/manager_test.go
  • management/internals/modules/reverseproxy/reverseproxy.go
  • management/internals/modules/reverseproxy/reverseproxy_test.go
  • management/internals/server/boot.go
  • management/internals/shared/grpc/expose_service.go
  • management/internals/shared/grpc/expose_service_test.go
  • management/internals/shared/grpc/proxy_group_access_test.go
  • management/internals/shared/grpc/server.go
  • management/internals/shared/grpc/validate_session_test.go
  • management/server/http/handlers/proxy/auth_callback_integration_test.go
  • proxy/management_integration_test.go
💤 Files with no reviewable changes (2)
  • management/internals/shared/grpc/server.go
  • management/internals/shared/grpc/expose_service_test.go

… tests

- Replace `TrackExpose` with `TrackExposeIfAllowed` for atomic limit checks and tracking.
- Add `expiring` flag to detect reaping in progress during renewal.
- Update `RenewTrackedExpose` to reject expiring entries.
- Refactor reaping to set `expiring` before deletion.
- Validate ports to be within range (1-65535) or fail gracefully.
- Enhance test coverage for limits, renewal, reaping, and validation edge cases.
coderabbitai bot left a comment

Actionable comments posted: 1

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
management/internals/modules/reverseproxy/manager/manager.go (1)

687-693: ⚠️ Potential issue | 🟠 Major

Guard against nil clusterDeriver when Domain is omitted.

If clusterDeriver is nil and req.Domain is empty, buildRandomDomain will panic when calling GetClusterDomains(). Add a guard to return a clear error when no cluster domains are configured.

🛡️ Suggested guard
 	if service.Domain == "" {
+		if m.clusterDeriver == nil {
+			return nil, status.Errorf(status.PreconditionFailed, "no cluster domains configured for peer expose")
+		}
 		domain, err := m.buildRandomDomain(service.Name)
 		if err != nil {
 			return nil, fmt.Errorf("build random domain for service %s: %w", service.Name, err)
 		}
 		service.Domain = domain
 	}
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@management/internals/modules/reverseproxy/manager/manager.go` around lines
687 - 693, When omitting Domain (service.Domain / req.Domain) and before calling
buildRandomDomain, add a guard that checks whether clusterDeriver is nil (or
inside buildRandomDomain check its receiver) and return a clear error like "no
cluster domains configured" instead of allowing a panic from
GetClusterDomains(); update the code paths that call buildRandomDomain (e.g., in
the block that sets service.Domain) to propagate this error so callers get a
readable failure rather than a nil-pointer panic (refer to symbols:
clusterDeriver, buildRandomDomain, GetClusterDomains, service.Domain).
🧹 Nitpick comments (2)
management/internals/modules/reverseproxy/reverseproxy.go (1)

528-548: Validate the custom domain suffix before concatenation.

When Domain is provided, ToService builds serviceName + "." + r.Domain; malformed suffixes (leading dot/invalid DNS) will generate an invalid FQDN and URL. Consider rejecting invalid suffixes in Validate().

♻️ Suggested validation (note: the standard library's net package does not export IsDomainName, so this sketch uses github.com/miekg/dns, which NetBird already depends on)
 func (r *ExposeServiceRequest) Validate() error {
 @@
 	if r.NamePrefix != "" && !validNamePrefix.MatchString(r.NamePrefix) {
 		return fmt.Errorf("invalid name prefix %q: must be lowercase alphanumeric with optional hyphens, 1-32 characters", r.NamePrefix)
 	}
+	if r.Domain != "" {
+		if _, ok := dns.IsDomainName(r.Domain); !ok {
+			return fmt.Errorf("invalid domain %q", r.Domain)
+		}
+	}
 
 	return nil
 }
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@management/internals/modules/reverseproxy/reverseproxy.go` around lines 528 -
548, The ToService method concatenates serviceName + "." + r.Domain without
validating r.Domain, which can produce invalid FQDNs; update the
ExposeServiceRequest.Validate() method to reject or normalize bad domain
suffixes (e.g., disallow leading dots, empty labels, spaces, or characters not
allowed in DNS labels and ensure each label length and overall length are valid)
and return an error for invalid values; reference the ToService method and the
ExposeServiceRequest.Validate() function so callers only reach ToService with a
vetted r.Domain (or trim a single leading dot in Validate() if normalization is
preferred).
management/internals/modules/reverseproxy/manager/expose_tracker.go (1)

131-146: Consider retrying reaper deletes on transient failures.

The tracking entry is removed before deleteServiceFromPeer completes. If deletion fails, the expired service becomes untracked and won’t be retried. Consider deleting only after a successful cleanup (or reinsert on error).

♻️ Possible ordering tweak
 		if expired {
-			t.activeExposes.Delete(key)
-			log.Infof("reaping expired expose session for peer %s, domain %s", expose.peerID, expose.domain)
-			if err := t.manager.deleteServiceFromPeer(context.Background(), expose.accountID, expose.peerID, expose.domain, true); err != nil {
-				log.Errorf("failed to delete expired peer-exposed service for domain %s: %v", expose.domain, err)
-			}
+			if err := t.manager.deleteServiceFromPeer(context.Background(), expose.accountID, expose.peerID, expose.domain, true); err != nil {
+				log.Errorf("failed to delete expired peer-exposed service for domain %s: %v", expose.domain, err)
+				return true
+			}
+			t.activeExposes.Delete(key)
+			log.Infof("reaping expired expose session for peer %s, domain %s", expose.peerID, expose.domain)
 		}
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@management/internals/modules/reverseproxy/manager/expose_tracker.go` around
lines 131 - 146, reapExpiredExposes currently removes the trackedExpose from
t.activeExposes before calling t.manager.deleteServiceFromPeer, which means a
transient delete error leaves the expose untracked; change the logic in
reapExpiredExposes so that you call deleteServiceFromPeer first and only call
t.activeExposes.Delete(key) after a successful delete, or if delete fails
reinsert or leave the trackedExpose in t.activeExposes (and clear expiring flag)
so it can be retried later; adjust use of trackedExpose.expiring and locking
around expose.mu to ensure consistency when retrying deletes and avoid races.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@management/internals/modules/reverseproxy/manager/manager.go`:
- Around line 785-794: StopServiceFromPeer currently calls
m.exposeTracker.StopTrackedExpose(peerID, domain) before calling
m.deleteServiceFromPeer, which causes loss of tracking if deletion fails; change
the flow to only untrack after deleteServiceFromPeer succeeds, or if you must
untrack first then restore tracking on error by re-adding the track entry.
Concretely, update StopServiceFromPeer so that it calls
m.deleteServiceFromPeer(ctx, accountID, peerID, domain, false) first and only
then calls m.exposeTracker.StopTrackedExpose(peerID, domain) on success, or if
maintaining the current order, catch errors from deleteServiceFromPeer and call
the appropriate exposeTracker method to restore tracking (the inverse of
StopTrackedExpose) using the same peerID and domain before returning the error.


ℹ️ Review info

Configuration used: defaults

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 006664d and 36b7078.

📒 Files selected for processing (5)
  • management/internals/modules/reverseproxy/manager/expose_tracker.go
  • management/internals/modules/reverseproxy/manager/expose_tracker_test.go
  • management/internals/modules/reverseproxy/manager/manager.go
  • management/internals/modules/reverseproxy/manager/manager_test.go
  • management/internals/modules/reverseproxy/reverseproxy.go

Comment thread management/internals/modules/reverseproxy/manager/manager.go
coderabbitai bot left a comment

Actionable comments posted: 2

🧹 Nitpick comments (4)
management/internals/modules/reverseproxy/manager/manager.go (3)

751-754: Return nil instead of []string{} alongside an error.

Per Go conventions, the non-error return value should be nil when returning an error; a non-nil empty slice alongside an error is misleading to callers.

♻️ Proposed fix
 	if len(groupNames) == 0 {
-		return []string{}, fmt.Errorf("no group names provided")
+		return nil, fmt.Errorf("no group names provided")
 	}
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@management/internals/modules/reverseproxy/manager/manager.go` around lines
751 - 754, In getGroupIDsFromNames (method on managerImpl) the function
currently returns an empty non-nil slice ([]string{}) when it also returns an
error; change this to return nil for the []string return value whenever an error
is returned (for example in the early check for len(groupNames) == 0 and any
other error paths) so callers receive a nil slice alongside the error per Go
conventions.

716-732: Service is persisted before the tracker limit check; failure triggers an unnecessary write+delete cycle.

persistNewService (line 716) commits the service to the DB before TrackExposeIfAllowed (line 720). When the per-peer limit is exceeded or a duplicate is found, the service is immediately deleted via deleteServiceFromPeer. A misbehaving or retrying client can cause repeated write→delete cycles at DB level.

The current ordering is architecturally necessary (domain uniqueness is enforced inside the transaction), but consider adding a pre-flight limit check (CountPeerExposes / MaxExposesPerPeer) before persistNewService as an early-exit optimisation that avoids the DB round-trip for the common "already at limit" case. The authoritative atomic check in TrackExposeIfAllowed would still be the source of truth.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@management/internals/modules/reverseproxy/manager/manager.go` around lines
716 - 732, persistNewService is called before TrackExposeIfAllowed, causing
unnecessary DB write+delete when a peer is at its expose limit; add a pre-flight
check using CountPeerExposes (or equivalent) against MaxExposesPerPeer before
calling persistNewService and return the PreconditionFailed error early if the
count >= max to avoid the round-trip; keep the existing TrackExposeIfAllowed
call after persisting (it remains the authoritative atomic check) and retain the
deleteServiceFromPeer cleanup for the rare race/duplicate case.

670-749: Ephemeral services in the DB are not recovered into the tracker after a server restart.

The exposeTracker is entirely in-memory. On restart it is empty. Any ephemeral services that were tracked before the restart remain in the DB but are invisible to the reaper. Consequences:

  • Reap gap: Stale ephemeral services are never cleaned up.
  • Client reconnect conflict: If a peer reconnects and requests the same explicit domain, checkDomainAvailable returns AlreadyExists. For random domains a new domain is assigned and the old one silently leaks.

Consider either:

  1. A startup sweep that loads all SourceEphemeral services into the tracker (restoring their TTL countdown), or
  2. A startup sweep that bulk-deletes all SourceEphemeral services (clean-slate approach).
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@management/internals/modules/reverseproxy/manager/manager.go` around lines
670 - 749, Startup must repopulate the in-memory exposeTracker for ephemeral DB
rows: add a startup sweep in the manager initialization (e.g., NewManager/Start
method on managerImpl) that queries the store for all services with Source ==
reverseproxy.SourceEphemeral and for each service either re-track it via
m.exposeTracker.TrackExposeIfAllowed(service.SourcePeer, service.Domain,
service.AccountID) (restore TTL/counts) or, if you prefer the clean-slate
approach, call m.deleteServiceFromPeer(ctx, service.AccountID,
service.SourcePeer, service.Domain, false) to remove them; ensure the sweep uses
m.store read methods to list ephemeral services, handles errors per-service (log
and continue), and runs before accepting new CreateServiceFromPeer requests so
CreateServiceFromPeer (and checkDomainAvailable/TrackExposeIfAllowed usage) sees
a consistent tracker state.
management/internals/modules/reverseproxy/manager/expose_tracker.go (1)

131-150: Thread the reaper's context into reapExpiredExposes instead of using context.Background().

reapExpiredExposes is invoked from the goroutine that respects the parent context, yet the deletion call inside it uses a detached context.Background(). If the manager shuts down, in-flight deletions won't see the cancellation. Passing the context as a parameter is a small but worthwhile cleanup.

♻️ Proposed refactor
-func (t *exposeTracker) reapExpiredExposes() {
+func (t *exposeTracker) reapExpiredExposes(ctx context.Context) {
 	t.activeExposes.Range(func(key, val any) bool {
 		...
-			if err := t.manager.deleteServiceFromPeer(context.Background(), ...); err != nil {
+			if err := t.manager.deleteServiceFromPeer(ctx, ...); err != nil {

And in StartExposeReaper:

 		case <-ticker.C:
-			t.reapExpiredExposes()
+			t.reapExpiredExposes(ctx)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@management/internals/modules/reverseproxy/manager/expose_tracker.go` around
lines 131 - 150, The reaper uses context.Background() inside reapExpiredExposes
which prevents cancellation propagation; change reapExpiredExposes to accept a
context (e.g., func (t *exposeTracker) reapExpiredExposes(ctx context.Context))
and update the caller (StartExposeReaper) to pass the parent/context it already
respects, then use that ctx when calling t.manager.deleteServiceFromPeer(ctx,
expose.accountID, expose.peerID, expose.domain, true) so in-flight deletions are
cancelled when the manager shuts down.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@management/internals/modules/reverseproxy/manager/expose_tracker.go`:
- Around line 141-148: The reaper currently logs every error from
t.manager.deleteServiceFromPeer as an error even when the service was already
removed concurrently; update the expired-branch error handling to treat NotFound
as a no-op (or log at Debug) instead of Errorf: after calling
t.manager.deleteServiceFromPeer, if err != nil check errors.Is(err,
<store/manager NotFound sentinel>) (or use any existing helper like
t.manager.IsNotFound(err)); if it's a NotFound return log.Debugf("already
deleted ...") or skip logging, else keep the existing log.Errorf and
t.activeExposes.Delete(key) behavior for successful deletes. Ensure you
reference t.manager.deleteServiceFromPeer, the NotFound sentinel/helper, and
t.activeExposes.Delete when making the change.

In `@management/internals/modules/reverseproxy/manager/manager.go`:
- Around line 789-800: In StopServiceFromPeer, change the behavior when
exposeTracker.StopTrackedExpose(peerID, domain) returns false after a successful
deleteServiceFromPeer: instead of returning status.NotFound, log a warning
(including peerID and domain) that the tracker entry was missing after deletion
and then return nil; keep the existing error path when deleteServiceFromPeer
fails. This modifies managerImpl.StopServiceFromPeer to treat StopTrackedExpose
returning false as a non-fatal, unexpected condition (log context) rather than
an error, preserving correct semantics if the unreachable case ever occurs.

---

Nitpick comments:

In `@management/internals/modules/reverseproxy/manager/manager.go`:
- Around line 751-754: In getGroupIDsFromNames (method on managerImpl) the
function currently returns an empty non-nil slice ([]string{}) when it also
returns an error; change this to return nil for the []string return value
whenever an error is returned (for example in the early check for
len(groupNames) == 0 and any other error paths) so callers receive a nil slice
alongside the error per Go conventions.
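A small sketch of the convention, with a map standing in for the real group lookup (the production function resolves names via the store):

```go
package main

import (
	"errors"
	"fmt"
)

// groupsByName is an illustrative stand-in for the store lookup.
var groupsByName = map[string]string{"admins": "grp-1"}

// getGroupIDsFromNames returns a nil slice, not []string{}, on every error
// path, per Go conventions.
func getGroupIDsFromNames(names []string) ([]string, error) {
	if len(names) == 0 {
		return nil, errors.New("no group names provided") // nil, not []string{}
	}
	ids := make([]string, 0, len(names))
	for _, n := range names {
		id, ok := groupsByName[n]
		if !ok {
			return nil, fmt.Errorf("group %q not found", n)
		}
		ids = append(ids, id)
	}
	return ids, nil
}

func main() {
	ids, err := getGroupIDsFromNames(nil)
	fmt.Println(ids == nil, err != nil) // prints "true true"
}
```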

ℹ️ Review info

Configuration used: defaults

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 36b7078 and aff7ad7.

📒 Files selected for processing (2)
  • management/internals/modules/reverseproxy/manager/expose_tracker.go
  • management/internals/modules/reverseproxy/manager/manager.go

- Improve error handling in expose session reaping logic.
- Log specific cases like already deleted services as debug.
- Ensure unhandled errors are logged as errors.
pascal-fischer
pascal-fischer previously approved these changes Feb 24, 2026
Contributor

@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 2

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@management/internals/modules/reverseproxy/manager/expose_tracker.go`:
- Around line 148-158: The call to status.FromError is unguarded and s may be
nil, causing a panic when accessing s.ErrorType; update the error handling in
the block around t.manager.deleteServiceFromPeer(...) so you first check err: if
err == nil delete the key via t.activeExposes.Delete(key), otherwise call s,ok
:= status.FromError(err) and only inspect s.ErrorType when ok is true (handle
status.NotFound there); for non-status errors fall through to the default
logging branch that logs the error for expose.domain. Ensure you reference
t.manager.deleteServiceFromPeer, status.FromError, s (or s,ok),
t.activeExposes.Delete and expose.domain when making the change.

In `@management/internals/modules/reverseproxy/manager/manager.go`:
- Around line 720-731: The code masks failures from m.deleteServiceFromPeer when
TrackExposeIfAllowed indicates a duplicate or limit exceed; if
deleteServiceFromPeer fails the service is left persisted but the function
returns AlreadyExists/PreconditionFailed. Update the branches that call
m.deleteServiceFromPeer (the block using m.exposeTracker.TrackExposeIfAllowed,
peerID, service.Domain, accountID, maxExposesPerPeer) to check the delete error
and return it (wrap as a meaningful gRPC status like Internal with context)
instead of swallowing it—only return AlreadyExists/PreconditionFailed when the
cleanup succeeded.

ℹ️ Review info

Configuration used: defaults

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between aff7ad7 and 3818cf6.

📒 Files selected for processing (2)
  • management/internals/modules/reverseproxy/manager/expose_tracker.go
  • management/internals/modules/reverseproxy/manager/manager.go

Comment on lines +148 to +158
		err := t.manager.deleteServiceFromPeer(context.Background(), expose.accountID, expose.peerID, expose.domain, true)

		s, _ := status.FromError(err)

		switch {
		case err == nil:
			t.activeExposes.Delete(key)
		case s.ErrorType == status.NotFound:
			log.Debugf("service %s was already deleted", expose.domain)
		default:
			log.Errorf("failed to delete expired peer-exposed service for domain %s: %v", expose.domain, err)
Contributor

@coderabbitai coderabbitai Bot Feb 24, 2026


⚠️ Potential issue | 🟠 Major

Guard status.FromError to avoid nil deref panic.

Line 150 assigns s without checking ok. If deleteServiceFromPeer returns an error that is not a *status.Error, s is nil and accessing s.ErrorType will panic, crashing the reaper goroutine.

🛠️ Suggested fix
-		s, _ := status.FromError(err)
+		s, ok := status.FromError(err)

 		switch {
 		case err == nil:
 			t.activeExposes.Delete(key)
-		case s.ErrorType == status.NotFound:
+		case ok && s != nil && s.ErrorType == status.NotFound:
 			log.Debugf("service %s was already deleted", expose.domain)
 		default:
 			log.Errorf("failed to delete expired peer-exposed service for domain %s: %v", expose.domain, err)
 		}
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
 		err := t.manager.deleteServiceFromPeer(context.Background(), expose.accountID, expose.peerID, expose.domain, true)
-		s, _ := status.FromError(err)
+		s, ok := status.FromError(err)
 		switch {
 		case err == nil:
 			t.activeExposes.Delete(key)
-		case s.ErrorType == status.NotFound:
+		case ok && s != nil && s.ErrorType == status.NotFound:
 			log.Debugf("service %s was already deleted", expose.domain)
 		default:
 			log.Errorf("failed to delete expired peer-exposed service for domain %s: %v", expose.domain, err)
 		}
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@management/internals/modules/reverseproxy/manager/expose_tracker.go` around
lines 148 - 158, The call to status.FromError is unguarded and s may be nil,
causing a panic when accessing s.ErrorType; update the error handling in the
block around t.manager.deleteServiceFromPeer(...) so you first check err: if err
== nil delete the key via t.activeExposes.Delete(key), otherwise call s,ok :=
status.FromError(err) and only inspect s.ErrorType when ok is true (handle
status.NotFound there); for non-status errors fall through to the default
logging branch that logs the error for expose.domain. Ensure you reference
t.manager.deleteServiceFromPeer, status.FromError, s (or s,ok),
t.activeExposes.Delete and expose.domain when making the change.

Collaborator Author


Will fix in a new PR.

Contributor


@mlsmaycon, understood! Would you like me to create a GitHub issue to track this nil pointer dereference fix for the follow-up PR? I can include the context and suggested fix to make it easier to address later.

Comment on lines +720 to +731
	alreadyTracked, allowed := m.exposeTracker.TrackExposeIfAllowed(peerID, service.Domain, accountID)
	if alreadyTracked {
		if err := m.deleteServiceFromPeer(ctx, accountID, peerID, service.Domain, false); err != nil {
			log.WithContext(ctx).Debugf("failed to delete duplicate expose service for domain %s: %v", service.Domain, err)
		}
		return nil, status.Errorf(status.AlreadyExists, "peer already has an active expose session for this domain")
	}
	if !allowed {
		if err := m.deleteServiceFromPeer(ctx, accountID, peerID, service.Domain, false); err != nil {
			log.WithContext(ctx).Debugf("failed to delete service after limit exceeded for domain %s: %v", service.Domain, err)
		}
		return nil, status.Errorf(status.PreconditionFailed, "peer has reached the maximum number of active expose sessions (%d)", maxExposesPerPeer)
Contributor

@coderabbitai coderabbitai Bot Feb 24, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major

Don’t mask cleanup failures after duplicate/limit rejection.

If deleteServiceFromPeer fails, the newly persisted service remains but the caller receives AlreadyExists/PreconditionFailed, leaving an orphaned, untracked service. Surface the cleanup failure (and avoid claiming the service doesn’t exist).

🛠️ Suggested fix
 	if alreadyTracked {
 		if err := m.deleteServiceFromPeer(ctx, accountID, peerID, service.Domain, false); err != nil {
-			log.WithContext(ctx).Debugf("failed to delete duplicate expose service for domain %s: %v", service.Domain, err)
+			log.WithContext(ctx).Errorf("failed to delete duplicate expose service for domain %s: %v", service.Domain, err)
+			return nil, status.Errorf(status.Internal, "cleanup failed for domain %s: %v", service.Domain, err)
 		}
 		return nil, status.Errorf(status.AlreadyExists, "peer already has an active expose session for this domain")
 	}
 	if !allowed {
 		if err := m.deleteServiceFromPeer(ctx, accountID, peerID, service.Domain, false); err != nil {
-			log.WithContext(ctx).Debugf("failed to delete service after limit exceeded for domain %s: %v", service.Domain, err)
+			log.WithContext(ctx).Errorf("failed to delete service after limit exceeded for domain %s: %v", service.Domain, err)
+			return nil, status.Errorf(status.Internal, "cleanup failed for domain %s: %v", service.Domain, err)
 		}
 		return nil, status.Errorf(status.PreconditionFailed, "peer has reached the maximum number of active expose sessions (%d)", maxExposesPerPeer)
 	}
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@management/internals/modules/reverseproxy/manager/manager.go` around lines
720 - 731, The code masks failures from m.deleteServiceFromPeer when
TrackExposeIfAllowed indicates a duplicate or limit exceed; if
deleteServiceFromPeer fails the service is left persisted but the function
returns AlreadyExists/PreconditionFailed. Update the branches that call
m.deleteServiceFromPeer (the block using m.exposeTracker.TrackExposeIfAllowed,
peerID, service.Domain, accountID, maxExposesPerPeer) to check the delete error
and return it (wrap as a meaningful gRPC status like Internal with context)
instead of swallowing it—only return AlreadyExists/PreconditionFailed when the
cleanup succeeded.

Collaborator Author


Will refactor this in a new PR.

Contributor


@mlsmaycon, understood! Would you like me to create a GitHub issue to track this refactoring work for the error handling in the expose service cleanup flow?


🧠 Learnings used
Learnt from: lixmal
Repo: netbirdio/netbird PR: 5202
File: client/internal/portforward/manager.go:219-222
Timestamp: 2026-01-30T14:43:13.106Z
Learning: In client/internal/portforward/manager.go, the state persistence pattern is designed for retry-on-failure: if DeletePortMapping fails in Stop(), the state is intentionally left intact (via early return) so that cleanupResidual() can attempt deletion again on the next startup. This is the intended behavior, not a bug.

@mlsmaycon mlsmaycon merged commit 3271428 into main Feb 24, 2026
25 checks passed
@mlsmaycon mlsmaycon deleted the refactor/move-expose-renew-to-manager branch February 24, 2026 14:09