Skip to content

loadbalancingexporter: add endpoint active probes#52

Merged
amir-jakoby merged 10 commits into
mainfrom
amiri/saw-7374-lb-reachability
May 2, 2026
Merged

loadbalancingexporter: add endpoint active probes#52
amir-jakoby merged 10 commits into
mainfrom
amiri/saw-7374-lb-reachability

Conversation

@amir-jakoby

@amir-jakoby amir-jakoby commented May 2, 2026

Copy link
Copy Markdown

loadbalancingexporter: Endpoint active probes

Adds active TCP reachability probes for endpoint_health so LB pods can exclude per-pod-unreachable backends before customer traffic hits them. SAW-7374.

Description

  • Adds endpoint_health.active_probe config, validation, jittered probe loop, fall/rise transitions, and shutdown handling.
  • Feeds probe unhealthy/healthy transitions through existing backend state/quarantine/fail-open metrics.
  • Updates README, schema, sample config, and regression tests for exclusion, recovery, fail-open, and validation.

Tests

  • go test ./... (from exporter/loadbalancingexporter)

Note

Medium Risk
Adds a new background probe loop that can dynamically remove/re-add backends and rebuild the routing ring, so misconfiguration or edge cases could impact traffic routing and exporter lifecycle. Feature is disabled by default, but when enabled it affects availability behavior and shutdown timing.

Overview
Adds an optional active reachability probe under endpoint_health so each loadbalancingexporter pod can proactively exclude locally unreachable backends (TCP connect) and later re-admit them after configurable fall/rise thresholds, while preserving existing fail-open behavior when all backends are unhealthy.

Introduces new config/schema/defaults and validation for endpoint_health.active_probe (type, interval/timeout, jitter %, max concurrency, fall/rise), integrates probe-driven state into the existing quarantine/eligibility + metrics flow (reason="active_probe"), and adds a jittered, concurrency-limited probe loop that starts with the exporter and stops on shutdown, with expanded unit tests and docs/sample config/changelog updates.

Reviewed by Cursor Bugbot for commit e397d88. Bugbot is set up for automated code reviews on this repo. Configure here.


Summary by cubic

Adds optional TCP active probes to loadbalancingexporter under endpoint_health to exclude locally unreachable backends before routing, addressing SAW-7374. Recovery is tightened: probe success never overrides an active transport quarantine; readmission requires rise successes and the quarantine to expire, and export success/expiry won’t readmit while probes still mark the backend unhealthy; metrics preserve the original transport reason when applicable.

  • New Features

    • endpoint_health.active_probe config with validation: type: tcp_connect, interval, timeout < interval, jitter (0–100%, non-finite rejected), max_concurrency, fall, rise (requires endpoint_health.enabled=true).
    • Jittered, concurrency-limited TCP-connect probe loop per LB pod; waits before first cycle, starts with the exporter, and stops cleanly on shutdown.
    • Probe failures/successes drive quarantine/readmit and rebuild the routing ring; fail-open when all endpoints are locally unhealthy.
    • Metrics reuse: transitions update backend_state, backend_quarantine_total, backend_unquarantine_total, and backend_fail_open_total, with reason reflecting the source (active_probe vs transport); README/schema/sample config updated; tests cover validation (incl. NaN/Inf jitter), exclusion/recovery, fail-open, concurrency, first-interval wait, reason preservation, and shutdown.
  • Migration

    • Disabled by default; no change unless enabled.
    • Enable by setting endpoint_health.enabled=true and configuring endpoint_health.active_probe (default type: tcp_connect).
    • Keep using forwarding_health.backend_usability for readiness; exporter remains fail-open by default.

Written for commit e397d88. Summary will update on new commits.

Summary by CodeRabbit

  • New Features

    • Added optional active TCP probing for load balancer endpoint health checks with configurable intervals, timeouts, and recovery thresholds to safely exclude and recover unreachable backends.
  • Documentation

    • Added design specifications and README documentation for active probe configuration and behavior.
  • Tests

    • Added comprehensive test coverage for active probe health checking, endpoint exclusion/recovery, and concurrency handling.

@cubic-dev-ai cubic-dev-ai Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

3 issues found across 11 files

Confidence score: 3/5

  • There is a concrete regression risk in exporter/loadbalancingexporter/endpoint_health.go: normal export success can clear probeUnhealthy without meeting probe rise recovery, which may bypass intended hysteresis and allow premature reintegration.
  • exporter/loadbalancingexporter/endpoint_health.go also appears to keep expired quarantines as the next expiry while probeUnhealthy is true, so quarantineRefreshDue() may stay true and cause refresh checks on every routing call.
  • exporter/loadbalancingexporter/loadbalancer.go likely misses startup spreading because probes run immediately on pod start; jitter only affects later cycles, increasing thundering-herd risk at initialization.
  • Pay close attention to exporter/loadbalancingexporter/endpoint_health.go and exporter/loadbalancingexporter/loadbalancer.go - recovery gating, quarantine refresh behavior, and first-cycle probe jitter are the main risk areas.
Prompt for AI agents (unresolved issues)

Check if these issues are valid — if so, understand the root cause of each and fix them. If appropriate, use sub-agents to investigate and fix each issue separately.


<file name="exporter/loadbalancingexporter/endpoint_health.go">

<violation number="1" location="exporter/loadbalancingexporter/endpoint_health.go:269">
P1: According to linked Linear issue SAW-7374, active-probe recovery should require probe `rise` successes. This change lets any normal export success clear `probeUnhealthy`, which bypasses probe hysteresis and can reintroduce flapping endpoints too early.</violation>

<violation number="2" location="exporter/loadbalancingexporter/endpoint_health.go:454">
P2: Expired quarantines are still tracked as the next expiry when `probeUnhealthy` is true, so `quarantineRefreshDue()` can remain true indefinitely and trigger refresh checks on every routing call.</violation>
</file>

<file name="exporter/loadbalancingexporter/loadbalancer.go">

<violation number="1" location="exporter/loadbalancingexporter/loadbalancer.go:803">
P2: Add an initial jittered delay before the first active-probe cycle. As written, every LB pod probes all endpoints immediately on startup, so the jitter only affects later cycles.

According to linked Linear issue SAW-7374, active probes should add jitter so LB pods do not probe all endpoints in lockstep.</violation>
</file>

Reply with feedback, questions, or to request a fix. Tag @cubic-dev-ai to re-run a review, or fix all with cubic.

Comment thread exporter/loadbalancingexporter/endpoint_health.go Outdated
Comment thread exporter/loadbalancingexporter/endpoint_health.go Outdated
Comment thread exporter/loadbalancingexporter/loadbalancer.go Outdated

@sawmills-architect-review sawmills-architect-review Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Outcome: REQUEST_CHANGES.

What I checked: exporter/loadbalancingexporter owns the endpoint-health routing and active-probe lifecycle; I reviewed config.go, endpoint_health.go, loadbalancer.go, config_test.go, endpoint_health_test.go, and loadbalancer_test.go on the current head.

Verification: go test ./exporter/loadbalancingexporter/... was blocked by an unrelated github.com/omnition/scribe-go@v1.0.0 fetch failure in the broader module graph.

Comment thread exporter/loadbalancingexporter/endpoint_health.go Outdated

@cubic-dev-ai cubic-dev-ai Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

0 issues found across 1 file (changes from recent commits).

Requires human review: Auto-approval blocked by 3 unresolved issues from previous reviews.

Comment thread exporter/loadbalancingexporter/endpoint_health.go

@sawmills-architect-review sawmills-architect-review Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Outcome: APPROVE.

What I checked: exporter/loadbalancingexporter owns the endpoint-health state machine and active-probe loop; I reviewed endpoint_health.go, loadbalancer.go, config.go, config_test.go, endpoint_health_test.go, and loadbalancer_test.go on the current head.

Verification: go test ./... in /tmp/otel-pr52/exporter/loadbalancingexporter passed.

Ownership check: exporter/loadbalancingexporter owns the endpoint-health contract; I checked the state machine, probe loop, config validation, and package tests.

@cubic-dev-ai cubic-dev-ai Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

0 issues found across 4 files (changes from recent commits).

Requires human review: This PR adds a complex active probing sub-system to the load balancer, including new goroutines, concurrency management, and state machine transitions that directly impact routing logic.

@cursor cursor Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Cursor Bugbot has reviewed your changes and found 1 potential issue.

Fix All in Cursor

❌ Bugbot Autofix is OFF. To automatically fix reported issues with cloud agents, enable autofix in the Cursor dashboard.

Reviewed by Cursor Bugbot for commit 839069e. Configure here.

Comment thread exporter/loadbalancingexporter/endpoint_health.go

@sawmills-architect-review sawmills-architect-review Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Outcome: APPROVE.
What I checked: exporter/loadbalancingexporter owns the endpoint-health state machine and active-probe loop; I checked endpoint_health.go, loadbalancer.go, config.go, config_test.go, endpoint_health_test.go, and loadbalancer_test.go on the current head.
Verification: cd /tmp/otel-pr52/exporter/loadbalancingexporter && go test ./...
Ownership check: exporter/loadbalancingexporter owns the endpoint-health contract; I checked the state machine, probe loop, config, and package tests.

@cubic-dev-ai cubic-dev-ai Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

0 issues found across 2 files (changes from recent commits).

Requires human review: Significant addition of background network probing and concurrent routing logic to a core component. Impacts telemetry routing and requires human review for reliability and edge cases.

@sawmills-architect-review sawmills-architect-review Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Outcome: REQUEST_CHANGES.

What I checked: exporter/loadbalancingexporter owns the active-probe state machine and routing metrics; I reviewed endpoint_health.go and loadbalancer.go on commit eb62d70.

Verification: go test ./exporter/loadbalancingexporter/... from the repo root failed before package execution on an unrelated github.com/omnition/scribe-go@v1.0.0 fetch.

Comment thread exporter/loadbalancingexporter/endpoint_health.go

@sawmills-architect-review sawmills-architect-review Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[ARCH-REVIEW] Re-review: APPROVE.

What I checked: exporter/loadbalancingexporter owns the endpoint-health state machine and active-probe loop; I reviewed endpoint_health.go, loadbalancer.go, config.go, config_test.go, and loadbalancer_test.go on the current head.

Verification: cd /tmp/otel-pr52/exporter/loadbalancingexporter && go test ./...

Ownership check: exporter/loadbalancingexporter owns the endpoint-health contract; I checked the state machine, probe loop, config validation, and package tests.

@cubic-dev-ai cubic-dev-ai Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

0 issues found across 5 files (changes from recent commits).

Requires human review: Adds complex concurrency and networking logic for active health probing. This affects core routing decisions and requires human verification of its impact on system stability.

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sanitized review metadata.

Comment thread exporter/loadbalancingexporter/config_test.go
Comment thread exporter/loadbalancingexporter/endpoint_health.go Outdated
Comment thread exporter/loadbalancingexporter/loadbalancer_test.go

@sawmills-architect-review sawmills-architect-review Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Outcome: APPROVE.

What I checked: exporter/loadbalancingexporter owns the endpoint-health state machine and active-probe loop; I reviewed endpoint_health.go, loadbalancer.go, config_test.go, and loadbalancer_test.go on the current head.

Verification: cd /Users/amirjakoby/Code/sawmills-collector-contrib/exporter/loadbalancingexporter && go test ./...

Ownership check: exporter/loadbalancingexporter owns the endpoint-health contract; I checked the state machine, probe loop, config validation, and package tests.

@cubic-dev-ai cubic-dev-ai Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

0 issues found across 4 files (changes from recent commits).

Requires human review: This PR introduces significant new background logic and an active probe loop that affects core traffic routing and load balancing, which requires careful human review.

@amir-jakoby amir-jakoby merged commit 2b1e97b into main May 2, 2026
197 checks passed
@amir-jakoby amir-jakoby deleted the amiri/saw-7374-lb-reachability branch May 2, 2026 04:10
@Sawmills Sawmills deleted a comment from coderabbitai Bot May 25, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant