loadbalancingexporter: add endpoint active probes#52
Conversation
Assisted-by: OpenAI GPT-5
Assisted-by: OpenAI GPT-5
Assisted-by: OpenAI GPT-5
There was a problem hiding this comment.
3 issues found across 11 files
Confidence score: 3/5
- There is a concrete regression risk in
exporter/loadbalancingexporter/endpoint_health.go: normal export success can clearprobeUnhealthywithout meeting proberiserecovery, which may bypass intended hysteresis and allow premature reintegration. exporter/loadbalancingexporter/endpoint_health.goalso appears to keep expired quarantines as the next expiry whileprobeUnhealthyis true, soquarantineRefreshDue()may stay true and cause refresh checks on every routing call.exporter/loadbalancingexporter/loadbalancer.golikely misses startup spreading because probes run immediately on pod start; jitter only affects later cycles, increasing thundering-herd risk at initialization.- Pay close attention to
exporter/loadbalancingexporter/endpoint_health.goandexporter/loadbalancingexporter/loadbalancer.go- recovery gating, quarantine refresh behavior, and first-cycle probe jitter are the main risk areas.
Prompt for AI agents (unresolved issues)
Check if these issues are valid — if so, understand the root cause of each and fix them. If appropriate, use sub-agents to investigate and fix each issue separately.
<file name="exporter/loadbalancingexporter/endpoint_health.go">
<violation number="1" location="exporter/loadbalancingexporter/endpoint_health.go:269">
P1: According to linked Linear issue SAW-7374, active-probe recovery should require probe `rise` successes. This change lets any normal export success clear `probeUnhealthy`, which bypasses probe hysteresis and can reintroduce flapping endpoints too early.</violation>
<violation number="2" location="exporter/loadbalancingexporter/endpoint_health.go:454">
P2: Expired quarantines are still tracked as the next expiry when `probeUnhealthy` is true, so `quarantineRefreshDue()` can remain true indefinitely and trigger refresh checks on every routing call.</violation>
</file>
<file name="exporter/loadbalancingexporter/loadbalancer.go">
<violation number="1" location="exporter/loadbalancingexporter/loadbalancer.go:803">
P2: Add an initial jittered delay before the first active-probe cycle. As written, every LB pod probes all endpoints immediately on startup, so the jitter only affects later cycles.
According to linked Linear issue SAW-7374, active probes should add jitter so LB pods do not probe all endpoints in lockstep.</violation>
</file>
Reply with feedback, questions, or to request a fix. Tag @cubic-dev-ai to re-run a review, or fix all with cubic.
There was a problem hiding this comment.
Outcome: REQUEST_CHANGES.
What I checked: exporter/loadbalancingexporter owns the endpoint-health routing and active-probe lifecycle; I reviewed config.go, endpoint_health.go, loadbalancer.go, config_test.go, endpoint_health_test.go, and loadbalancer_test.go on the current head.
Verification: go test ./exporter/loadbalancingexporter/... was blocked by an unrelated github.com/omnition/scribe-go@v1.0.0 fetch failure in the broader module graph.
Assisted-by: GPT-5 Codex
There was a problem hiding this comment.
Outcome: APPROVE.
What I checked: exporter/loadbalancingexporter owns the endpoint-health state machine and active-probe loop; I reviewed endpoint_health.go, loadbalancer.go, config.go, config_test.go, endpoint_health_test.go, and loadbalancer_test.go on the current head.
Verification: go test ./... in /tmp/otel-pr52/exporter/loadbalancingexporter passed.
Ownership check: exporter/loadbalancingexporter owns the endpoint-health contract; I checked the state machine, probe loop, config validation, and package tests.
There was a problem hiding this comment.
0 issues found across 4 files (changes from recent commits).
Requires human review: This PR adds a complex active probing sub-system to the load balancer, including new goroutines, concurrency management, and state machine transitions that directly impact routing logic.
There was a problem hiding this comment.
Cursor Bugbot has reviewed your changes and found 1 potential issue.
❌ Bugbot Autofix is OFF. To automatically fix reported issues with cloud agents, enable autofix in the Cursor dashboard.
Reviewed by Cursor Bugbot for commit 839069e. Configure here.
Assisted-by: GPT-5 Codex
There was a problem hiding this comment.
Outcome: APPROVE.
What I checked: exporter/loadbalancingexporter owns the endpoint-health state machine and active-probe loop; I checked endpoint_health.go, loadbalancer.go, config.go, config_test.go, endpoint_health_test.go, and loadbalancer_test.go on the current head.
Verification: cd /tmp/otel-pr52/exporter/loadbalancingexporter && go test ./...
Ownership check: exporter/loadbalancingexporter owns the endpoint-health contract; I checked the state machine, probe loop, config, and package tests.
Assisted-by: GPT-5 Codex
There was a problem hiding this comment.
0 issues found across 2 files (changes from recent commits).
Requires human review: Significant addition of background network probing and concurrent routing logic to a core component. Impacts telemetry routing and requires human review for reliability and edge cases.
There was a problem hiding this comment.
Outcome: REQUEST_CHANGES.
What I checked: exporter/loadbalancingexporter owns the active-probe state machine and routing metrics; I reviewed endpoint_health.go and loadbalancer.go on commit eb62d70.
Verification: go test ./exporter/loadbalancingexporter/... from the repo root failed before package execution on an unrelated github.com/omnition/scribe-go@v1.0.0 fetch.
…antine Assisted-by: GPT-5 Codex
There was a problem hiding this comment.
[ARCH-REVIEW] Re-review: APPROVE.
What I checked: exporter/loadbalancingexporter owns the endpoint-health state machine and active-probe loop; I reviewed endpoint_health.go, loadbalancer.go, config.go, config_test.go, and loadbalancer_test.go on the current head.
Verification: cd /tmp/otel-pr52/exporter/loadbalancingexporter && go test ./...
Ownership check: exporter/loadbalancingexporter owns the endpoint-health contract; I checked the state machine, probe loop, config validation, and package tests.
There was a problem hiding this comment.
0 issues found across 5 files (changes from recent commits).
Requires human review: Adds complex concurrency and networking logic for active health probing. This affects core routing decisions and requires human verification of its impact on system stability.
Assisted-by: GPT-5 Codex
There was a problem hiding this comment.
Outcome: APPROVE.
What I checked: exporter/loadbalancingexporter owns the endpoint-health state machine and active-probe loop; I reviewed endpoint_health.go, loadbalancer.go, config_test.go, and loadbalancer_test.go on the current head.
Verification: cd /Users/amirjakoby/Code/sawmills-collector-contrib/exporter/loadbalancingexporter && go test ./...
Ownership check: exporter/loadbalancingexporter owns the endpoint-health contract; I checked the state machine, probe loop, config validation, and package tests.

loadbalancingexporter: Endpoint active probes
Adds active TCP reachability probes for endpoint_health so LB pods can exclude per-pod-unreachable backends before customer traffic hits them. SAW-7374.
Description
Tests
Note
Medium Risk
Adds a new background probe loop that can dynamically remove/re-add backends and rebuild the routing ring, so misconfiguration or edge cases could impact traffic routing and exporter lifecycle. Feature is disabled by default, but when enabled it affects availability behavior and shutdown timing.
Overview
Adds an optional active reachability probe under
endpoint_healthso eachloadbalancingexporterpod can proactively exclude locally unreachable backends (TCP connect) and later re-admit them after configurablefall/risethresholds, while preserving existing fail-open behavior when all backends are unhealthy.Introduces new config/schema/defaults and validation for
endpoint_health.active_probe(type, interval/timeout, jitter %, max concurrency, fall/rise), integrates probe-driven state into the existing quarantine/eligibility + metrics flow (reason="active_probe"), and adds a jittered, concurrency-limited probe loop that starts with the exporter and stops on shutdown, with expanded unit tests and docs/sample config/changelog updates.Reviewed by Cursor Bugbot for commit e397d88. Bugbot is set up for automated code reviews on this repo. Configure here.
Summary by cubic
Adds optional TCP active probes to
loadbalancingexporterunderendpoint_healthto exclude locally unreachable backends before routing, addressing SAW-7374. Recovery is tightened: probe success never overrides an active transport quarantine; readmission requiresrisesuccesses and the quarantine to expire, and export success/expiry won’t readmit while probes still mark the backend unhealthy; metrics preserve the original transport reason when applicable.New Features
endpoint_health.active_probeconfig with validation:type: tcp_connect,interval,timeout < interval,jitter(0–100%, non-finite rejected),max_concurrency,fall,rise(requiresendpoint_health.enabled=true).backend_state,backend_quarantine_total,backend_unquarantine_total, andbackend_fail_open_total, with reason reflecting the source (active_probevs transport); README/schema/sample config updated; tests cover validation (incl. NaN/Inf jitter), exclusion/recovery, fail-open, concurrency, first-interval wait, reason preservation, and shutdown.Migration
endpoint_health.enabled=trueand configuringendpoint_health.active_probe(defaulttype: tcp_connect).forwarding_health.backend_usabilityfor readiness; exporter remains fail-open by default.Written for commit e397d88. Summary will update on new commits.
Summary by CodeRabbit
New Features
Documentation
Tests