[receiver/prometheus] retry connections to targetallocator #40982
Signed-off-by: Charlie Le <charlie_le@apple.com>
Should the target allocator just add a readiness probe to ensure this doesn't happen? It looks like it can only cause collectors to fail to start, not to crash if the collector is already started:
The target allocator could have a readiness probe set up, but the collector will still try to connect to the service's ClusterIP even if the target allocator is not ready yet, and it will see the same error. Here's how to reproduce: Then when the collector starts up, it will crash loop since no target allocator is ready (it was configured to have 0 replicas).
It seems like the collector could crash here at startup if the initial sync fails: When the collector crashes, its logs look like this:
Needs a |
…metry#40982)

#### Description

The collector will crash if the target allocator that it is receiving its scrape configuration from is still starting up. This change adds retry logic to give the target allocator more time to become ready since the collector and target allocator are usually deployed at the same time by the otel operator.

---------

Signed-off-by: Charlie Le <charlie_le@apple.com>
This PR upgrades the otelcollector to the latest version available for the opentelemetry-collector and opentelemetry-operator. It was automatically generated by the GitHub Actions workflow. The summary of the OSS changelog is below:

# Prometheusreceiver Changes

## v0.127.0 to v0.131.0

Generated on: 2025-08-04 17:36:24

---

### v0.131.0

- [**FEATURE**] `prometheusreceiver`: Add retry logic for connection refused errors so the collector doesn't crash at startup. ([#40982](open-telemetry/opentelemetry-collector-contrib#40982))
  This change adds retry logic for connection refused errors. The target allocator could be busy starting up the receiver and the first connection attempt may fail.
- [**FEATURE**] `receiver/prometheus`: Add support for otel_scope_schema_url label mapping to OpenTelemetry ScopeMetrics schema URL field ([#41488](open-telemetry/opentelemetry-collector-contrib#41488))
- [**FEATURE**] `receiver/prometheusremotewrite`: Add support for Native Histogram Custom Buckets (NHCB). ([#41043](open-telemetry/opentelemetry-collector-contrib#41043))
- [**BUG FIX**] `receiver/prometheus`: Fix otel_scope_name and otel_scope_version labels not being dropped from metric attributes ([#41456](open-telemetry/opentelemetry-collector-contrib#41456))

### v0.130.0

- [**BUG FIX**] `receiver/prometheusreceiver`: Fixes masking of authentication credentials in Prometheus receiver, when reloading the Prometheus config. ([#40520](open-telemetry/opentelemetry-collector-contrib#40520), [#40916](open-telemetry/opentelemetry-collector-contrib#40916))
- [**BUG FIX**] `receiver/prometheusremotewrite`: Handle metrics with unspecified types without panicking. ([#41005](open-telemetry/opentelemetry-collector-contrib#41005))

### v0.129.0

- [**FEATURE**] `prometheusreceiver`: Promote the receiver.prometheusreceiver.RemoveLegacyResourceAttributes featuregate to stable ([#40572](open-telemetry/opentelemetry-collector-contrib#40572))
  It has been beta since v0.126.0
- [**BUG FIX**] `prometheusreceiver`: Fix invalid metric name validation error in scrape start from target allocator. ([#35459](open-telemetry/opentelemetry-collector-contrib#35459), [#40788](open-telemetry/opentelemetry-collector-contrib#40788))
  Prometheus made setting metric_name_validation_scheme, metric_name_escaping_scheme mandatory, use sane defaults.

## Summary

| Category | Count |
|----------|-------|
| Breaking Changes | 0 |
| Features | 4 |
| Bug Fixes | 4 |
| Other Changes | 0 |
| **Total** | **8** |

# Target-allocator Changes

## v0.127.0 to v0.131.0

Generated on: 2025-08-04 17:36:38

---

### 0.131.0

- [**FEATURE**] `manager, target-allocator, opamp-bridge, must-gather`: add -trimpath when building binaries ([#4078](open-telemetry/opentelemetry-operator#4078))
- [**FEATURE**] `collector, target allocator, opamp`: Require Go 1.24+ to build the collector, target allocator, and opamp. ([#4173](open-telemetry/opentelemetry-operator#4173))
- [**BUG FIX**] `target allocator`: check CRD availability before registering informers ([#3987](open-telemetry/opentelemetry-operator#3987))
- [**BUG FIX**] `target allocator`: Allow collector to use TLS Config from Target Allocator with ScrapeConfig ([#3724](open-telemetry/opentelemetry-operator#3724))
  This change allows the target allocator to configure TLS Config for a collector using the ScrapeConfig.

### 0.129.1

- [**BREAKING**] `targetallocator, collector`: Remove stable feature gate PrometheusOperatorIsAvailable ([#4141](open-telemetry/opentelemetry-operator#4141))
- [**FEATURE**] `target allocator`: Adds support for HTML output in the target allocator. ([#3622](open-telemetry/opentelemetry-operator#3622))
- [**BUG FIX**] `target allocator`: ensure stable iteration order of target labels when generating hash ([#4082](open-telemetry/opentelemetry-operator#4082))
- [**BUG FIX**] `target allocator`: Fix OpenShift must-gather for Target Allocator ([#4084](open-telemetry/opentelemetry-operator#4084))

## Summary

| Category | Count |
|----------|-------|
| Breaking Changes | 1 |
| Features | 3 |
| Bug Fixes | 4 |
| Other Changes | 0 |
| **Total** | **8** |

---------

Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Co-authored-by: Grace Wehner <grace.wehner@microsoft.com>
Co-authored-by: rashmichandrashekar <rashmy@microsoft.com>
Description
The collector will crash if the target allocator that it is receiving its scrape configuration from is still starting up. This change adds retry logic to give the target allocator more time to become ready since the collector and target allocator are usually deployed at the same time by the otel operator.