[receiver/prometheus] retry connections to targetallocator #40982

Merged
atoulme merged 18 commits into open-telemetry:main from CharlieTLe:prom-receiver-retry-connection-refused on Jul 16, 2025

Conversation

@CharlieTLe
Contributor

Description

The collector will crash if the target allocator that it is receiving its scrape configuration from is still starting up. This change adds retry logic to give the target allocator more time to become ready since the collector and target allocator are usually deployed at the same time by the otel operator.

Signed-off-by: Charlie Le <charlie_le@apple.com>
@CharlieTLe CharlieTLe requested review from a team, ArthurSens and dashpole as code owners June 29, 2025 23:42
@github-actions github-actions Bot added the receiver/prometheus Prometheus receiver label Jun 29, 2025
@github-actions github-actions Bot requested review from Aneurysm9 and krajorama June 29, 2025 23:43
@dashpole
Contributor

Should the target allocator just add a readiness probe to ensure this doesn't happen?

It looks like it can only cause collectors to fail to start, not to crash if the collector is already started:

@CharlieTLe
Contributor Author

CharlieTLe commented Jun 30, 2025

> Should the target allocator just add a readiness probe to ensure this doesn't happen?

The target allocator could have a readiness probe set up, but the collector will still try to connect to the service's ClusterIP even if the target allocator is not ready yet, and it will see the same error.

Here's how to reproduce:

apiVersion: opentelemetry.io/v1beta1
kind: OpenTelemetryCollector
metadata:
  name: simplest
spec:
  mode: statefulset
  config:
    receivers: 
      prometheus:
        config:
          scrape_configs: []
    service:
      pipelines:
        metrics:
          receivers: [prometheus]
          exporters: [debug]
    exporters:
      debug: {}
  replicas: 1
  targetAllocator:
    enabled: true
    replicas: 0
    prometheusCR:
      enabled: true

Then when the collector starts up, it will crash loop since no target allocator is ready (it was configured to have 0 replicas).

> It looks like it can only cause collectors to fail to start, not to crash if the collector is already started:

It seems like the collector could crash here at startup if the initial sync fails:

savedHash, err := m.sync(uint64(0), httpClient)
if err != nil {
	return err
}

When the collector crashes, its logs look like this:

2025-06-29T23:25:23.911Z    info    service@v0.128.1-0.20250628171447-c6cd1aeb58b7/service.go:197    Setting up own telemetry...    {"resource": {"service.instance.id": "89dca754-b50b-4207-98b9-4052ef337b34", "service.name": "otelcontribcol", "service.version": "0.128.0-dev"}}
2025-06-29T23:25:23.912Z    info    service@v0.128.1-0.20250628171447-c6cd1aeb58b7/service.go:257    Starting otelcontribcol...    {"resource": {"service.instance.id": "89dca754-b50b-4207-98b9-4052ef337b34", "service.name": "otelcontribcol", "service.version": "0.128.0-dev"}, "Version": "0.128.0-dev", "NumCPU": 8}
2025-06-29T23:25:23.912Z    info    extensions/extensions.go:41    Starting extensions...    {"resource": {"service.instance.id": "89dca754-b50b-4207-98b9-4052ef337b34", "service.name": "otelcontribcol", "service.version": "0.128.0-dev"}}
2025-06-29T23:25:23.912Z    info    prometheusreceiver@v0.128.0/metrics_receiver.go:157    Starting discovery manager    {"resource": {"service.instance.id": "89dca754-b50b-4207-98b9-4052ef337b34", "service.name": "otelcontribcol", "service.version": "0.128.0-dev"}, "otelcol.component.id": "prometheus", "otelcol.component.kind": "receiver", "otelcol.signal": "metrics"}
2025-06-29T23:25:23.912Z    info    targetallocator/manager.go:70    Starting target allocator discovery    {"resource": {"service.instance.id": "89dca754-b50b-4207-98b9-4052ef337b34", "service.name": "otelcontribcol", "service.version": "0.128.0-dev"}, "otelcol.component.id": "prometheus", "otelcol.component.kind": "receiver", "otelcol.signal": "metrics"}
2025-06-29T23:26:25.275Z    error    targetallocator/manager.go:107    Failed to retrieve job list    {"resource": {"service.instance.id": "89dca754-b50b-4207-98b9-4052ef337b34", "service.name": "otelcontribcol", "service.version": "0.128.0-dev"}, "otelcol.component.id": "prometheus", "otelcol.component.kind": "receiver", "otelcol.signal": "metrics", "error": "failed to connect to target allocator after 1m0s, last error: Get \"http://simplest-targetallocator:80/scrape_configs\": dial tcp 10.96.206.86:80: connect: connection refused"}
github.com/open-telemetry/opentelemetry-collector-contrib/receiver/prometheusreceiver/targetallocator.(*Manager).sync
    github.com/open-telemetry/opentelemetry-collector-contrib/receiver/prometheusreceiver@v0.128.0/targetallocator/manager.go:107
github.com/open-telemetry/opentelemetry-collector-contrib/receiver/prometheusreceiver/targetallocator.(*Manager).Start
    github.com/open-telemetry/opentelemetry-collector-contrib/receiver/prometheusreceiver@v0.128.0/targetallocator/manager.go:72
github.com/open-telemetry/opentelemetry-collector-contrib/receiver/prometheusreceiver.(*pReceiver).Start
    github.com/open-telemetry/opentelemetry-collector-contrib/receiver/prometheusreceiver@v0.128.0/metrics_receiver.go:121
go.opentelemetry.io/collector/service/internal/graph.(*Graph).StartAll
    go.opentelemetry.io/collector/service@v0.128.1-0.20250628171447-c6cd1aeb58b7/internal/graph/graph.go:431
go.opentelemetry.io/collector/service.(*Service).Start
    go.opentelemetry.io/collector/service@v0.128.1-0.20250628171447-c6cd1aeb58b7/service.go:272
go.opentelemetry.io/collector/otelcol.(*Collector).setupConfigurationComponents
    go.opentelemetry.io/collector/otelcol@v0.128.1-0.20250628171447-c6cd1aeb58b7/collector.go:242
go.opentelemetry.io/collector/otelcol.(*Collector).Run
    go.opentelemetry.io/collector/otelcol@v0.128.1-0.20250628171447-c6cd1aeb58b7/collector.go:312
go.opentelemetry.io/collector/otelcol.NewCommand.func1
    go.opentelemetry.io/collector/otelcol@v0.128.1-0.20250628171447-c6cd1aeb58b7/command.go:39
github.com/spf13/cobra.(*Command).execute
    github.com/spf13/cobra@v1.9.1/command.go:1015
github.com/spf13/cobra.(*Command).ExecuteC
    github.com/spf13/cobra@v1.9.1/command.go:1148
github.com/spf13/cobra.(*Command).Execute
    github.com/spf13/cobra@v1.9.1/command.go:1071
main.runInteractive
    github.com/open-telemetry/opentelemetry-collector-contrib/cmd/otelcontribcol/main.go:70
main.run
    github.com/open-telemetry/opentelemetry-collector-contrib/cmd/otelcontribcol/main_others.go:10
main.main
    github.com/open-telemetry/opentelemetry-collector-contrib/cmd/otelcontribcol/main.go:63
runtime.main
    runtime/proc.go:272
2025-06-29T23:26:25.276Z    error    graph/graph.go:438    Failed to start component    {"resource": {"service.instance.id": "89dca754-b50b-4207-98b9-4052ef337b34", "service.name": "otelcontribcol", "service.version": "0.128.0-dev"}, "error": "failed to connect to target allocator after 1m0s, last error: Get \"http://simplest-targetallocator:80/scrape_configs\": dial tcp 10.96.206.86:80: connect: connection refused", "type": "Receiver", "id": "prometheus"}
2025-06-29T23:26:25.276Z    info    service@v0.128.1-0.20250628171447-c6cd1aeb58b7/service.go:322    Starting shutdown...    {"resource": {"service.instance.id": "89dca754-b50b-4207-98b9-4052ef337b34", "service.name": "otelcontribcol", "service.version": "0.128.0-dev"}}
2025-06-29T23:26:25.277Z    info    extensions/extensions.go:69    Stopping extensions...    {"resource": {"service.instance.id": "89dca754-b50b-4207-98b9-4052ef337b34", "service.name": "otelcontribcol", "service.version": "0.128.0-dev"}}
2025-06-29T23:26:25.277Z    info    service@v0.128.1-0.20250628171447-c6cd1aeb58b7/service.go:336    Shutdown complete.    {"resource": {"service.instance.id": "89dca754-b50b-4207-98b9-4052ef337b34", "service.name": "otelcontribcol", "service.version": "0.128.0-dev"}}
Error: cannot start pipelines: failed to start "prometheus" receiver: failed to connect to target allocator after 1m0s, last error: Get "http://simplest-targetallocator:80/scrape_configs": dial tcp 10.96.206.86:80: connect: connection refused
2025/06/29 23:26:25 collector server run finished with error: cannot start pipelines: failed to start "prometheus" receiver: failed to connect to target allocator after 1m0s, last error: Get "http://simplest-targetallocator:80/scrape_configs": dial tcp 10.96.206.86:80: connect: connection refused

Comment thread receiver/prometheusreceiver/targetallocator/manager.go Outdated
Comment thread receiver/prometheusreceiver/targetallocator/manager.go Outdated
@CharlieTLe CharlieTLe requested a review from dashpole July 2, 2025 16:54
@atoulme
Contributor

atoulme commented Jul 14, 2025

Needs a make tidy

@ArthurSens ArthurSens added the ready to merge Code review completed; ready to merge by maintainers label Jul 16, 2025
@atoulme atoulme merged commit b8dda6e into open-telemetry:main Jul 16, 2025
187 checks passed
@github-actions github-actions Bot added this to the next release milestone Jul 16, 2025
Dylan-M pushed a commit to Dylan-M/opentelemetry-collector-contrib that referenced this pull request Aug 5, 2025
…metry#40982)

#### Description

The collector will crash if the target allocator that it is receiving
its scrape configuration from is still starting up. This change adds
retry logic to give the target allocator more time to become ready since
the collector and target allocator are usually deployed at the same time
by the otel operator.

---------

Signed-off-by: Charlie Le <charlie_le@apple.com>
gracewehner added a commit to Azure/prometheus-collector that referenced this pull request Aug 13, 2025
This PR upgrades the otelcollector to the latest version available for
the opentelemetry-collector and opentelemetry-operator.

It was automatically generated by the GitHub Actions workflow.

The summary of the OSS changelog is below:
# Prometheusreceiver Changes
## v0.127.0 to v0.131.0

Generated on: 2025-08-04 17:36:24

---

### v0.131.0
- [**FEATURE**] `prometheusreceiver`: Add retry logic for connection
refused errors so the collector doesn't crash at startup.
([#40982](open-telemetry/opentelemetry-collector-contrib#40982))
This change adds retry logic for connection refused errors. The target
allocator could be busy starting up the receiver and the first
connection attempt may fail.
- [**FEATURE**] `receiver/prometheus`: Add support for
otel_scope_schema_url label mapping to OpenTelemetry ScopeMetrics schema
URL field
([#41488](open-telemetry/opentelemetry-collector-contrib#41488))
- [**FEATURE**] `receiver/prometheusremotewrite`: Add support for Native
Histogram Custom Buckets (NHCB).
([#41043](open-telemetry/opentelemetry-collector-contrib#41043))
- [**BUG FIX**] `receiver/prometheus`: Fix otel_scope_name and
otel_scope_version labels not being dropped from metric attributes
([#41456](open-telemetry/opentelemetry-collector-contrib#41456))
### v0.130.0
- [**BUG FIX**] `receiver/prometheusreceiver`: Fixes masking of
authentication credentials in Prometheus receiver, when reloading the
Prometheus config.
([#40520](open-telemetry/opentelemetry-collector-contrib#40520),
[#40916](open-telemetry/opentelemetry-collector-contrib#40916))
- [**BUG FIX**] `receiver/prometheusremotewrite`: Handle metrics with
unspecified types without panicking.
([#41005](open-telemetry/opentelemetry-collector-contrib#41005))
### v0.129.0
- [**FEATURE**] `prometheusreceiver`: Promote the
receiver.prometheusreceiver.RemoveLegacyResourceAttributes featuregate
to stable
([#40572](open-telemetry/opentelemetry-collector-contrib#40572))
It has been beta since v0.126.0
- [**BUG FIX**] `prometheusreceiver`: Fix invalid metric name validation
error in scrape start from target allocator.
([#35459](open-telemetry/opentelemetry-collector-contrib#35459),
[#40788](open-telemetry/opentelemetry-collector-contrib#40788))
Prometheus made setting metric_name_validation_scheme and
metric_name_escaping_scheme mandatory; use sane defaults.

## Summary

| Category | Count |
|----------|-------|
| Breaking Changes | 0 |
| Features | 4 |
| Bug Fixes | 4 |
| Other Changes | 0 |
| **Total** | **8** |

# Target-allocator Changes
## v0.127.0 to v0.131.0

Generated on: 2025-08-04 17:36:38

---

### 0.131.0
- [**FEATURE**] `manager, target-allocator, opamp-bridge, must-gather`:
add -trimpath when building binaries
([#4078](open-telemetry/opentelemetry-operator#4078))
- [**FEATURE**] `collector, target allocator, opamp`: Require Go 1.24+
to build the collector, target allocator, and opamp.
([#4173](open-telemetry/opentelemetry-operator#4173))
- [**BUG FIX**] `target allocator`: check CRD availability before
registering informers
([#3987](open-telemetry/opentelemetry-operator#3987))
- [**BUG FIX**] `target allocator`: Allow collector to use TLS Config
from Target Allocator with ScrapeConfig
([#3724](open-telemetry/opentelemetry-operator#3724))
This change allows the target allocator to configure TLS Config for a
collector using the ScrapeConfig.
### 0.129.1
- [**BREAKING**] `targetallocator, collector`: Remove stable feature
gate PrometheusOperatorIsAvailable
([#4141](open-telemetry/opentelemetry-operator#4141))
- [**FEATURE**] `target allocator`: Adds support for HTML output in the
target allocator.
([#3622](open-telemetry/opentelemetry-operator#3622))
- [**BUG FIX**] `target allocator`: ensure stable iteration order of
target labels when generating hash
([#4082](open-telemetry/opentelemetry-operator#4082))
- [**BUG FIX**] `target allocator`: Fix OpenShift must-gather for Target
Allocator
([#4084](open-telemetry/opentelemetry-operator#4084))

## Summary

| Category | Count |
|----------|-------|
| Breaking Changes | 1 |
| Features | 3 |
| Bug Fixes | 4 |
| Other Changes | 0 |
| **Total** | **8** |

---------

Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Co-authored-by: Grace Wehner <grace.wehner@microsoft.com>
Co-authored-by: rashmichandrashekar <rashmy@microsoft.com>

Labels

ready to merge (Code review completed; ready to merge by maintainers)
receiver/prometheus (Prometheus receiver)
receiver/purefa
receiver/purefb
receiver/simpleprometheus

4 participants