
telemetry receiver prometheus down after some minutes #30835

Closed
carlosmt86-hub opened this issue Jan 29, 2024 · 2 comments
Labels: bug (Something isn't working), needs triage (New item requiring triage)

Comments


carlosmt86-hub commented Jan 29, 2024

Component(s)

No response

What happened?

Description

After some minutes (30-45) the telemetry metrics stop working, and curl -v 127.0.0.1:8888/metrics returns this error:
collected metric "otelcol_exporter_queue_size" { label:{name:"exporter" value:"datadog"} label:{name:"service_instance_id" value:"a2cdae11-0e8f-4b05-a31a-9de31429d8a7"} label:{name:"service_name" value:"otelcol-contrib"} label:{name:"service_version" value:"0.93.0"} gauge:{value:0}} was collected before with the same name and label values

In the logs I have:
warn    internal/transaction.go:123     Failed to scrape Prometheus endpoint    {"kind": "receiver", "name": "prometheus", "data_type": "metrics", "scrape_timestamp": 1706557569579, "target_labels": "{__name__=\"up\", instance=\"127.0.0.1:8888\", job=\"otel-collector\"}"}
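For context, the "was collected before with the same name and label values" message is the duplicate-collection check in the Prometheus client library's registry: the whole /metrics scrape fails if the same metric name and label values are gathered more than once. A minimal Go sketch, using prometheus/client_golang with a hypothetical collector that deliberately emits a duplicate (this is not the collector's actual code), reproduces the same error:

package main

import (
    "fmt"

    "github.com/prometheus/client_golang/prometheus"
)

// dupCollector sends the same gauge twice in a single Collect call, which is
// exactly the condition the registry reports as "was collected before with
// the same name and label values".
type dupCollector struct {
    desc *prometheus.Desc
}

func (c *dupCollector) Describe(ch chan<- *prometheus.Desc) { ch <- c.desc }

func (c *dupCollector) Collect(ch chan<- prometheus.Metric) {
    m := prometheus.MustNewConstMetric(c.desc, prometheus.GaugeValue, 0, "datadog")
    ch <- m
    ch <- m // duplicate: same metric name, same label values
}

func main() {
    reg := prometheus.NewRegistry()
    reg.MustRegister(&dupCollector{
        desc: prometheus.NewDesc("otelcol_exporter_queue_size", "example gauge",
            []string{"exporter"}, nil),
    })
    // Gather fails the whole scrape, just as the collector's /metrics endpoint does.
    if _, err := reg.Gather(); err != nil {
        fmt.Println(err)
    }
}

Once /metrics starts failing like this, every scrape by the prometheus receiver fails as well, which matches the "Failed to scrape Prometheus endpoint" warning above.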

Steps to Reproduce

Enable the collector's own telemetry and scrape it with the Prometheus receiver; after some minutes it stops working.

Expected Result

Telemetry metrics continue working.

Actual Result

Telemetry metrics stop working after some minutes.

Collector version

v0.93.0

Environment information

Environment

OS: (e.g., "Ubuntu 20.04")
Compiler (if manually compiled): (e.g., "go 14.2")

OpenTelemetry Collector configuration

receivers:
  prometheus:
    config:
      scrape_configs:
        - job_name: 'otel-collector'
          scrape_interval: 60s
          static_configs:
            - targets: ['127.0.0.1:8888']
 
processors:
  batch:
    timeout: 1s
  memory_limiter:
    check_interval: 1s
    limit_mib: 200

service:
  pipelines:
    metrics/prometheus:
      receivers: [prometheus]
      processors: [memory_limiter]
      exporters: [datadog]

  telemetry:
    metrics:
      address: 0.0.0.0:8888

Log output

warn    internal/transaction.go:123     Failed to scrape Prometheus endpoint    {"kind": "receiver", "name": "prometheus", "data_type": "metrics", "scrape_timestamp": 1706557569579, "target_labels": "{__name__=\"up\", instance=\"127.0.0.1:8888\", job=\"otel-collector\"}"}

Additional context

No response

@carlosmt86-hub added the bug and needs triage labels on Jan 29, 2024

neeej commented Jan 30, 2024

I can confirm I also see the same issue on version 0.93.0.
I don't scrape the metrics the same way in the collector though, as I use telegraf for that.
My metrics stop after just a couple of minutes, tops.

This worked fine on version 0.91.0.
(0.92.0 doesn't work with AWS ALB due to a bug in grpc-go, so I'm not sure whether that version was affected or not.)

I get the same error from the metrics endpoint:

$ curl http://localhost:8888/metrics
An error has occurred while serving metrics:

collected metric "otelcol_exporter_queue_size" { label:{name:"exporter"  value:"loadbalancing"}  label:{name:"service_instance_id"  value:"5fa99331-b105-4dc7-a62d-b850fa048393"}  label:{name:"service_name"  value:"otelcol-contrib"}  label:{name:"service_version"  value:"0.93.0"}  gauge:{value:0}} was collected before with the same name and label values

I run this on AL2023 (an AWS EC2 arm instance) with this config:

receivers:
  otlp:
    protocols:
      grpc:

exporters:
  loadbalancing:
    protocol:
      otlp:
        timeout: 5s
        sending_queue:
          queue_size: 10000
        tls:
          insecure: true
    resolver:
      dns:
        hostname: otelsampler.etc
        timeout: 3s

processors:
  memory_limiter:
    check_interval: 1s
    limit_percentage: 80
    spike_limit_percentage: 30
  batch:
    send_batch_max_size: 12288
    timeout: 5s

extensions:
  health_check:
  zpages:

service:
  extensions: [zpages, health_check]
  pipelines:
    traces/otlp:
      receivers: [otlp]
      exporters: [loadbalancing]
      processors: [memory_limiter, batch]
    metrics:
      receivers: [otlp]
      exporters: [loadbalancing]
      processors: [memory_limiter, batch]
    logs:
      receivers: [otlp]
      exporters: [loadbalancing]
      processors: [memory_limiter, batch]

I do, however, have other collectors running that the collectors above send to, for testing out sampling. They run in a similar way, also on version 0.93.0, and their metrics still work fine.
They have this config:

receivers:
  otlp:
    protocols:
      grpc:

exporters:
  file/otlpdebug:
    path: /tmp/otlpdebug
    rotation:
      max_megabytes: 10
      max_days: 2
      max_backups: 4
      localtime: true
  otlp/apmserver:
    endpoint: "https://a.apm.etc:8200"
    retry_on_failure:
      max_elapsed_time: 1000s
    sending_queue:
      queue_size: 5000
    timeout: 10s
    headers:
      Authorization: "Bearer NA"

processors:
  memory_limiter:
    check_interval: 1s
    limit_percentage: 80
    spike_limit_percentage: 30
  batch:
    send_batch_size: 5120
    send_batch_max_size: 5120
    timeout: 5s
  tail_sampling:
    decision_wait: 10s
    num_traces: 20000
    expected_new_traces_per_sec: 1000
    policies: [
      {
        name: errors-policy,
        type: status_code,
        status_code: { status_codes: [ERROR] }
      },
      {
        name: latency-policy,
        type: latency,
        latency: { threshold_ms: 500 }
      },
      {
        name: randomized-policy,
        type: probabilistic,
        probabilistic: { sampling_percentage: 10 }
      },
      {
        # Always sample if the force_sample attribute is set to true
        name: force-sample-policy,
        type: boolean_attribute,
        boolean_attribute: { key: force_sample, value: true }
      },
    ]

extensions:
  health_check:
  zpages:

service:
  extensions: [zpages, health_check]
  pipelines:
    traces/otlp:
      receivers: [otlp]
      exporters: [file/otlpdebug, otlp/apmserver]
      processors: [memory_limiter, tail_sampling, batch]
    metrics:
      receivers: [otlp]
      exporters: [otlp/apmserver]
      processors: [memory_limiter, batch]
    logs:
      receivers: [otlp]
      exporters: [otlp/apmserver]
      processors: [memory_limiter, batch]

@juissi-t

@carlosmt86-hub Which PR was this issue fixed with? I'm seeing a similar problem with version 0.92.
