
telemetry receiver prometheus down after some minutes #30835

Closed
carlosmt86-hub opened this issue Jan 29, 2024 · 2 comments
Labels: bug (Something isn't working), needs triage (New item requiring triage)

Comments


carlosmt86-hub commented Jan 29, 2024

Component(s)

No response

What happened?

Description

After some minutes (30-45) the telemetry metrics stop working, and curl -v 127.0.0.1:8888/metrics returns this error:
collected metric "otelcol_exporter_queue_size" { label:{name:"exporter" value:"datadog"} label:{name:"service_instance_id" value:"a2cdae11-0e8f-4b05-a31a-9de31429d8a7"} label:{name:"service_name" value:"otelcol-contrib"} label:{name:"service_version" value:"0.93.0"} gauge:{value:0}} was collected before with the same name and label values

In the logs I have:
warn    internal/transaction.go:123     Failed to scrape Prometheus endpoint    {"kind": "receiver", "name": "prometheus", "data_type": "metrics", "scrape_timestamp": 1706557569579, "target_labels": "{__name__=\"up\", instance=\"127.0.0.1:8888\", job=\"otel-collector\"}"}
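For context, the "was collected before with the same name and label values" message is the duplicate-collection check in the Prometheus client library's registry: the whole /metrics scrape fails if the same metric name and label values are gathered more than once. A minimal Go sketch, using prometheus/client_golang with a hypothetical collector that deliberately emits a duplicate (this is not the collector's actual code), reproduces the same error:

package main

import (
    "fmt"

    "github.com/prometheus/client_golang/prometheus"
)

// dupCollector sends the same gauge twice in a single Collect call, which is
// exactly the condition the registry reports as "was collected before with
// the same name and label values".
type dupCollector struct {
    desc *prometheus.Desc
}

func (c *dupCollector) Describe(ch chan<- *prometheus.Desc) { ch <- c.desc }

func (c *dupCollector) Collect(ch chan<- prometheus.Metric) {
    m := prometheus.MustNewConstMetric(c.desc, prometheus.GaugeValue, 0, "datadog")
    ch <- m
    ch <- m // duplicate: same metric name, same label values
}

func main() {
    reg := prometheus.NewRegistry()
    reg.MustRegister(&dupCollector{
        desc: prometheus.NewDesc("otelcol_exporter_queue_size", "example gauge",
            []string{"exporter"}, nil),
    })
    // Gather fails the whole scrape, just as the collector's /metrics endpoint does.
    if _, err := reg.Gather(); err != nil {
        fmt.Println(err)
    }
}

Once /metrics starts failing like this, every scrape by the prometheus receiver fails as well, which matches the "Failed to scrape Prometheus endpoint" warning above.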

Steps to Reproduce

Enable the collector's own telemetry and scrape it with the Prometheus receiver; after some minutes it stops working.

Expected Result

Telemetry metrics continue working.

Actual Result

Telemetry metrics stop working after some minutes.

Collector version

v0.93.0

Environment information

Environment

OS: (e.g., "Ubuntu 20.04")
Compiler (if manually compiled): (e.g., "go 14.2")

OpenTelemetry Collector configuration

receivers:
  prometheus:
    config:
      scrape_configs:
        - job_name: 'otel-collector'
          scrape_interval: 60s
          static_configs:
            - targets: ['127.0.0.1:8888']
 
processors:
  batch:
    timeout: 1s
  memory_limiter:
    check_interval: 1s
    limit_mib: 200

service:
  pipelines:
    metrics/prometheus:
      receivers: [prometheus]
      processors: [memory_limiter]
      exporters: [datadog]

  telemetry:
    metrics:
      address: 0.0.0.0:8888

Log output

warn    internal/transaction.go:123     Failed to scrape Prometheus endpoint    {"kind": "receiver", "name": "prometheus", "data_type": "metrics", "scrape_timestamp": 1706557569579, "target_labels": "{__name__=\"up\", instance=\"127.0.0.1:8888\", job=\"otel-collector\"}"}

Additional context

No response

@carlosmt86-hub added the bug and needs triage labels on Jan 29, 2024

neeej commented Jan 30, 2024

I can confirm I also see the same issue on version 0.93.0.
I don't scrape the metrics the same way in the collector though, as I use telegraf for that.
My metrics stop after just a couple of minutes, tops.

This worked fine on version 0.91.0.
(0.92.0 doesn't work with AWS ALB due to a bug in grpc-go, so I'm not sure whether that version was affected or not.)

I get the same error from the metrics endpoint:

$ curl http://localhost:8888/metrics
An error has occurred while serving metrics:

collected metric "otelcol_exporter_queue_size" { label:{name:"exporter"  value:"loadbalancing"}  label:{name:"service_instance_id"  value:"5fa99331-b105-4dc7-a62d-b850fa048393"}  label:{name:"service_name"  value:"otelcol-contrib"}  label:{name:"service_version"  value:"0.93.0"}  gauge:{value:0}} was collected before with the same name and label values

I run this on AL2023 (an AWS EC2 arm instance) with this config:

receivers:
  otlp:
    protocols:
      grpc:

exporters:
  loadbalancing:
    protocol:
      otlp:
        timeout: 5s
        sending_queue:
          queue_size: 10000
        tls:
          insecure: true
    resolver:
      dns:
        hostname: otelsampler.etc
        timeout: 3s

processors:
  memory_limiter:
    check_interval: 1s
    limit_percentage: 80
    spike_limit_percentage: 30
  batch:
    send_batch_max_size: 12288
    timeout: 5s

extensions:
  health_check:
  zpages:

service:
  extensions: [zpages, health_check]
  pipelines:
    traces/otlp:
      receivers: [otlp]
      exporters: [loadbalancing]
      processors: [memory_limiter, batch]
    metrics:
      receivers: [otlp]
      exporters: [loadbalancing]
      processors: [memory_limiter, batch]
    logs:
      receivers: [otlp]
      exporters: [loadbalancing]
      processors: [memory_limiter, batch]

I do, however, have other collectors running that the collectors above send to, for testing out sampling. They run in a similar way, also on version 0.93.0, and their metrics still work fine.
They have this config:

receivers:
  otlp:
    protocols:
      grpc:

exporters:
  file/otlpdebug:
    path: /tmp/otlpdebug
    rotation:
      max_megabytes: 10
      max_days: 2
      max_backups: 4
      localtime: true
  otlp/apmserver:
    endpoint: "https://a.apm.etc:8200"
    retry_on_failure:
      max_elapsed_time: 1000s
    sending_queue:
      queue_size: 5000
    timeout: 10s
    headers:
      Authorization: "Bearer NA"

processors:
  memory_limiter:
    check_interval: 1s
    limit_percentage: 80
    spike_limit_percentage: 30
  batch:
    send_batch_size: 5120
    send_batch_max_size: 5120
    timeout: 5s
  tail_sampling:
    decision_wait: 10s
    num_traces: 20000
    expected_new_traces_per_sec: 1000
    policies: [
      {
        name: errors-policy,
        type: status_code,
        status_code: { status_codes: [ERROR] }
      },
      {
        name: latency-policy,
        type: latency,
        latency: { threshold_ms: 500 }
      },
      {
        name: randomized-policy,
        type: probabilistic,
        probabilistic: { sampling_percentage: 10 }
      },
      {
        # Always sample if the force_sample attribute is set to true
        name: force-sample-policy,
        type: boolean_attribute,
        boolean_attribute: { key: force_sample, value: true }
      },
    ]

extensions:
  health_check:
  zpages:

service:
  extensions: [zpages, health_check]
  pipelines:
    traces/otlp:
      receivers: [otlp]
      exporters: [file/otlpdebug, otlp/apmserver]
      processors: [memory_limiter, tail_sampling, batch]
    metrics:
      receivers: [otlp]
      exporters: [otlp/apmserver]
      processors: [memory_limiter, batch]
    logs:
      receivers: [otlp]
      exporters: [otlp/apmserver]
      processors: [memory_limiter, batch]

@juissi-t

@carlosmt86-hub Which PR was this issue fixed with? I'm seeing a similar problem with version 0.92.
