
[exporter/prometheusremotewrite] "target" metric causes out-of-order samples errors on load balanced collector deployment #12300

Closed
badjware opened this issue Jul 11, 2022 · 25 comments


badjware commented Jul 11, 2022

Describe the bug
We operate a cluster of multiple opentelemetry-collector instances configured to be load balanced and autoscaled. We recently updated from 0.35 to 0.49. This update caused our backend to start complaining about out-of-order samples on the "target" metric:

2022-07-11T21:19:07.569Z        error   exporterhelper/queued_retry.go:149      Exporting failed. Try enabling retry_on_failure config option.  {"kind": "exporter", "name": "prometheusremotewrite", "error": "Permanent error: Permanent error: remote write returned HTTP status 400 Bad Request; err = <nil>: user=[hidden]: err: out of order sample. timestamp=2022-07-11T21:19:00.673Z, series={__name__="target", http_scheme="http", instance="[hidden]:8080", job="cadvisor", net_host_name="[hidden]", net_host_port="8"}

We think the issue stems from how the timestamp of the "target" metric is generated. The timestamp of the sample is derived from the batch of metrics, and because samples are generated concurrently in different instances of opentelemetry-collector, they may not align and can reach the backend with timestamps out of order. In addition, because every instance of opentelemetry-collector emits the metric with the exact same set of labels, it causes metric collisions.

Steps to reproduce

  1. Configure a deployment of at least 2 opentelemetry-collector instances with identical configs
  2. Configure a load balancer in front of the opentelemetry-collectors (in our case, an AWS ALB)
  3. Send metrics to the load-balanced endpoint.

What did you expect to see?
We wish to have an additional label just for the "target" metric. This label would allow a separate series for each instance of opentelemetry-collector. external_labels is not appropriate for our use-case because it affects all metrics and causes duplication issues, cardinality explosion and generally a big mess.
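To make the ask concrete, here is a minimal sketch of what we have in mind, assuming a hypothetical exporter option (the target_info_labels key below does not exist today; it only illustrates the requested behaviour):

exporters:
  prometheusremotewrite:
    endpoint: "${PROMETHEUS_REMOTE_WRITE_ENDPOINT}"
    # hypothetical: labels applied only to the generated "target" metric,
    # not to the regular metric series
    target_info_labels:
      replica: "${NAME}"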

What did you see instead?
Many errors in the logs of opentelemetry-collector and multiple dropped samples of the "target" metric because of out-of-order sample errors.

What version did you use?
0.49

What config did you use?

receivers:
  otlp:
    protocols:
      grpc:

exporters:
  prometheusremotewrite:
    endpoint: "${PROMETHEUS_REMOTE_WRITE_ENDPOINT}"
    timeout: 30s
    # external_labels:
    #   replica: "${NAME}"

service:
  extensions: []
  pipelines:
    metrics:
      receivers: [otlp]
      processors: []
      exporters: [prometheusremotewrite]

Environment
deployed in AWS EKS
remotewrite endpoint is AWS Amazon Managed Service for Prometheus

Additional context
The "target" metric has been added in the following pr: #8493. There is an interesting comment about the choice of the timestamp of the sample.

I am willing to work on a pr.

@badjware badjware added the bug Something isn't working label Jul 11, 2022

hardproblems commented Jul 15, 2022

We're also seeing this issue when using the prometheusreceiver with the prometheusremotewriteexporter. It seems that the prometheusreceiver converts the scrape target label into a datapoint attribute instead of a resource attribute (which seems more appropriate imo), so collectors with the same prometheusreceiver config will have the same labelset on the target metric and overwrite each other. It's possible that the otlpreceiver has the same issue. It would be good if, in addition to specifying a namespace prefix, we could specify a label that is applied only to the target metric. cc @dashpole

@dashpole
Contributor

What resource attributes are attached to your OTLP metrics when you send them? The set of resource attributes should uniquely identify the sender; otherwise there would be collisions even if there were one collector per sender.

If the resource attributes uniquely identify the sender, the current timestamp logic should work. If it sends batch 1 and then batch 2, the timestamp of batch 1 will be before that of batch 2, since the datapoint timestamps in batch 1 are all before those in batch 2. There is always the chance that the two batches will be reordered in transit, but that chance exists for any metric.

It seems that the prometheusreceiver converts scrape target label as a datapoint attribute instead of as resource attribute (which seems more appropriate imo)

Can you expand on this? Labels on the target_info metric should be converted to resource attributes.

@dashpole
Contributor

also cc @Aneurysm9


hardproblems commented Jul 19, 2022

It seems that the prometheusreceiver converts scrape target label as a datapoint attribute instead of as resource attribute (which seems more appropriate imo)

Here's my test collector config:

exporters:
  file/debug:
    path: fileexporter_output
receivers:
  prometheus/otelcol:
    config:
      scrape_configs:
      - job_name: otelcol
        scrape_interval: 1s
        static_configs:
        - labels:
            pod: testpod
          targets:
          - localhost:10694
service:
  pipelines:
    metrics/prometheus:
      exporters:
      - file/debug
      receivers:
      - prometheus/otelcol
  telemetry:
    logs:
      level: INFO
    metrics:
      address: localhost:10694
      level: detailed

I looked at the metrics in fileexporter_output and saw something like the following

{"resourceMetrics":[{"resource":{"attributes":[{"key":"service.name","value":{"stringValue":"otelcol"}},{"key":"service.instance.id","value":{"stringValue":"localhost:10694"}},{"key":"net.host.port","value":{"stringValue":"10694"}},{"key":"http.scheme","value":{"stringValue":"http"}}]},"scopeMetrics":[{"scope":{},"metrics":[{"name":"up","description":"The scraping was successful","gauge":{"dataPoints":[{"attributes":[{"key":"pod","value":{"stringValue":"testpod"}}],"timeUnixNano":"1658193682695000000","asDouble":0}]}},{"name":"scrape_duration_seconds","description":"Duration of the scrape","unit":"seconds","gauge":{"dataPoints":[{"attributes":[{"key":"pod","value":{"stringValue":"testpod"}}],"timeUnixNano":"1658193682695000000","asDouble":1.001007048}]}},{"name":"scrape_samples_scraped","description":"The number of samples the target exposed","gauge":{"dataPoints":[{"attributes":[{"key":"pod","value":{"stringValue":"testpod"}}],"timeUnixNano":"1658193682695000000","asDouble":0}]}},{"name":"scrape_samples_post_metric_relabeling","description":"The number of samples remaining after metric relabeling was applied","gauge":{"dataPoints":[{"attributes":[{"key":"pod","value":{"stringValue":"testpod"}}],"timeUnixNano":"1658193682695000000","asDouble":0}]}},{"name":"scrape_series_added","description":"The approximate number of new series in this scrape","gauge":{"dataPoints":[{"attributes":[{"key":"pod","value":{"stringValue":"testpod"}}],"timeUnixNano":"1658193682695000000","asDouble":0}]}}]}]}]}

Version: 0.52.0

Expected: the static target's label (pod: testpod) should be a resource attribute.
Actual: the pod label is a datapoint attribute.

This means that when the metric is exported by the prometheusremotewriteexporter, the label that helps distinguish the target (i.e. pod) will not be added as a label to the target metric, causing multiple receivers' target metrics to collide with each other and result in out-of-order samples.

Essentially, because the target metric is created by the prometheusremotewriteexporter from resource attributes only, we cannot use a processor to selectively add a label to only the target metric. We can use a processor to add a resource attribute, but that means adding the attribute to all timeseries, not just target, which may cause an undesirable cardinality increase. So it would be ideal if there were a way to configure additional labels for the target metric to avoid collisions without requiring such a label to be a resource attribute.
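For reference, the processor-based workaround described above is roughly this (a sketch; the collector.replica key and the ${POD_NAME} placeholder are only examples), and it stamps the attribute onto every exported series, not just target:

processors:
  resource:
    attributes:
      # example only: any value that is unique per collector replica would do
      - key: collector.replica
        value: "${POD_NAME}"
        action: upsert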

@dashpole
Contributor

I think we are probably doing the right thing in the prometheus receiver... The labels config is supposed to add a label to every metric series. If there was an incoming target_info metric, it should add the label to that before it is converted to a resource attribute. #9967 (comment) is also related, but I think there probably needs to be an opt-out for generating target_info in the exporter for cases like these.
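Sketching the opt-out (hypothetical syntax, not an existing option at the time of this discussion):

exporters:
  prometheusremotewrite:
    endpoint: "${PROMETHEUS_REMOTE_WRITE_ENDPOINT}"
    # hypothetical: opt out of generating the target_info metric
    target_info:
      enabled: false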

@hardproblems
Contributor

An opt-out for generating target_info also works, since the prometheusreceiver already generates the up metric, which should serve the same purpose as target. 🙇


jmacd commented Jul 26, 2022

There appears to be a "single-writer principle" violation happening. I understand the desire to opt-out, but can we somehow address the underlying problem as well?


hardproblems commented Jul 27, 2022

We have a setup where the otlp sender pod is responsible for always sending a given timeseries to the same collector pod in a k8s replicaset (using consistent hashing), so our otlp sender doesn't append its own info, like the sender's k8s pod name, as a resource attribute. There can be hundreds of otlp senders. If we added the pod name, it would increase the cardinality of all senders' timeseries 2x whenever senders are redeployed, so we do not want to add it to every time series. Unfortunately this means the target info metric from all the senders is indistinguishable. Another option for us is to append a unique label to only the target metric, but this would be against the spec.


hardproblems commented Jul 27, 2022

This is kind of similar to #11870. There seem to be use cases where, for one reason or another, the sender's unique ID isn't necessary or desirable as a resource attribute. But not having a unique sender ID will always result in the target metric from multiple senders conflicting, whether it's multiple senders sending to one collector, or each sender sending to a different collector but all the collectors writing to the same Prometheus. Hence the need to either disable the target metric, or allow a label to be added to only the target metric and not others.

@dashpole
Contributor

increase the cardinality of all senders' timeseries 2x when senders are redeployed

Is that actually the case? Won't it only increase the cardinality of the target_info series? I wouldn't expect that series to represent a meaningful portion of overall cardinality in most cases.


jmacd commented Jul 27, 2022

I'm not convinced that the implementation of target_info is correct. I believe the single-writer principle is broken by the current behavior. The single-writer principle is consistent with Prometheus requirements, and it should be much harder to break than this. 😁

(I couldn't find where the specification talks about this detail, does it? Forgive me if the following is incorrect.)

From inspection, the Prometheus receiver inserts the target_info and populates it with this code:

...
	// For the `target_info` metric we need to convert it to resource attributes.
	metricName := labels.Get(model.MetricNameLabel)
	if metricName == targetMetricName {
		return 0, t.AddTargetInfo(labels)
	}
...

func (t *transaction) AddTargetInfo(labels labels.Labels) error {
	attrs := t.nodeResource.Attributes()

	for _, lbl := range labels {
		if lbl.Name == model.JobLabel || lbl.Name == model.InstanceLabel || lbl.Name == model.MetricNameLabel {
			continue
		}

		attrs.UpsertString(lbl.Name, lbl.Value)
	}

	return nil
}

On the surface, it appears that removing the job and instance labels from the target_info is the cause of this conflict. How can the target_info otherwise be joined with the metrics to which these resources are related?

If I were asked to design this feature today, I would probably remove it from prometheusreceiver and put the "fix" in prometheusremotewriteexporter. I would assert that job and instance attributes belong in the OTLP Resource attributes section. They should be left in the target_info, and they should be copied into the metric labels by the exporter.

That way, OTLP consumers will not see "job" and "instance" attributes repeated in every metric exported by a Prometheus receiver. Is there an issue or design discussion somewhere we can refer to for more information on this topic?


dashpole commented Jul 27, 2022

Job and instance are converted to service.name and service.instance.id in the prometheus receiver, and are converted back (and added as metric labels) in the PRW exporter, so they should be correctly preserved on the target_info metric. In the issue above, job + instance were not unique (e.g. instance was localhost:10694, and was the same for many collectors), which caused this issue.

OTLP consumers will not see "job" and "instance" attributes repeated in every metric exported by a Prometheus receiver

That is currently the case, since job and instance are converted to service.name and service.instance.id in the receiver.

See prom->otlp and otlp->prom for details. It was proposed in open-telemetry/opentelemetry-specification#2381.
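Roughly, the round trip looks like this (illustrative values only):

# Prometheus scrape side:   job="cadvisor", instance="host:8080"
# OTLP resource attributes: service.name="cadvisor", service.instance.id="host:8080"
# PRW exporter output:      job="cadvisor" and instance="host:8080" are re-added as labels
#                           on every series, and the remaining resource attributes become
#                           labels on target_info only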


jmacd commented Jul 27, 2022

Thank you @dashpole. Sorry for misreading the code. I reviewed the prom->otlp and otlp->prom specs and remembered 2381 (however, I will add that the string "target_info" does not appear in the spec today?).

@hardproblems it sounds like you are looking for "Safe attribute removal" in the collector, but I'm still a bit confused.

If job and instance are not unique in your setup, is it also true that your application metrics have no overlapping attribute sets? You can avoid a single-writer violation by avoiding overlap, but this means you're maintaining uniqueness across senders. Is this true?

If you've got non-overlapping application metrics, then "job" and "instance" are not necessarily identifying, so they can be safely removed without special aggregation. This would explain why you only see errors for target_info. If this is correct, I agree that a simple option to avoid target_info is easy and correct. You could also just drop that metric.

In the hypothetical safe-attribute-removal processor documented in open-telemetry/opentelemetry-specification#1297, under these circumstances with N identical senders, the target_info metric would first appear as N non-monotonic sums each with value 1. The attribute-removal process would replace these with a single target_info having the value N.


hardproblems commented Jul 27, 2022

If job and instance are not unique in your setup, is it also true that your application metrics have no overlapping attribute sets? You can avoid a single-writer violation by avoiding overlap, but this means you're maintaining uniqueness across senders. Is this true?

Correct, we use consistent hashing to ensure a unique timeseries is only ever sent to one collector, so we only see out-of-order samples on the target metric. We run hundreds of OTLP senders which send millions of metric time series, so adding a unique sender ID and causing sender deploys to double the cardinality is something we'd like to avoid.

Safe removal of attributes is a possible alternative, if we know ahead of time which resource attributes there are. This is the case for us today, so I can give that a try. Will there ever be cases in the future where resource metrics are dynamically created (maybe derived from spans) and the attributes are not known at configuration time?

@dashpole
Contributor

however, will add that the string "target_info" does not appear in the spec today?

Yes, I even managed to confuse myself on that front: #12079 (comment). I'll update the spec to reference target_info (rather than an info-typed metric in the target MetricFamily).


jmacd commented Jul 27, 2022

Correct, we use consistent hashing to ensure a unique timeseries is only ever sent to one collector so we only see out of order samples on the target metric.

Sending to a unique collector does not necessarily avoid a single-writer violation. Since your metrics are exiting the collector through PRW, I assume they end up in a database where you will ask PromQL queries? Single-writer violations can still show up at this point. You are either vulnerable to this, or you have somehow ensured that your timeseries do not overlap aside from job and instance which are not unique.

For example, Process A writes counter values [10, 11, 12, 13, 14], and Process B writes counter values [0, 1, 2, 3, 4]. It's possible for these to be interleaved in the storage layer so that the database contains [0, 10, 1, 11, 2, 12, 3, 13, 4, 14] if they actually use the same attributes (and you will not be able to correctly compute the rate from this data).


hardproblems commented Jul 27, 2022

Oh, I meant that we're using consistent hashing upstream of the collector to maintain the uniqueness guarantee. A given time series (a unique combination of metric name and label values) only ever has its data points sent to one collector. An example of a metric sender is a pod in a replicated load-balanced service that wants to report the latency of the requests it served. We don't care about the sender's identity, only the overall percentile / histogram of request latencies seen by all senders. The metric sender uses consistent hashing to send each unique time series to a single collector pod so they are aggregated before being sent to Prometheus. However, the opentelemetry-go sdk adds resource attributes like telemetry_sdk_language and telemetry_sdk_version, which results in a target metric getting generated and out-of-order samples on this single metric. We want accurate SLOs using otelcol_exporter_sent_metrics and otelcol_exporter_sent_failed_metrics but have to hack around this because of the target metric.


jmacd commented Jul 27, 2022

We don't care about the sender's identity, only the overall percentile / histogram of request latencies. The metric sender will use consistent hashing to send each unique time series to a single collector pod.

I don't think you've explained how this avoids the single-writer violation. It sounds like you might have invalid timeseries data, if multiple collectors receive overlapping data and write it to the same Prometheus remote-write backend. Prometheus cannot correctly compute rates from interleaved timeseries (actually overlapping).

However, opentelemetry-go sdk adds resource attributes like telemetry_sdk_language and telemetry_sdk_version which results in a target metric getting generated and out of order samples on this single metric.

I understand now, and I support the option to not generate target_info as a simple workaround. It sounds like you could add a resource processor and drop all the superfluous attributes, at which point the target_info would contain precisely job and instance and become useless. I would definitely support logic to avoid emitting target_info automatically when it contains no useful information.
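For example, roughly (a sketch, assuming the SDK-supplied attributes are the only extra ones; adjust the keys to whatever your senders actually attach):

processors:
  resource:
    attributes:
      # drop the SDK resource attributes that are the only non-identifying
      # content left for target_info
      - key: telemetry.sdk.name
        action: delete
      - key: telemetry.sdk.language
        action: delete
      - key: telemetry.sdk.version
        action: delete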

Stepping back, I filed the issue about safe attribute removal because I absolutely think the collector should support what you're trying to do, i.e., sending overlapping metrics on purpose. It's not a trivial task, however: you end up making tough decisions about how to handle late-arriving data.

In a Statsd metrics system, it would be commonplace to do the kind of aggregation you are describing, but that relies on a feature of the Statsd protocol: points are not timestamped. They arrive "not late" by definition. We can impose a condition similar to Statsd's, that data is "not late" by definition, in order to simplify the processor needed to correctly and safely remove attributes from metrics data in a collector.

@hardproblems
Contributor

I don't think you've explained how this avoids the single-writer violation. It sounds like you might have invalid timeseries data, if multiple collectors receive overlapping data and write it to the same Prometheus remote-write backend. Prometheus cannot correctly compute rates from interleaved timeseries (actually overlapping).

Would you please link to the doc on the single-writer principle? My understanding is that a unique time series should only be sent by a single writer, which we can guarantee in our environment.


jmacd commented Jul 27, 2022

Here it is: https://github.com/open-telemetry/opentelemetry-specification/blob/main/specification/metrics/data-model.md#single-writer

I don't want to make it sound like you're doing something wrong; I was just having trouble understanding how your series are unique when instance and job are not. This is possible when each distinct process ensures its attributes are somehow globally unique. The important point is that if all of your collectors are writing to the same Prometheus remote write endpoint, sending to unique collectors won't avoid single-writer violations.


hardproblems commented Jul 27, 2022

application pods --> metric aggregator pods --> per-pod otel-collector is a very high-level description of our setup.
Our metrics aggregators receive statsd metrics from applications. They aggregate metrics across application pods in order to remove the application pod name as a label. Application pods use consistent hashing to ensure that, for a unique time series, its datapoints are only sent to one aggregator pod (aggregator pod names are used as virtual nodes in the hash ring, and a time series (unique metric identity) belongs to a node); i.e. no two aggregator pods will ever receive data points from the same time series (metric identity) at the same time. The aggregator pod then converts statsd to otlp using the otel-sdk and sends the time series to an otel-collector in its pod. The instance/job labels are added by the collector (from what I can tell, translated from service.instance.id and service.name). The aggregators run identical instrumentation, so their resource attributes are the same across many pods. To distinguish between aggregators, we would need to add the aggregator's pod name as a resource attribute, which we don't want to do, to avoid cardinality spikes. Yes, there are many aggregator pods (multiple writers), but for a single time series, only one aggregator will send data points to the collector at a given time. I don't think it violates the single-writer principle. If it did, we would see out-of-order sample errors for other time series; we only see it for the target metric.


jmacd commented Jul 28, 2022

That makes sense. Thank you for being specific and bearing with me. We've fully explained why only the target_info metric is a problem, and your setup is a reason we should support suppressing target_info.

Definitely, the intention behind a "safe attribute removal processor" is to support what you're doing in the aggregator pods as an otel-collector pipeline instead. Thanks!

@mrblonde91

A question about this bug: I think we're encountering it while upgrading to the latest contrib. Does it have the potential to negatively impact other metrics? We've noticed that things like counters in Prometheus invariably seem to be 10% off, so I'm wondering if they're getting swallowed as a result of this.

@dashpole
Contributor

@mrblonde91 No, this shouldn't impact any other metrics.

@hardproblems
Contributor

@badjware @dashpole can we close this issue now?
