Exporting Low/High Cardinality Metrics #5308

hannahchan · 2022-05-04T01:08:50Z

Hi,

I'm wondering whether the OpenTelemetry Collector can do the following and, if so, how to configure it.

My goal is to send a low-cardinality version of a metric to a monitoring tool like Prometheus, and the high-cardinality version of the same metric to a data lake for offline analysis in a tool like PySpark, with little or no additional instrumentation.

For example, I want to take metrics like this:

http_requests_received_total{ method="GET", code="200", path="/abc/123" } 10
http_requests_received_total{ method="GET", code="200", path="/xyz/123" } 20
http_requests_received_total{ method="GET", code="200", path="/abc/456" } 30

And drop the high-cardinality label and aggregate to get this:

http_requests_received_total{ method="GET", code="200" } 60

This aggregated metric would then be sent to a monitoring tool while the raw unchanged metrics would go to our data lake.
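
As a rough illustration of the aggregation described above (not collector code), the following sketch drops the `path` label from the three sample points and sums whatever ends up with identical labels. The function name `drop_and_sum` and the in-memory representation of a data point are assumptions made for this example only.

```python
from collections import defaultdict

# Sample points copied from the question: (labels, value).
points = [
    ({"method": "GET", "code": "200", "path": "/abc/123"}, 10),
    ({"method": "GET", "code": "200", "path": "/xyz/123"}, 20),
    ({"method": "GET", "code": "200", "path": "/abc/456"}, 30),
]


def drop_and_sum(points, drop):
    """Remove the labels named in `drop`, then sum points whose remaining labels match.

    Hypothetical helper for illustration; it is not part of any OpenTelemetry API.
    """
    totals = defaultdict(int)
    for labels, value in points:
        # A sorted tuple of (name, value) pairs is a hashable identity for the series.
        kept = tuple(sorted((k, v) for k, v in labels.items() if k not in drop))
        totals[kept] += value
    return dict(totals)


aggregated = drop_and_sum(points, drop={"path"})
print(aggregated)  # {(('code', '200'), ('method', 'GET')): 60}
```

The raw `points` list is left untouched, which mirrors the goal of keeping the high-cardinality version available for the data lake.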

Context

We have a large multi-tenant application where tenant sizes range from a single-digit number of users to large tenants with 20,000+ users. Unsurprisingly, larger tenants have more data in our application. We want to understand why our larger tenants experience performance degradation and what it is about their data that causes it. To do this, we want to collect metrics at the tenant level so that our data analyst can identify long-term trends.

bogdandrutu · 2022-05-04T11:54:11Z

Maintainer

Hi @hannahchan,

There are a couple of things here, and unfortunately the answer depends a bit on the backend you will send the data to. I'm not sure whether Prometheus was an example or the real destination; the only thing that matters is whether the backend supports "delta" counters (Prometheus only supports "cumulative" counters).

Let's also start with the input data:

Does the metric come from multiple sources (I expect yes)? Do you "create" conflicts between MTS from different sources after you drop that dimension? In other words, are you removing any label that uniquely identifies the source (I hope not)?
Does your metric come as a "cumulative" counter (in the OTel world, a monotonic cumulative sum)? Can you change the source to emit a "delta" counter without affecting your data lake?

I know I asked a lot of questions, but the solution differs depending on the answers. For example, if your backend does not support "delta" counters, the solution is very complex: most likely you cannot handle all traffic with a single collector instance, so you need "stateful" sharding to guarantee that all timeseries (combination of metric name + dimension + value) go to the same instance of the collector, which is required to calculate the cumulative (always increasing) value correctly.
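
To make the sharding concern concrete, here is a deliberately simplified model (it ignores timestamps and counter resets, and it is not how the collector is implemented). `CumulativeAggregator` is a hypothetical stand-in for one collector instance that remembers the latest cumulative value of each raw series and reports their sum as the aggregated cumulative value.

```python
class CumulativeAggregator:
    """Hypothetical stand-in for one collector instance (illustration only)."""

    def __init__(self):
        # Latest cumulative value seen for each raw series, keyed by path.
        self.last = {}

    def observe(self, path, value):
        self.last[path] = value

    def total(self):
        # Aggregated cumulative value once the `path` label has been dropped.
        return sum(self.last.values())


# Cumulative values of the three raw series from the question.
samples = [("/abc/123", 10), ("/xyz/123", 20), ("/abc/456", 30)]

# Case 1: every series passes through the same instance.
one = CumulativeAggregator()
for path, value in samples:
    one.observe(path, value)

# Case 2: the same series are spread round-robin over two instances.
a, b = CumulativeAggregator(), CumulativeAggregator()
for i, (path, value) in enumerate(samples):
    (a if i % 2 == 0 else b).observe(path, value)

print(one.total())           # 60
print(a.total(), b.total())  # 40 20
```

In the second case both instances would report the same aggregated series name and labels with different cumulative values (40 and 20), which a cumulative-only backend sees as conflicting writes. With delta counters, a backend could simply add the partial contributions together.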

2 replies

hannahchan May 4, 2022
Author

Hi @bogdandrutu,

Thank you for the response. Prometheus was an example, as it's the monitoring tool I'm most familiar with. I was trying to remain source- and destination-agnostic.

The short answers to your good questions are:

We don't intend to aggregate metrics across sources or drop source labels.
The data lake and its consumers can accept a cumulative or delta based counter.

The real application I am currently working with is actually a stateless monolith running on many instances in production, instrumented with StatsD. I am exploring whether it would be possible to stand up an OpenTelemetry Collector with the StatsD receiver as a sidecar for each instance, replacing our existing metrics pipeline and accomplishing the above. I also want to reuse this solution (as much as possible) for newer applications that will be instrumented to push traces/metrics/logs via OTLP.

For now I'm trying to build a proof of concept. From the OpenTelemetry Collector, I want to export aggregated metrics to Prometheus (or a more suitable backend) while at the same time landing the unaggregated, raw metrics in an AWS S3 bucket or AWS Kinesis Data Stream. At this point I don't care about the export data format. The consumers of the S3 bucket or Kinesis stream still need to be built, so the data format is very flexible right now.

From your comment above it sounds like it's best to avoid cumulative counters? I actually haven't given much thought to the impact this has on scaling.

CatherineF-dev Jul 18, 2022

Is it possible to export all metrics in Prometheus format, and then use a Prometheus server to scrape them and query the high-cardinality metrics? cc @bogdandrutu

bogdandrutu · 2022-05-04T17:05:20Z

Maintainer

First, let's talk about the routing part. You have a couple of options:

Use the loadbalancing exporter, which you can configure to route to different exporters that you define. This will probably not work very well.
The second option is to set up two different pipelines from your "statsd receiver", each with its own exporters (the first with Prometheus, the second with the data lake). This causes the metrics data to be pushed to both pipelines (cloned). On the Prometheus pipeline you can then configure processors to do the calculation you need. You can also use the filter processor on either pipeline to control what you don't want to send to its exporter.
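
The "cloned" behaviour of the second option can be pictured with the toy model below. It is only a sketch: `fan_out`, `drop_path`, and `passthrough` are hypothetical names, and the merging of points that share labels after the drop is left out for brevity. The point it shows is that each pipeline works on its own copy, so a processor on the monitoring pipeline does not alter what the data lake pipeline exports.

```python
import copy


def fan_out(batch, pipelines):
    """Hand every pipeline its own deep copy of the batch and collect the results."""
    return {name: process(copy.deepcopy(batch)) for name, process in pipelines.items()}


def drop_path(batch):
    """Stand-in for the processors on the monitoring pipeline: remove the `path` label."""
    for point in batch:
        point["labels"].pop("path", None)
    return batch


def passthrough(batch):
    """Stand-in for the data lake pipeline: export the points unchanged."""
    return batch


batch = [{"labels": {"method": "GET", "code": "200", "path": "/abc/123"}, "value": 10}]
out = fan_out(batch, {"monitoring": drop_path, "data_lake": passthrough})

print(out["monitoring"][0]["labels"])  # {'method': 'GET', 'code': '200'}
print(out["data_lake"][0]["labels"])   # {'method': 'GET', 'code': '200', 'path': '/abc/123'}
```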

For calculating what you need, you have some options:

Write a very simple processor. Since you receive this data from the statsd receiver, all timeseries associated with that source will be in one "pmetric.Metric" object, which means you can do a simple sum over all data points that have the same "kept" labels.
You can use https://github.com/open-telemetry/opentelemetry-collector-contrib/tree/main/processor/metricstransformprocessor#aggregate-label-values to do your task. Unfortunately, this processor has not yet been ported away from its OpenCensus-era implementation, which means otel -> opencensus -> otel conversions happen.

3 replies

bogdandrutu May 4, 2022
Maintainer

From your comment above it sounds like it's best to avoid cumulative counters? I actually haven't given much thought to the impact this has on scaling.

Because you are close to the "source" and can guarantee that all timeseries affected by a transformation that results in a "conflict/merge" pass through the same instance of the collector, things are simpler and you can use cumulative counters as well.

hannahchan May 5, 2022
Author

Just a few more questions from me.

What is a "pmetric.Metric" object? I don't think I've come across the term before.
What are the concerns around using metricstransformprocessor? Are there performance concerns around the conversions? Are there any others?

bogdandrutu May 5, 2022
Maintainer

https://github.com/open-telemetry/opentelemetry-collector/blob/main/pdata/pmetric/generated_alias.go#L93
metricstransformprocessor: performance, and the effort we have underway to consolidate the syntax (meaning the configuration may change dramatically).
