Counter, UpDownCounter, and Gauge instruments compared #156
Conversation
A few overall comments:
- the introduction feels too long and doesn't really make it clear what the problem is -- I think that's the weakest part right now
- I would suggest adding concrete examples of real-world metrics for each of the instrument types, to help example-based learners (like me) understand more easily
> ## Background
>
> OpenTelemetry has a founding principal that the interface (API) should
tiny nit: 'principal' => 'principle'
> used with Counter and UpDownCounter instruments, there is an assumed
> relationship between the aggregation temporality and the choice of
> synchronous or asynchronous API. Inputs to synchronous
> (UpDown)Counter instruments are the changes of a Sum aggregation
thought: I find the phrase "changes of a Sum aggregation" difficult to mentally parse. What about adding the word "incremental" to more explicitly distinguish it from the behavior of an asynchronous counter, e.g.

> Inputs to synchronous (UpDown)Counter instruments are the incremental changes to a Sum aggregation (i.e., deltas)
+1.
I would suggest that we avoid "incremental" since it might suggest monotonicity (or we will have to put something like "incremental/decremental").
My suggestion would be "Inputs to synchronous (UpDown)Counter instruments are the deltas, while inputs to asynchronous instruments are always the absolute value (e.g. the total sum)."
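To make the delta/cumulative distinction concrete, here is a small illustrative sketch in plain Python. This is not the OpenTelemetry SDK API; the class and method names are invented for illustration. A synchronous counter receives deltas that the aggregator accumulates, while an asynchronous counter's callback reports the absolute total, from which deltas can be derived by subtraction.

```python
# Hypothetical model of the two reporting styles (not the OTel SDK API).

class SyncCounter:
    """Caller reports deltas; the aggregator keeps the running Sum."""
    def __init__(self):
        self.total = 0

    def add(self, delta):
        self.total += delta


class AsyncCounter:
    """Callback reports the absolute total; deltas are derived on collection."""
    def __init__(self, callback):
        self.callback = callback
        self.last_observed = 0

    def collect_delta(self):
        current = self.callback()
        delta = current - self.last_observed
        self.last_observed = current
        return delta


sync = SyncCounter()
sync.add(5)   # "5 more requests happened"
sync.add(3)
assert sync.total == 8

# e.g. totals read from a system counter at each collection
reads = iter([100, 250])
async_counter = AsyncCounter(lambda: next(reads))
assert async_counter.collect_delta() == 100
assert async_counter.collect_delta() == 150   # 250 - 100
```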
> synchronous or asynchronous API. Inputs to synchronous
> (UpDown)Counter instruments are the changes of a Sum aggregation
> (i.e., deltas). Inputs to asynchronous (UpDown)Counter instruments
> are the totals of a Sum aggregation (i.e., cumulatives).
It might be useful to explain (or link to an explanation of) why synchronous Counters are assumed to be deltas vs why async Counters are assumed to be cumulative, to help the reader follow the reasoning of why those things are true (or common).
> Sum points imply a linear scale of measurement. A Sum value that is
> twice the amount of another actually means twice as much of the
> variable was counted. Linear interpolation is considered to preserve
> meaning.
You might consider adding a concrete example of a real life metric to help the reader follow the precise but very technical description above (maybe even a diagram, if it lends itself to that). There are some examples I've seen in other places that you could probably just steal, e.g. https://lightstep.com/blog/opentelemetry-101-what-are-metrics/
I'm thinking that would help people like me who learn more readily through examples. Same below with Gauge events.
> ## New measurements: Counter and UpDownCounter instruments
>
> Sum points have been defined to have linear scale of measurement,
I had some trouble deciphering the phrase "linear scale of measurement". I understand the concept, after reading the rest of the paragraph, but I wonder if there's a different way to say this, maybe by adding an explanatory aside, something like this (although it probably doesn't use the right terminology)
"Sum points have been defined to have linear scale of measurement, meaning the same Sum point value could be obtained through many different combinations of metric event values. This property can also be applied in reverse, meaning that Sum points can be subdivided."
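The "subdivision" property suggested above can be shown with a short sketch (plain Python; the metric attributes and values are made up): because Sum points are additive, per-attribute series subdivide the total, and re-summing any grouping of them recovers it.

```python
# Sketch: Sum points are additive, so attribute subdivisions
# of a total can be merged back without losing meaning.
# Attribute names and values here are invented for illustration.

points = [
    {"attrs": {"method": "GET",  "code": 200}, "value": 70},
    {"attrs": {"method": "GET",  "code": 500}, "value": 5},
    {"attrs": {"method": "POST", "code": 200}, "value": 25},
]

total = sum(p["value"] for p in points)

# Group by "method", erasing the "code" attribute.
by_method = {}
for p in points:
    key = p["attrs"]["method"]
    by_method[key] = by_method.get(key, 0) + p["value"]

assert by_method == {"GET": 75, "POST": 25}
assert sum(by_method.values()) == total  # subdivision preserves the total
```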
> instrumentation library with or without subdivided Sums and to
> meaningfully aggregate data with a mixture of attributes.
>
> ## New measurements: Gauge instruments
Nice! This section is super clear and easy to understand.
nit: "New measurements" => "Adding new measurements" for clarity? Same below.
> The OpenTelemetry Metrics API introduces a new kind of instrument, the
> UpDownCounter, that behaves like a Counter, meaning that attributes
> subdivide the variable being counted, but their primary interpretation
> is like that of a Gauge.
At this point I'm thinking, "ok, that's nice, but why should I care?"
I would suggest starting this doc at the very top by briefly illustrating the problem that exists when you only have Counter and Gauge instruments, to help the reader see that things aren't fine the way they are, and motivate them to understand the problem and care about this solution.
Actually, you might just start with the use-case that you wrote up for me in our slack chat about this.
The user journey I'm after is this:
1. As a user I know nothing about metrics named random_oss_software.* but they were produced with OTel SDKs.
2. I click "Create a dashboard"
3. I enter the "random_oss.software.*" pattern
4. I get a good dashboard.
and then show at least one scenario (doesn't have to be exhaustive) where this is not possible to do without the UpDownCounter instrument.
Then you could move into the rest of the doc as written.
> the meaning in the original events is preserved in the complete set of
> timeseries.
>
> Removing attributes from metric streams without changing meaning
I'd suggest moving this down below the example, and adding a sentence afterward to draw attention to the problem, e.g.
"...which means applying the natural aggregation function to merge metric streams. But how do we know which is the "natural" aggregation function? For Counters the answer is always SUM (for: reason), however for Gauges it might be SUM or MEAN depending on the semantics of the values represented by the particular metric. (for: reason). Therein lies the problem."
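The "natural aggregation" question raised above can be sketched in plain Python (names and values invented): when an attribute is dropped, Sum points merge by addition, while Gauge points require a semantic choice, such as the mean, that depends on the particular metric.

```python
# Sketch of "natural" aggregation when an attribute is erased:
# Counters always merge with SUM; Gauges need a per-metric choice.

def drop_attribute(points, keep_key, agg):
    """Group points by keep_key and merge each group with agg."""
    groups = {}
    for p in points:
        groups.setdefault(p["attrs"][keep_key], []).append(p["value"])
    return {k: agg(v) for k, v in groups.items()}


counter_points = [
    {"attrs": {"host": "a", "code": 200}, "value": 10},
    {"attrs": {"host": "a", "code": 500}, "value": 2},
]
# Counters: SUM is always meaningful (12 total requests on host "a").
assert drop_attribute(counter_points, "host", sum) == {"a": 12}

gauge_points = [
    {"attrs": {"host": "a", "cpu": 0}, "value": 40.0},  # e.g. temperature
    {"attrs": {"host": "a", "cpu": 1}, "value": 60.0},
]
# Gauges: SUM would claim "100 degrees"; MEAN (or last value) may be intended.
def mean(values):
    return sum(values) / len(values)

assert drop_attribute(gauge_points, "host", mean) == {"a": 50.0}
```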
> - Ratios are not necessarily meaningful
> - Linear interpolation is not necessarily supported.
>
> ## Attributes are used for interpretation
This section header doesn't quite seem to flow naturally with the text, maybe call this section something
like "Adding attributes to events for additional meaning" to help tie it to previous and subsequent sections.
> _Metric timeseries_ is the output of aggregating a stream of data
> points for a specific set of resource and attribute dimensions.
>
> ## Meaning and interpretation of Counter and UpDownCounter events
I don't know if this is what you're going for, but if your main goal is to explain the need for UpDownCounter, it might make sense to omit it from this section and wait to present it later as a solution to the problem described in this doc. If that's the goal, it feels a bit premature to describe it here, because I as a reader don't yet know why it's necessary or useful and I get a bit lost.
> _Metric data stream_ is a collection of data points, written by a
> writer, having an identity that consists of the instrument's name, the
> instrumentation library, resource attributes, and metric attributes.
Is this meant to help define an interface/standard or simply saying that a data stream encompasses all of the data and the source?
> _Metric data points_ are the items in a stream, each has a point
> kind. For the purpose of this text, the point kind is Sum or Gauge.
> Sum points have two options: Temporality and Monotonicity.
Temporality is previously defined, but not Monotonicity.
> measurements. No implied linear scale of measurement, therefore:
>
> - Rate-of-change may not be well defined
> - Ratios are not necessarily meaningful
These statements feel too strong to me. I can think of instances where gauges are very useful in context of rate, ratios, and trend modeling. However, I think it's important to say that Gauges are stamped with their own timestamp. Valuable analysis likely needs some data alignment (aggregating by some method to regular time intervals) in order to make meaningful comparisons.
> The same aggregation can be applied when removing an attributes from
> metric streams forces reaggregation. The most current value should be
> selected. In case of identical timestamps, a random value should be
The random value sample for time collisions is an important decision that should be clearly documented. There's a lot of behavior that surfaces when this happens that is difficult to understand. I do think it's a good solution, just needs to be clearly documented.
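The behavior being discussed, "most current value wins, random choice on a timestamp tie", can be sketched as follows (plain Python; point shapes invented for illustration):

```python
# Sketch of gauge reaggregation when an attribute is erased:
# keep the most recent point; break exact timestamp ties randomly.
import random

def merge_gauges(points):
    """Merge gauge points into one value: latest timestamp wins."""
    latest = max(p["time"] for p in points)
    candidates = [p for p in points if p["time"] == latest]
    return random.choice(candidates)["value"]


points = [
    {"time": 1, "value": 10.0},
    {"time": 3, "value": 30.0},   # most current value wins
    {"time": 2, "value": 20.0},
]
assert merge_gauges(points) == 30.0

tied = [{"time": 5, "value": 1.0}, {"time": 5, "value": 2.0}]
assert merge_gauges(tied) in (1.0, 2.0)  # tie: a random value is selected
```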
nit: removing an attributes
=> either "removing an attribute" or "removing attributes"
There are several things that are hard for me to understand. It is very likely that this is the case because of how little I know of the concepts of metrics and the terminology used to describe them. Nevertheless, maybe this can be useful information to the author in case it is possible to rephrase some of this OTEP in order to make it a bit more clear to more inexperienced readers 🙂
> ## Glossary
>
> _Meaning_: Metrics API events have a semantic definition that dictates
This would be a bit easier to understand if the glossary also had an entry for Metrics API events.
> Metrics API instruments. For the purpose of this text, it is a
> Counter, an UpDownCounter, or a Gauge.
>
> _Metric attributes_ can be applied to Metric API events, which allows
This describes what can be done with metric attributes (can be applied to metric API events), but it does not actually define what metric attributes are.
I would imagine attributes are specific to each metric so the spec wouldn't define what the attributes are. This is what is referred to as labels in the ASCII art in the spec but the agreed-upon name going forward is attributes.
> attribute dimensions.
>
> _Metric data stream_ is a collection of data points, written by a
> writer, having an identity that consists of the instrument's name, the
Is written by a writer redundant?
> attribute dimensions.
>
> _Metric data stream_ is a collection of data points, written by a
> writer, having an identity that consists of the instrument's name, the
> ...having an identity that consists of the instrument's name
What instrument is this? How is this instrument related to the metric data stream?
> attribute dimensions.
>
> _Metric data stream_ is a collection of data points, written by a
> writer, having an identity that consists of the instrument's name, the
Who has an identity that consists of the instrument's name, the instrumentation library, resource attributes, and metric attributes? The Metric data stream or the data points?
> The rate interpretation is preferred for monotonic Sum points, and the
> the current total interpretation is preferred for non-monotonic Sum
> points. Both interpretations are meaningful and useful for both kinds
Bit confused here, first this says that one interpretation is preferred for a certain kind of sum point and the other one is preferred for the other kind of sum point. Then it says that both are "meaningful and useful" for both kinds of sum point. Then, why is one preferred over the other? 🤷
> Sum points imply a linear scale of measurement. A Sum value that is
> twice the amount of another actually means twice as much of the
> variable was counted. Linear interpolation is considered to preserve
Sorry, what do you mean with "linear interpolation is considered to preserve meaning"? Do you mean that it is possible to interpolate linearly between two sum points?
> Gauge instruments produce Gauge metric data points are taken to
> have meaning in a metric stream as follows:
>
> - Gauge point values are individual measurements captured at an instant in time
Aren't sum points also individual measurements captured at an instant in time?
> synchronous and asynchronous measurements. When recording Gauge
> values through a synchronous API, the interpretation is "last known
> value", and when recording Gauge values through an asynchronous API
> the interpretation is "current value".
Hm, "last known value" and "current value" are very similar. It is hard to understand the difference between synchronous gauge and asynchronous gauge using this concept.
> the interpretation is "current value".
>
> The distinction between last known value (synchronous) and current
> value (asynchronous) is considered not significant in the data model.
Hm, but it is significant for the reader of this document to understand the difference between synchronous gauge and asynchronous gauge...
I like a lot of things in this document, but I think it suffers from a few things.

- The bottom section outlining when to use Gauge vs. UpDownCounter is gold and should be teased apart or put first.
- The notion of "safe operations" to perform regarding Attributes (like removal/addition) should be formalized as part of the DataModel. I think you took a good first crack, but we may want to flesh out ALL of these things so that it's clear what operations are allowed in "telemetry schema" and which are considered "compatible".
- The distinction of "last value" vs. "current value" is (currently) a bit confusingly worded. It also makes me wonder if we should represent SyncGauge differently in OTLP regarding "start timestamp" and "current timestamp". E.g. should we ALWAYS have sync gauge use the timestamp when the value was recorded vs. when it is reported externally?
> zero with each interval (_delta_) or accumulated over a sequence of
> intervals (_cumulative_). Both forms of temporality are considered
> important, as they offer a useful tradeoff between cost and
> reliability. The data model specifies that a change of temporality
If you're talking about versions of a service, possibly. I could code service A to use DELTAs, then decide to switch to cumulatives in version 2. Practically, you shouldn't expect a service to go back and forth between delta and cumulative points.
> Note that these two statements imply different interpretation for
> synchronous and asynchronous measurements. When recording Gauge
> values through a synchronous API, the interpretation is "last known
> value", and when recording Gauge values through an asynchronous API
Not sure I buy this. In a synchronous Gauge, recording a value is "Here's the current value".
The difference in Async vs. Sync is whether you can have the "current value" or "last known value" be sampled on-demand.
i.e. I think the terminology here is being phrased from an SDK/exporter perspective vs. the instrument's perspective (which I assume is closer to a user of metrics API)
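One way to make the sync/async distinction under discussion concrete is a small sketch (plain Python, not the OTel SDK; class names and timestamps are invented): a synchronous gauge keeps the last value recorded at event time, while an asynchronous gauge is sampled fresh at collection time.

```python
# Sketch: sync gauge reports the last *recorded* (value, timestamp);
# async gauge is sampled on demand at collection time.

class SyncGauge:
    def __init__(self):
        self.last = None

    def record(self, value, timestamp):
        self.last = (value, timestamp)   # last known value, at event time

    def collect(self, now):
        return self.last                 # timestamp may be older than `now`


class AsyncGauge:
    def __init__(self, callback):
        self.callback = callback

    def collect(self, now):
        return (self.callback(), now)    # current value, at collection time


g = SyncGauge()
g.record(0.7, timestamp=10)
assert g.collect(now=30) == (0.7, 10)    # "last known value", from t=10

a = AsyncGauge(lambda: 0.9)
assert a.collect(now=30) == (0.9, 30)    # "current value", sampled now
```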
> the interpretation is "current value".
>
> The distinction between last known value (synchronous) and current
> value (asynchronous) is considered not significant in the data model.
I think you're implicitly saying that when you export metrics, you're setting the Timestamp for a synchronous Gauge to "export time", whereas the actual timestamp was when the synchronous instrument sent the last piece of data. E.g. reporting interval:

```
|    X    X    |
t0   t1   t2   t3
```

Here, the synchronous gauge is reporting its value at t1 and t2. For the collection interval t0 -> t3, we report the value recorded at t2, BUT you're saying for DELTA + CUMULATIVE it would report its timestamp at t3, whereas an asynchronous instrument would just capture a value at t3. I think this needs some kind of picture.
> data requires attention to the meaning being conveyed.
>
> Safe attribute erasure for OpenTelemetry Metrics streams is specified
> in a way that preserves meaning while removing only the forms of
What if "removing the forms of interpretation that made use of the erased attribute" basically destroys the practical value of the metric? I think this is implying that removing an attribute from a metric is an ok thing to do (it's not). The meaning that IS preserved is relevant to reaggregation + processing, but could be disastrous to dashboards and users.
> In database systems, this process is refered to as a performing a
> "Group-By", where aggregation is used to combine streams within each
> distinct set of grouped attributes. For the benefit of OpenTelemetry
This kind of falls over for me in practice. Database Group-By is a full query language where NORMALLY you need to provide your own Aggregation semantic for every column of data or the query will fail. What we're suggesting here is instead of:

```sql
SELECT AVG(count) FROM x GROUP BY label
```

the default aggregation function is inferred by otel based on the metric type. Two things:

1. We shouldn't prevent users from using different aggregation functions in practice. Indeed, once they make it to PromQL (or MQL for GCP), they'll be able to do whatever they want here and need to track their own meaning.
2. I assume a lot of the gymnastics we go through around default aggregation is to avoid exposing a query-like language for "rewrite rule" style collection behavior.
My $.02 here is the focus on giving users a way to solve "rewrite rules" use cases is good. Making it as easy as possible is good. If we can't explain what is and isn't safe in very simple terms, we might be in trouble. If the "meaning" we retain isn't the one users wanted to retain, then we're not really adding value.
> When conveying meaning to the user by grouping and aggregating over a
> subset of attribute keys, the default aggregation selected should be
> one that preserves meaning. For monotonic Counter instruments, this
> means conveying the combined rate of each group. For UpDownCounter
Why does monotonic counter instruments turn into a rate? Did I miss an above description on how UpDown vs. Counter have different aggregation meaning?
Maybe this description?
I will close this until such a point in time as someone convinces me it's needed.
**Strong argument for exporting an Asynchronous UpDownCounter to OTLP as a non-monotonic sum rather than as a gauge?**

In Stack Overflow, I recently asked that question. A Stack Overflow user commented on my question with a link to this (closed) OTEP. (Thank you! 👍)

**In brief**

While this OTEP is interesting, I didn't find an answer here to that question. I invite answers to that question; either here, or (I guess, preferably) in Stack Overflow. (I'm interested in the answer, wherever it might be.)

**In detail**

I actually have someone asking me now, "Why bother exporting values from Asynchronous UpDownCounter and Asynchronous Gauge instruments differently in OTLP? Why not just export them all in the OTLP as gauges?" I don't have a good answer for them. As I wrote in that Stack Overflow question:
No offense intended, but the following, from this OTEP:
sounds similarly weak to me ("are encouraged"). Especially, given this in the OTEP:
whereas, the Collector transform processor seems to have no problem applying a sum function to gauges; that is, combining multiple Gauge measurements using addition. Which, according to this OTEP, isn't a meaningful operation. (Please correct me if I'm wrong! I confess I haven't tried that in practice with gauges in the transform processor; for now, I'm just reading the transform processor docs.)
Non-monotonic sum has additional semantics that gauge doesn't provide. Keeping such semantic will help downstream consumers to pick a better experience if they want to do so.
I want to understand more about this part "sounds like weak sauce". Would you share more about what makes you feel this is weak?
Hi @reyang, Thank you so much for responding to my comment, much appreciated. You wrote:
Yes. I understand and acknowledge that. Specifically, in terms of the representation of that difference in OTLP (I doubt this is news to you, but just so you know that I know 🙂):
But, what do you say to someone who shrugs at that "additional semantics" statement, and says, "So what? Show me an example where that makes a practical difference! Otherwise, I see no practical reason to make that distinction in the OTLP; I'm going to continue exporting non-monotonic sums to OTLP as gauges."

I went looking at OpenTelemetry Collector, expecting to find a clear example. Something that would stand out to me, that I could point to and say, "There! This is why the distinction is important in practice!". But, so far, I've failed. In fact, what I've seen indicates that the OpenTelemetry Collector transform processor agnostically gives users the freedom to pick a "bad" experience; to perform aggregations that this OTEP characterizes as not meaningful; to add multiple gauge values. From the OpenTelemetry Collector transform processor docs:
You wrote:
I can't see a strong practical reason to make this distinction in the OTLP. I want to, but I can't find one. In particular:
Can you point me to actual examples of where these different representations in OTLP (
I think telemetry vendors are in a better position to provide such examples (and this is where vendors actually compete).
Re: Strong argument for exporting an Asynchronous UpDownCounter to OTLP as a non-monotonic sum rather than as a gauge?

The strongest argument I can think of right now is based purely on semantics, rather than the behavior of any consumer. By representing a numeric value in OTLP as a
That is, the values are non-additive. Whereas, non-monotonic cumulative sums can be meaningfully combined using addition. The values are additive.

Suppose an OTLP exporter faithfully exports non-additive values (from an Asynchronous Gauge instrument) to the big, wide world as gauges. The first stop in the outside world might be OpenTelemetry Collector. Collector says, "You wanna aggregate those gauges by adding them together? No problem, buddy." The last stop is a telemetry backend. All the backends I know say, "Non-monotonic cumulative sum, gauge, I don't care. It's all the same to me. You wanna add temperatures together? That's your business. Knock yourself out." So much for semantics.

I've spent the last few years helping to forward asynchronous metrics from proprietary metrics systems (predating OpenTelemetry) to various backends, such as the Elastic Stack and Splunk. In my experience (although, I acknowledge, my skills are no longer as current as they were in this area), those backends typically ingested numbers. Typically, their data models didn't distinguish between additive and non-additive numbers. It was up to you, the user, to understand the nature of your data. (Exception: metric types, such as counter and gauge, in Prometheus.)

In that sense, I like the idea of a data model that makes semantic distinctions such as "additive" and "non-additive". I'm looking for compelling practical reasons to implement that semantic distinction. So far, I haven't found any.
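The additive vs. non-additive distinction discussed above can be made concrete with a toy example (plain Python; all names and values invented): queue lengths, as non-monotonic sums, combine meaningfully by addition, while temperatures, as gauges, do not, even though the arithmetic itself never fails.

```python
# Toy illustration of additive vs. non-additive values.

queue_lengths = {"worker-1": 4, "worker-2": 7}    # items waiting, per worker
temperatures  = {"cpu-0": 40.0, "cpu-1": 60.0}    # degrees C, per core

# Additive: 11 items are genuinely waiting across the fleet.
assert sum(queue_lengths.values()) == 11

# Non-additive: "100 degrees" describes no real thing. The arithmetic
# succeeds anyway; only a consumer that knows the point kind can refuse
# or reinterpret the operation.
meaningless = sum(temperatures.values())
assert meaningless == 100.0  # arithmetic works, meaning does not
```

This is exactly the gap the comment describes: without the Sum/Gauge distinction in OTLP, nothing downstream can tell these two cases apart.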
This is written in support of open-telemetry/opentelemetry-specification#1297.