Counter, UpDownCounter, and Gauge instruments compared #156
Original file line number | Diff line number | Diff line change | ||||
---|---|---|---|---|---|---|
@@ -0,0 +1,356 @@ | ||||||
# Counter, UpDownCounter, and Gauge instruments explained | ||||||
|
||||||
Counter and Gauge instruments are different in the ways they convey | ||||||
meaning, and they are interpreted in different ways. Attributes | ||||||
applied to metric events enable further interpretation. Because of | ||||||
their semantics, the interpretive outcome of adding an attribute for | ||||||
Counter and Gauge instruments is different. | ||||||
|
||||||
With Counter instruments, a new attribute can be introduced with | ||||||
Sorry, introduced to what? Who does the attribute belong to? The counter instrument? Do you mean something like this? A new counter attribute can be introduced with additional measurements to subdivide a variable count. |
||||||
additional measurements to subdivide a variable count. | ||||||
|
||||||
With Gauge instruments, a new attribute can be introduced with | ||||||
additional measurements to make multiple observations of a variable. | ||||||
|
||||||
The OpenTelemetry Metrics API introduces a new kind of instrument, the | ||||||
UpDownCounter, that behaves like a Counter, meaning that attributes | ||||||
subdivide the variable being counted, but its primary interpretation | ||||||
is like that of a Gauge. | ||||||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. At this point I'm thinking, "ok, that's nice, but why should I care?" I would suggest starting this doc at the very top by briefly illustrating the problem that exists when you only have Counter and Gauge instruments, to help the reader see that things aren't fine the way they are, and motivate them to understand the problem and care about this solution. Actually, you might just start with the use-case that you wrote up for me in our slack chat about this.
and then show at least one scenario (doesn't have to be exhaustive) where this is not possible to do without the UpDownCounter instrument. Then you could move into the rest of the doc as written. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I find this a bit hard to understand. The OpenTelemetry Metrics API introduces a new kind of instrument, the UpDownCounter, that behaves like a Counter, Perfect, so far so good 👍 meaning that attributes subdivide the variable being counted, It starts getting a bit confusing here. Who do these attributes belong to? I assume they belong to the UpDownCounter instrument. The behaves like a Counter, meaning that attributes subdivide the variable being counted tells me that the Counter attributes subdivide the variable being counted, and since UpDownCounter behaves like a Counter, then I understand that UpDownCounter has attributes that also subdivide the variable being counted. but their primary interpretation is like that of a Gauge. Ok, here it gets confusing. the usage of "but" here makes me understand that the difference between Counter and UpDownCounter is that the primary interpretation of the latter is like that of a Gauge. What does that mean? The difference between the words Counter and UpDownCounter is "UpDown" so, is the "UpDown" what a Gauge does? 🤷 I was expecting something like this instead but the UpDownCounter is non monotonic. |
||||||
|
||||||
## Background | ||||||
|
||||||
OpenTelemetry has a founding principle that the interface (API) should | ||||||
tiny nit: 'principal' => 'principle' |
||||||
be decoupled from the implementation (SDK), thus the Metrics project | ||||||
set out to define the meaning of metrics API events. | ||||||
|
||||||
OpenTelemetry uses the term _temporality_ to describe how Sum | ||||||
aggregations are accumulated across time, whether they are reset to | ||||||
zero with each interval (_delta_) or accumulated over a sequence of | ||||||
intervals (_cumulative_). Both forms of temporality are considered | ||||||
important, as they offer a useful tradeoff between cost and | ||||||
reliability. The data model specifies that a change of temporality | ||||||
The temporality of a sum aggregation can change? I mean, can it be delta now, cumulative later, then delta again?
If you're talking version of a service, possibly. I could code service A to use DELTAs, then decide to switch to cumulatives in version 2. Practically you shouldn't expect a service to be going back and forth between delta + cumulative points. |
||||||
does not change meaning. | ||||||
|
||||||
OpenTelemetry recognizes both synchronous and asynchronous APIs are | ||||||
useful for reporting metrics, and each has unique advantages. When | ||||||
used with Counter and UpDownCounter instruments, there is an assumed | ||||||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. What is being used with counter and updowncounter instruments? There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I read that line to mean that the terms synchronous and asynchronous when used with Counter and UpDownCounter. As in synchronous Counter vs asynchronous Counter etc. (SumObserver as I read on Lightstep docs, also see the comparison table) |
||||||
relationship between the aggregation temporality and the choice of | ||||||
synchronous or asynchronous API. Inputs to synchronous | ||||||
Is the API synchronous or asynchronous? Would it be more correct to say instruments are synchronous or asynchronous? |
||||||
(UpDown)Counter instruments are the changes of a Sum aggregation | ||||||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. thought: I find the phrase
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. +1. I would suggest that we avoid "incremental" since it might suggest monotonicity (or we will have to put something like "incremental/decremental"). |
||||||
(i.e., deltas). Inputs to asynchronous (UpDown)Counter instruments | ||||||
are the totals of a Sum aggregation (i.e., cumulatives). | ||||||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. It might be useful to explain (or link to an explanation of) why synchronous Counters are assumed to be deltas vs why async Counters are assumed to be cumulative, to help the reader follow the reasoning of why those things are true (or common). |
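The relationship between the two input forms can be sketched in a few lines. This is only an illustration under stated assumptions: the helper names `deltas_to_cumulative` and `cumulative_to_deltas` and the sample queue-length values are hypothetical and are not part of the OpenTelemetry API. It shows that a sequence of changes (synchronous inputs) and a sequence of totals (asynchronous inputs) carry the same information when the starting value is known.

```python
from itertools import accumulate


def deltas_to_cumulative(deltas, start=0):
    """Running totals of a sequence of per-interval changes."""
    # accumulate(..., initial=start) yields start first, so drop it.
    return list(accumulate(deltas, initial=start))[1:]


def cumulative_to_deltas(totals, start=0):
    """Per-interval changes recovered from a sequence of running totals."""
    previous = [start] + list(totals[:-1])
    return [total - prev for total, prev in zip(totals, previous)]


# Hypothetical synchronous UpDownCounter inputs: changes in queue length.
sync_inputs = [5, -2, 4, -1]

# What an asynchronous UpDownCounter callback would observe at the same moments.
async_inputs = deltas_to_cumulative(sync_inputs)
```

Converting in either direction loses nothing here, which is consistent with the statement that a change of temporality does not change meaning.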
||||||
|
||||||
## Glossary | ||||||
|
||||||
_Meaning_: Metrics API events have a semantic definition that dictates | ||||||
This would be a bit easier to understand if the glossary also had an entry for Metrics API events. |
||||||
the meaning of the event, in particular how to interpret the integer | ||||||
or floating point number value passed to the API. | ||||||
|
||||||
_Interpretation_: How we extract information from metrics data using | ||||||
the semantics of the API and the semantics of the OTLP data points. | ||||||
|
||||||
_Metric instrument_ is a named instrument, belonging to an | ||||||
instrumentation library, declared with one of the OpenTelemetry | ||||||
Metrics API instruments. For the purpose of this text, it is a | ||||||
Counter, an UpDownCounter, or a Gauge. | ||||||
|
||||||
_Metric attributes_ can be applied to Metric API events, which allows | ||||||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. This describes what can be done with metric attributes (can be applied to metric API events), but it does not actually define what metric attributes are. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I would imagine attributes are specific to each metric so the spec wouldn't define what the attributes are. This is what is referred to as labels in the ASCII art in the spec but the agreed-upon name going forward is attributes. |
||||||
interpreting the meaning of events using different subsets of | ||||||
attribute dimensions. | ||||||
|
||||||
_Metric data stream_ is a collection of data points, written by a | ||||||
writer, having an identity that consists of the instrument's name, the | ||||||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Is written by a writer redundant? There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. ...having an identity that consists of the instrument's name What instrument is this? How is this instrument related to the metric data stream? There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Who has an identity that consists of the instrument's name, the instrumentation library, resource attributes, and metric attributes? The Metric data stream or the data points? |
||||||
instrumentation library, resource attributes, and metric attributes. | ||||||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Is this meant to help define an interface/standard or simply saying that a data stream encompasses all of the data and the source? |
||||||
|
||||||
_Metric data points_ are the items in a stream; each has a point | ||||||
Each point has a point kind sounds a bit redundant. I think it would be better to say each point has a kind. |
||||||
kind. For the purpose of this text, the point kind is Sum or Gauge. | ||||||
What does "for the purpose of this text" means here? Are the point kinds going to be different for the purpose of other texts? |
||||||
Sum points have two options: Temporality and Monotonicity. | ||||||
Temporality is previously defined, but not Monotonicity. |
||||||
|
||||||
_Metric timeseries_ is the output of aggregating a stream of data | ||||||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. This is a bit hard to understand without having previous understanding of what aggregating a stream of data is. I think this definition needs another statement. For example, let's define what is an arithmetic sum: sum: is the result of adding two summands. This can be made more clear by adding a direct description of what a sum is: sum: Is a number, the result of adding two summands. Maybe this definition can begin like this? Metric timeseries is a sequence of ... that results from aggregating a stream of data points... |
||||||
points for a specific set of resource and attribute dimensions. | ||||||
|
||||||
## Meaning and interpretation of Counter and UpDownCounter events | ||||||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I don't know if this is what you're going for, but if your main goal is to explain the need for UpDownCounter, it might make sense to omit it from this section and wait to present it later as a solution to the problem described in this doc. If that's the goal, it feels a bit premature to describe it here, because I as a reader don't yet know why it's necessary or useful and I get a bit lost. |
||||||
|
||||||
Counter and UpDownCounter instruments produce Sum metric data | ||||||
points that are taken to have meaning in a metric stream, independent | ||||||
What does that are taken to have meaning in a metric stream mean? |
||||||
of the aggregation temporality, as follows: | ||||||
|
||||||
- Sum points are quantities that define a rate of change with respect to time | ||||||
- Rate-of-change over time combined with a reset time may be used to derive a current total. | ||||||
|
||||||
The rate interpretation is preferred for monotonic Sum points, and the | ||||||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I think the 2 previous points are interpretations, right? If that is the case, referring to one as "the rate interpretation" is a confusing because both points use the word rate. Better to number these interpretations and refer to them to the first or second one. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more.
Suggested change
|
||||||
current total interpretation is preferred for non-monotonic Sum | ||||||
points. Both interpretations are meaningful and useful for both kinds | ||||||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Bit confused here, first this says that one interpretation is preferred for a certain kind of sum point and the other one is preferred for the other kind of sum point. Then it says that both are "meaningful and useful" for both kinds of sum point. Then, why is one preferred over the other? 🤷 |
||||||
of Sum point. | ||||||
|
||||||
Sum points imply a linear scale of measurement. A Sum value that is | ||||||
twice the amount of another actually means twice as much of the | ||||||
variable was counted. Linear interpolation is considered to preserve | ||||||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Sorry, what do you mean with "linear interpolation is considered to preserve meaning"? Do you mean that it is possible to interpolate linearly between two sum points? |
||||||
meaning. | ||||||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. You might consider adding a concrete example of a real life metric to help the reader follow the precise but very technical description above (maybe even a diagram, if it lends itself to that). There are some examples I've seen in other places that you could probably just steal, e.g. https://lightstep.com/blog/opentelemetry-101-what-are-metrics/ I'm thinking that would help people like me who learn more readily through examples. Same below with Gauge events. |
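As a concrete illustration of the two interpretations of Sum points, the following sketch derives a rate from cumulative points, a current total from deltas plus a reset value, and an interpolated value between two points. The function names, the `(timestamp, value)` tuple layout, and the sample data are assumptions made for this example only.

```python
def rate_per_second(points):
    """Average rate of change between the first and last cumulative point.

    `points` is a list of (unix_seconds, cumulative_value) pairs.
    """
    (t0, v0), (t1, v1) = points[0], points[-1]
    return (v1 - v0) / (t1 - t0)


def current_total(reset_value, delta_points):
    """Total obtained by adding every delta to the value at the reset time."""
    return reset_value + sum(value for _, value in delta_points)


def interpolate(p0, p1, t):
    """Linear interpolation between two cumulative points at time t."""
    (t0, v0), (t1, v1) = p0, p1
    return v0 + (v1 - v0) * (t - t0) / (t1 - t0)


# Monotonic Sum (for example, requests served), reported as cumulative points.
requests_total = [(0, 0), (10, 40), (20, 100)]

# Non-monotonic Sum (for example, queue length), reported as delta points.
queue_changes = [(10, 5), (20, -2), (30, 4)]

request_rate = rate_per_second(requests_total)
queue_length = current_total(0, queue_changes)
midpoint_estimate = interpolate(requests_total[0], requests_total[1], 5)
```

The rate is the natural reading of the request count, and the current total is the natural reading of the queue length, although both calculations are valid for either stream.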
||||||
|
||||||
## Meaning and interpretation of Gauge events | ||||||
|
||||||
Gauge instruments produce Gauge metric data points that are taken to | ||||||
have meaning in a metric stream as follows: | ||||||
|
||||||
- Gauge point values are individual measurements captured at an instant in time | ||||||
Aren't sum points also individual measurements captured at an instant in time? |
||||||
- Gauge points record the last known value in a series of individual measurements. | ||||||
|
||||||
Note that these two statements imply different interpretation for | ||||||
synchronous and asynchronous measurements. When recording Gauge | ||||||
values through a synchronous API, the interpretation is "last known | ||||||
value", and when recording Gauge values through an asynchronous API | ||||||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Not sure I buy this. In a synchronous Gauge, recording a value is "Here's the current value". The difference in Async vs. Sync is whether you can have the "current value" or "last known value" be sampled on-demand. i.e. I think the terminology here is being phrased from an SDK/exporter perspective vs. the instrument's perspective (which I assume is closer to a user of metrics API) |
||||||
the interpretation is "current value". | ||||||
Hm, "last known value" and "current value" are very similar. It is hard to understand the difference between synchronous gauge and asynchronous gauge using this concept. |
||||||
|
||||||
The distinction between last known value (synchronous) and current | ||||||
value (asynchronous) is considered not significant in the data model. | ||||||
Hm, but it is significant for the reader of this document to understand the difference between synchronous gauge and asynchronous gauge...
I think you're implicitly saying that when you export metrics, you're setting the Timestamp for a synchronous Gauge to "export time", whereas the actual timestamp was when the synchronous instrument sent the last piece of data. E.g. Reporting Interval
Here, the synchronous gauge is reporting its value at t1 and t2. For the collection interval t0 -> t3, we report the value recorded at t2 BUT you're saying for DELTA + CUMULATIVE it would report its timestamp at t3, whereas an asynchronous instrument would just capture a value at t3. I think this needs some kind of picture. |
||||||
In contrast with Sum points, less can be assumed about Gauge | ||||||
measurements. There is no implied linear scale of measurement, therefore: | ||||||
|
||||||
- Rate-of-change may not be well defined | ||||||
- Ratios are not necessarily meaningful | ||||||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. These statements feel too strong to me. I can think of instances where gauges are very useful in context of rate, ratios, and trend modeling. However, I think it's important to say that Gauges are signed with their own time stamp. Valuable analysis likely needs some data alignment (aggregating by some method to regular time intervals) in order to make meaningful comparisons |
||||||
- Linear interpolation is not necessarily supported. | ||||||
|
||||||
## Attributes are used for interpretation | ||||||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. This section header doesn't quite seem to flow naturally with the text, maybe call this section something |
||||||
|
||||||
Metric attributes enable new ways to interpret a stream of metric | ||||||
data. Metric attributes add information without changing the value of | ||||||
a metric event. Addition and removal of metric attributes can be | ||||||
accomplished safely by applying transformations that preserve meaning. | ||||||
|
||||||
Addition of attributes on a metric event can create new timeseries, by | ||||||
producing a greater number of distinct attribute sets. However, | ||||||
the meaning in the original events is preserved in the complete set of | ||||||
timeseries. | ||||||
|
||||||
Removing attributes from metric streams without changing meaning | ||||||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I'd suggest moving this down below the example, and adding a sentence afterward to draw attention to the problem, e.g. "...which means applying the natural aggregation function to merge metric streams. But how do we know which is the "natural" aggregation function? For Counters the answer is always SUM (for: reason), however for Gauges it might be SUM or MEAN depending on the semantics of the values represented by the particular metric. (for: reason). Therein lies the problem." |
||||||
requires re-aggregation, in general, which means applying the natural | ||||||
aggregation function to merge metric streams. | ||||||
|
||||||
For example, any metric event with no attributes: | ||||||
|
||||||
``` | ||||||
gauge.Set(value) | ||||||
``` | ||||||
|
||||||
can be extended by a new attribute, without changing its meaning or | ||||||
altering any existing interpretation: | ||||||
|
||||||
``` | ||||||
gauge.Set(value, { 'property': this.property }) | ||||||
``` | ||||||
|
||||||
## New measurements: Counter and UpDownCounter instruments | ||||||
|
||||||
Sum points have been defined to have linear scale of measurement, | ||||||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I had some trouble deciphering the phrase "linear scale of measurement". I understand the concept, after reading the rest of the paragraph, but I wonder if there's a different way to say this, maybe by adding an explanatory aside, something like this (although it probably doesn't use the right terminology) "Sum points have been defined to have linear scale of measurement, meaning the same Sum point value could be obtained through many different combinations of metric event values. This property can also be applied in reverse, meaning that Sum points can be subdivided." |
||||||
therefore Sum points can be subdivided. A single Counter event can be | ||||||
logically replaced by multiple Counter events having an equal sum. | ||||||
This property allows the producer of metric events to introduce new | ||||||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. maybe: "introduce new measurements" => "introduce new attributes that subdivide measurements"? (maybe that's the wrong terminology though) |
||||||
measurements, while preserving existing interpretation. | ||||||
|
||||||
For example, it is reasonable to replace a single Counter event adding `x+y`: | ||||||
|
||||||
``` | ||||||
counter.Add(x+y) | ||||||
``` | ||||||
|
||||||
with separate counter events and one additional attribute: | ||||||
|
||||||
``` | ||||||
counter.Add(x, { 'property': 'X' }) | ||||||
counter.Add(y, { 'property': 'Y' }) | ||||||
``` | ||||||
|
||||||
This property for Sum points makes it possible to configure an | ||||||
instrumentation library with or without subdivided Sums and to | ||||||
meaningfully aggregate data with a mixture of attributes. | ||||||
|
||||||
## New measurements: Gauge instruments | ||||||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Nice! This section is super clear and easy to understand. nit: "New measurements" => "Adding new measurements" for clarity? Same below. |
||||||
|
||||||
Gauge instruments, unlike Counter instruments, cannot be subdivided. | ||||||
Multiple Gauge measurements cannot be meaningfully combined using | ||||||
addition. In the time dimension, Gauge instrument events are | ||||||
aggregated by taking the last value. | ||||||
|
||||||
The same aggregation can be applied when removing attributes from | ||||||
metric streams forces reaggregation. The most current value should be | ||||||
selected. In case of identical timestamps, a random value should be | ||||||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. The random value sample for time collisions is an important decision that should be clearly documented. There's a lot of behavior that surfaces when this happens that is difficult to understand. I do think it's a good solution, just needs to be clearly documented. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. nit: |
||||||
selected to preserve the meaning of the Gauge. | ||||||
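A minimal sketch of this rule follows, assuming Gauge points are represented as `(timestamp, value)` tuples. The function name `merge_gauge_points` and the sample readings are hypothetical. The most recent point wins, and a tie on the timestamp is broken by a random choice among the tied points.

```python
import random


def merge_gauge_points(points, rng=random):
    """Return the most recent (timestamp, value) point.

    When several points share the latest timestamp, one of them is
    chosen at random, as described in the text above.
    """
    latest = max(timestamp for timestamp, _ in points)
    candidates = [point for point in points if point[0] == latest]
    return rng.choice(candidates)


# Hypothetical readings from several streams being merged into one.
wheel_speeds = [(100, 29.8), (101, 30.1), (101, 30.3), (99, 29.5)]

# A seeded generator is passed only to make the example repeatable.
merged = merge_gauge_points(wheel_speeds, random.Random(0))
```

Either of the two points stamped `101` is an acceptable result, because each is a genuine observation of the variable.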
|
||||||
For example, a Gauge for expressing a vehicle's speed relative to the | ||||||
ground can be expressed either as the speed of its midpoint or by an | ||||||
independent measurement of the speed of each wheel. | ||||||
|
||||||
``` | ||||||
speedGauge.Set(vehicleSpeed) | ||||||
``` | ||||||
|
||||||
This can be replaced by one Gauge per wheel, since wheel speed and | ||||||
vehicle speed each define vehicle speed relative to the ground: | ||||||
|
||||||
``` | ||||||
for i := 0; i < 4; i++ { | ||||||
speedGauge.Set(wheelSpeed[i], { 'wheel': i }) | ||||||
} | ||||||
``` | ||||||
|
||||||
This form of Gauge rewrite is generally useful to capture additional | ||||||
measurements by creating distinct metric streams. | ||||||
|
||||||
## Meaning-preserving attribute erasure | ||||||
|
||||||
Several rules for rewriting metric events that preserve meaning have | ||||||
been shown above, focused on introducing new attributes and new | ||||||
measurements in ways that do not change existing meaning or alter existing | ||||||
interpretations. | ||||||
|
||||||
Removing attributes from metric events does not, by definition, change | ||||||
their meaning, since attributes are interpreted as event selectors. | ||||||
Removing attributes from aggregated streams of OpenTelemetry Metrics | ||||||
data requires attention to the meaning being conveyed. | ||||||
|
||||||
Safe attribute erasure for OpenTelemetry Metrics streams is specified | ||||||
in a way that preserves meaning while removing only the forms of | ||||||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. What if "removing the forms of interpretation that made use of the erased attribute" basically is destroying the practical value of the metric? I think this is implying that removing an attribute from a metric is an ok thing to do (it' s not). The meaning that IS preserved is relevant to reaggregation + processing, but could be disastrous to dashboards and users. |
||||||
interpretation that made use of the erased attribute. | ||||||
|
||||||
_Reaggregation_ describes the process of combining OpenTelemetry | ||||||
metric streams. For reaggregation to preserve meaning, Sum points | ||||||
must be combined by adding the inputs and Gauge points must be | ||||||
combined by selecting the last or random value. | ||||||
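Attribute erasure with reaggregation might look like the sketch below. Everything here is an assumption made for illustration: streams are modeled as dictionaries keyed by a tuple of `(key, value)` attribute pairs, Sum streams hold a number, and Gauge streams hold a `(timestamp, value)` point. For brevity, Gauge ties on the timestamp are resolved by keeping the first point seen rather than a random one.

```python
from collections import defaultdict


def erase_attribute(streams, key):
    """Group stream points by their attribute set with `key` removed."""
    groups = defaultdict(list)
    for attributes, point in streams.items():
        remaining = tuple(item for item in attributes if item[0] != key)
        groups[remaining].append(point)
    return dict(groups)


def reaggregate_sum(streams, key):
    """Sum points are combined by adding the inputs."""
    grouped = erase_attribute(streams, key)
    return {attrs: sum(points) for attrs, points in grouped.items()}


def reaggregate_gauge(streams, key):
    """Gauge points are combined by keeping the most recent point."""
    grouped = erase_attribute(streams, key)
    return {
        attrs: max(points, key=lambda point: point[0])
        for attrs, points in grouped.items()
    }


sum_streams = {
    (("host", "a"), ("property", "X")): 3,
    (("host", "a"), ("property", "Y")): 4,
    (("host", "b"), ("property", "X")): 10,
}
gauge_streams = {
    (("wheel", 0),): (100, 29.8),
    (("wheel", 1),): (101, 30.1),
}

sums_without_property = reaggregate_sum(sum_streams, "property")
gauge_without_wheel = reaggregate_gauge(gauge_streams, "wheel")
```

Erasing `property` from the Sum streams recovers the per-host counts that would have been recorded had the attribute never been introduced, which is the sense in which erasure reverses subdivision.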
|
||||||
Note that erasure of attributes is defined so that it reverses the | ||||||
effect of introducing new measurements, and meaning is preserved in | ||||||
both directions. This explains the definition for default | ||||||
aggregations that should be applied when re-aggregating OpenTelemetry | ||||||
metrics streams. Sum streams are re-aggregated to preserve the | ||||||
implied rate, while Gauge points are re-aggregated to preserve the | ||||||
implied distribution of individual values. | ||||||
|
||||||
## Conveying meaning to the user | ||||||
|
||||||
OpenTelemetry states a requirement separating the API from the | ||||||
implementation, and to do so we have defined the meaning of metrics | ||||||
API events. To preserve meaning through stages of reaggregation, we | ||||||
have specified distinct default aggregation rules for Counter and | ||||||
Gauge streams. | ||||||
|
||||||
When attributes are used with Counter and Gauge instruments, every | ||||||
distinct combination of attribute values determines a separate | ||||||
OpenTelemetry metrics stream, and each stream conveys meaning | ||||||
independently. Because meaning is independent from the attributes | ||||||
used, the user may wish to disregard some attributes when interpreting | ||||||
a stream of metrics, restricting their attention to a subset of | ||||||
attributes. | ||||||
|
||||||
In database systems, this process is referred to as performing a | ||||||
"Group-By", where aggregation is used to combine streams within each | ||||||
distinct set of grouped attributes. For the benefit of OpenTelemetry | ||||||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. This kind of falls over for me in practice. Database Group-By is a full query language where NORMALLY you need to provide your own Aggregation semantic for every column of data or the query will fail. What we're suggesting here is instead of:
The default aggregation function is inferred by otel based on the metric type. Two things:
My $.02 here is the focus on giving users a way to solve "rewrite rules" use cases is good. Making it as easy as possible is good. If we can't explain what is and isn't safe in very simple terms, we might be in trouble. If the "meaning" we retain isn't the one users wanted to retain, then we're not really adding value. |
||||||
users, Metrics systems are encouraged to choose a meaning-preserving | ||||||
aggregation when grouping metric streams to convey meaning to the | ||||||
user. | ||||||
|
||||||
When conveying meaning to the user by grouping and aggregating over a | ||||||
subset of attribute keys, the default aggregation selected should be | ||||||
one that preserves meaning. For monotonic Counter instruments, this | ||||||
means conveying the combined rate of each group. For UpDownCounter | ||||||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Why does monotonic counter instruments turn into a rate? Did I miss an above description on how UpDown vs. Counter have different aggregation meaning? There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Maybe this description? |
||||||
instruments, this means conveying the combined total of each group. | ||||||
For Gauge instruments, this means conveying the combined distribution | ||||||
of each group. | ||||||
|
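The default rules described above can be sketched in a few lines of Python. This is only an illustration and not an OpenTelemetry API: the helper names `group_by` and `default_aggregate`, the `(attributes, value)` point shape, the 60-second interval, and the use of count/min/max/mean as a stand-in for a distribution are all assumptions made for the example.

```python
from collections import defaultdict


def group_by(points, keep_keys):
    """Group (attributes, value) points by the attribute keys in keep_keys."""
    groups = defaultdict(list)
    for attrs, value in points:
        key = tuple(sorted((k, v) for k, v in attrs.items() if k in keep_keys))
        groups[key].append(value)
    return dict(groups)


def default_aggregate(kind, values, interval_seconds=60.0):
    """Hypothetical default aggregation for each instrument kind."""
    if kind == "counter":
        # Monotonic Counter: combined rate of the group.
        return sum(values) / interval_seconds
    if kind == "updowncounter":
        # UpDownCounter: combined total of the group.
        return sum(values)
    if kind == "gauge":
        # Gauge: summarize the distribution instead of adding readings together.
        return {
            "count": len(values),
            "min": min(values),
            "max": max(values),
            "mean": sum(values) / len(values),
        }
    raise ValueError(f"unknown instrument kind: {kind}")


# Made-up points carrying two attributes; grouping keeps only "host".
points = [
    ({"host": "a", "path": "/x"}, 30),
    ({"host": "a", "path": "/y"}, 90),
    ({"host": "b", "path": "/x"}, 60),
]
by_host = group_by(points, keep_keys={"host"})
host_a = by_host[(("host", "a"),)]
```

The same grouped values for host `a` yield a rate, a total, or a distribution summary depending only on the instrument kind, which is the sense in which the instrument selects the default aggregation.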
||||||
## Choice of UpDownCounter or Gauge | ||||||
|
||||||
The OpenTelemetry UpDownCounter instrument resembles the Gauge | ||||||
instrument, but streams generated from these instruments apply | ||||||
different aggregation rules by default. The choice of instrument | ||||||
should be made to ensure that the default aggregation rule preserves | ||||||
meaning, as that is the point of these definitions. | ||||||
|
||||||
Anecdotal evidence from Gauge instruments in existing systems | ||||||
suggests that a significant majority of Gauges should be written as | ||||||
UpDownCounters in OpenTelemetry. Examples are given below. | ||||||
|
||||||
### UpDownCounter measurements | ||||||
|
||||||
UpDownCounter instruments are used for capturing quantities, where | ||||||
typical examples include: | ||||||
|
||||||
- Queue size | ||||||
- Memory size | ||||||
- Cache size | ||||||
- Active requests | ||||||
- Live object count | ||||||
|
||||||
To test that these quantities are suitable UpDownCounter measurements, | ||||||
verify that adding two inputs together logically produces another of | ||||||
the same type and scale of measurement. A queue size plus a queue | ||||||
size yields a queue size, for example; add one count of live objects | ||||||
to another, and you have a count of live objects. By choosing the | ||||||
UpDownCounter, developers ensure that the meaning conveyed is a sum, | ||||||
which ensures the correct rate interpretation. | ||||||
|
||||||
When interpreting total sums aggregated from UpDownCounter | ||||||
instruments, it is important to consider the set of contributing | ||||||
attributes, which determine the scale of measurement. If one server | ||||||
outputs UpDownCounter data in two attribute dimensions while another | ||||||
uses three attribute dimensions, the mean value is not a meaningful | ||||||
quantity. The process of correcting mixed attribute dimensions for | ||||||
cumulative sums is referred to as _dimensional alignment_. | ||||||
|
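A minimal sketch of dimensional alignment follows, assuming a hypothetical helper `drop_attributes` and made-up queue-size readings. For brevity the two servers differ by one attribute (`priority`) rather than the two-versus-three dimensions mentioned above. It is not taken from any OpenTelemetry SDK.

```python
from collections import defaultdict


def drop_attributes(points, drop_keys):
    """Remove drop_keys from each point and sum values that now share attributes."""
    totals = defaultdict(int)
    for attrs, value in points:
        kept = frozenset((k, v) for k, v in attrs.items() if k not in drop_keys)
        totals[kept] += value
    return dict(totals)


# Hypothetical queue-size readings (UpDownCounter) from two servers.
server_a = [({"queue": "jobs"}, 10)]
server_b = [
    ({"queue": "jobs", "priority": "high"}, 3),
    ({"queue": "jobs", "priority": "low"}, 5),
]

# Align both servers on the shared attribute set before combining them.
aligned_a = drop_attributes(server_a, {"priority"})
aligned_b = drop_attributes(server_b, {"priority"})

jobs = frozenset({("queue", "jobs")})
mean_per_server = (aligned_a[jobs] + aligned_b[jobs]) / 2

# Without alignment, averaging the raw points mixes two scales of measurement.
raw = server_a + server_b
naive_mean = sum(value for _, value in raw) / len(raw)
```

After alignment the mean is a per-server queue size (9.0 here). The naive mean over raw points (6.0) averages one whole queue with two partial queues, so it describes neither.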
||||||
### Gauge measurements | ||||||
|
||||||
Gauge instruments are used for capturing physical measurements, | ||||||
calculated ratios, and results of function evaluation. For example: | ||||||
|
||||||
- CPU utilization | ||||||
- CPU temperature | ||||||
- Fan speed | ||||||
- Water pressure | ||||||
- Success/failure ratio | ||||||
|
||||||
To test that these are suitable Gauge measurements, verify that adding | ||||||
two inputs together does not logically produce a measurement of the | ||||||
same type. | ||||||
|
||||||
A CPU utilization plus a CPU utilization cannot meaningfully be used | ||||||
as a measure of CPU utilization; it is just the sum of two CPU | ||||||
utilizations. | ||||||
|
||||||
A fan speed plus a fan speed has the correct units (a fan speed), but | ||||||
the result is not a meaningful quantity. Two fans spinning at one | ||||||
speed are not the same as one fan spinning at twice the speed. | ||||||
|
||||||
In some of these cases, it may be logically possible, yet impractical, | ||||||
to use one or more Counter instruments in place of Gauges. CPU | ||||||
utilization can be derived from a usage Counter. Fan speed can be | ||||||
derived from a revolution Counter. | ||||||
|
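The derivation mentioned above can be illustrated as a rate computed from two samples of a cumulative Counter. The function name `rate_from_counter`, the `(timestamp_seconds, cumulative_value)` sample shape, and the sample numbers are assumptions for this sketch, which also assumes a single CPU core.

```python
def rate_from_counter(previous, current):
    """Return the per-second rate between two (timestamp_seconds, cumulative_value) samples."""
    (t0, v0), (t1, v1) = previous, current
    elapsed = t1 - t0
    if elapsed <= 0:
        raise ValueError("samples must be in increasing time order")
    return (v1 - v0) / elapsed


# CPU utilization from a cumulative CPU-seconds Counter (one core assumed).
cpu_utilization = rate_from_counter((100.0, 40.0), (110.0, 45.0))

# Fan speed in revolutions per minute from a cumulative revolution Counter.
fan_rpm = rate_from_counter((100.0, 12000), (110.0, 12300)) * 60
```

In this form the Gauge-like value is computed by the consumer, while the instrument itself records only an additive, monotonic quantity.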
||||||
## Summary | ||||||
|
||||||
The OpenTelemetry Metrics data model supports addition and removal of | ||||||
attributes in a way that preserves meaning. This design gives | ||||||
developers the ability to introduce new attributes in a safe way. | ||||||
|
||||||
OpenTelemetry metrics developers are asked to consider whether they | ||||||
want an UpDownCounter or Gauge when making asynchronous measurements, | ||||||
and they should make this decision based on whether the default | ||||||
aggregation rule for UpDownCounter or Gauge preserves meaning. This | ||||||
decision comes down to whether attributes are meant to subdivide a Sum | ||||||
point or qualify a Gauge point. | ||||||
|
||||||
The default aggregation rules for OpenTelemetry metrics data points | ||||||
ensure that meaning is preserved when removing attributes from a | ||||||
stream of metrics data. The rules for reaggregation specify that | ||||||
attributes should be safely removed before aggregating with other | ||||||
metrics that are missing the same attributes, a process referred to as | ||||||
dimensional alignment. | ||||||
|
||||||
This design allows optional attributes to be included by the SDK in | ||||||
metric data when they are available, such as those extracted from | ||||||
TraceContext Baggage, in ways that consumers of the metrics data can | ||||||
interpret correctly. | ||||||
|
||||||
Having the ability to automatically remove attributes without changing | ||||||
the meaning of Counter, UpDownCounter, and Gauge metrics API events | ||||||
makes it possible for OpenTelemetry collectors to be configured with | ||||||
re-aggregation rules, which can be managed by users in order to limit | ||||||
collection costs. |
Hm, adding an attribute to what?
There are some words here that are a bit hard to understand: "convey meaning", "interpretive outcome".
Do you mean something like this?
The meaning of adding an attribute of a Counter instrument to (something) is not the same as adding an attribute of a Gauge instrument to (something).