-
Notifications
You must be signed in to change notification settings - Fork 893
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Does OpenTelemetry need "enum" metric type? #1711
Comments
+1 for adding first-class support for enumerated values (or a String-valued gauge). I have some internal customers trying to model this in their applications. Currently they are setting the enumerated value as an attribute and setting the metric value to 0 or 1 depending on which enum value is active. One of the downsides of the "value-as-attribute" approach is that it bloats the volume of metrics flowing through the pipeline (in each collection interval, all but one of the metrics have a An alternative pattern would be to encode each enumerated value as a numeric integer. This has the downside that the metric is not self-describing and you'd need additional context to decode the integer value in the sink. In OpenMetrics, enums can be encoded using the more general StateSet type which uses a bit-vector to encode the values. StateSets represent a series of related boolean values, also called a bitset.
|
+1 We're developing a hardware monitoring solution in OpenTelemetry and many of the components report a status that is an enumeration (e.g. simply "OK", "Degraded", or "Failed"). Using the "value-as-attribute" approach is non-sense when the enumeration has many possible values. A simple implementation would be to map each string value to an integer value and describe this mapping in the unit in a standard way. Example: This way the current specification and implementation of metrics don't need to change much and it's the UI's responsibility to translate the integer value into the corresponding string value (very much like it's already the UI's responsibility to convert bytes to GB to display something that the end user can read easily). |
Keep in mind that the unit field is limited to 63 characters which is extremely limiting for an enum like this. Also, this not-quite-json would require a custom parser if a backend wanted to display it more clearly and a more machine readable true JSON format would be more annoying for users of systems that don't do any parsing. If we don't want to have a proper enum type, I would argue that a simple counter/gauge/etc with attribute |
I agree that the Using an Attribute is not ideal either. With a long Enum (say like the very common OperatingStatus property of the CIM_ManagedSystemElement WBEM class), you would need to send:
...just to say that the managed system element is "Servicing". In terms of storage, memory and network, it's very inefficient. Also, it allows invalid states where several values are "enabled" at once, while only one is possible at the same time (the service can't be both "Starting" and "Stopping" at the same time. DMTF's CIM model has had ValueMap from the beginning (see page 89 - ValueMap in CIM specifications). Heck, SNMP has had enumeration types since for more than a couple decades. See RFC-2579 which defines textual conventions syntax:
My conclusion is:
Note: In the meantime, assuming OpenTelemetry adds an Enum value type in the future, I would recommend that developers use a single Gauge int metric to represent mutually exclusive enums, as it will be much easier to convert later a single int metric to an Enum. |
Thanks for elaborating on this. The example you have given above feels a bit more like an Event to me (implying a state transition). Maybe there is a entirely different way of modeling this that does not necessarily involve metrics if performance is critical? Some considerations regarding using a gauge:
I am not sure I understand this. Are you suggesting that the int-to-string mapping information needs to be transferred with the enum data type during metrics message export? If this is the case, I am not sure if sending the different data points as labels would really increase the size of the message a lot, since we transfer all of the strings anyway (just already separated instead of as a description that needs to be parsed by the backend), and I think adding a few more integers would not really increase the relative size of the message a lot. Also: can the enum definition change over time? What happens if different services report different mappings - which one is given priority? |
Sorry for the late reply, I did not receive the notification... Thank you for the questions and comments!
You're right, metrics or events can be used to report "discrete" values (the status of a component for example), they both have pros and cons. In the end, you still need a proper Enum value to represent the current status (metric) or the new status (event).
Grafana Dashboards already has a "State timeline" panel designed just for that. The mapping in this case needs to be specified in the UI. If the metric itself could indicate the mapping in the future, that would make everybody's life much easier. The timeseries database storing the metrics doesn't need to handle Enum metrics as a specific type. They can process them just as simple integers. The value mapping could be stored as metadata. Spatial aggregation should prefer max() and min() functions, rather than average().
True. To avoid this, an enum type could be defined as a subtype of Gauge, like EnumGauge. The only difference between EnumGauge and a simple Gauge is the value mapping (instead of the Unit field). Not sure this is feasible in Go. Also, breaking data continuity doesn't really happen in OTLP itself, but rather in the exporters. If the Prometheus exporter handled EnumGauge just like Gauge (as numbers), it wouldn't make any difference in Prometheus (no break in data continuity). The same cannot be said if we have Attributes describing the states. They end up as different labels in Prometheus and thus different timeseries.
I'm simply saying that it's the exporter's responsibility to adapt the OpenTelemetry metrics to the receiving end. In the example of Prometheus, which doesn't support Enum values, the exporter would simply export the int value, and maybe the mapping in the metric description (in If a timeseries backend supports Enum values natively in the future, the corresponding exporter may have the option to export OTLP's EnumGauge as the corresponding native Enum.
Good question! The value mapping of an Enum must not change over time, very much like the unit of a metric is not supposed to change over time. EnumGauge DataPoints can be send in bulk, without resending the value mapping (just like we don't need to resend the Gauge unit for each DataPoint). |
As a matter of fact (if I'm not mistaken 😅), Grafana doesn't have any visualization that can display StateSet metrics, where the different states are represented with an Attribute (label). See this question on StackOverflow links to a 3rd-party plugin to do that (but it's not made by Grafana). I haven't seen anything that could handle this on Datadog either. |
Thanks for your reply. I think I see your point about the enum-as-gauge (and please correct me if I am wrong): It seems that using attributes could lead to (potentially unwanted) StateSet semantics, and it leads to having to export zeros for states that the device is currently not in (at least the ones that were already seen before). Using the Gauge is an efficient, but much less user friendly alternative. Also, using a Gauge currently does not allow adding the mapping on the client, since there is no way of transmitting that mapping. One would have to add it on the server side manually (as is the case in Grafana, if I understand correctly). Am I missing something here? The upside of using attributes would be that you could easily sum the counts from different producers and see stats like "how many disks are in a running state, how many disks fail on average per hour?". This is possible today and would not require any changes to the protocol. I think I am more familiar with this representation, because it better resembles my understanding of metrics (being able to "do math" with them), while this is not the case for the gauge representation. Either way, I don't really see a way around breaking semantic conventions when introducing an Enum type later. If I am not mistaken, there is no inheritance in Protobuf, so in OTLP the Enum would have to be a separate type. That means that unless the consumer decides to explicitly convert OTLP Enums to whatever internal representation they used before, there will be two separate data streams. I see your point about Prometheus in that regard, and that it should be possible to do that easily there, but I am not sure about a lot of the other consumers. |
I understand your point regarding the ability to do math with the metrics, where the states described with an attribute is much more user friendly. Also at this point, I think it is unlikely we will see timeseries server vendors embrace a new Enum type. Even though I think I have a ton of excellent reasons to want this EnumGauge, I'll have to cave in and do with what we have now. I'll update the hardware-metrics semantic convention to reflect that. Thank you for the discussion! |
There is a lengthy discussion about this topic in #2518 (comment). In that comment, I propose that OTel is better off having string-valued attributes and 0/1-valued UpDownCounters to convey state sets. With more specification work, it should be possible to avoid transmitting zeros for inactive states. For example, could we define a a data point flag set on the 1-valued attribute set that marks prior values for the same attribute keys as stale? |
I agree with @jmacd here, +1 for string-valued attributes on counters. |
@jmacd if you think that is the right call, let's create a PR to add this to the specification, also we may need to update the openmetrics compatibility (which seem to suggest already what you proposed https://github.com/open-telemetry/opentelemetry-specification/blob/main/specification/metrics/datamodel.md#stateset) |
We've ran into this issue ourselves and I actually have a question here to the folks in the thread on how to properly map a 4-state enum into a counter. Consider a "worker" abstraction that executes "tasks". We want to keep track of how many tasks there are and be able to see how many exist in each state, as well as how many there are for a given "task name". Say there are 4 states:
We'd have a
This would "signal" that we now have one (new) task, called "task1", with the status My question now is what happens with state transitions... say we now want to move that task, from
"1 for However, this is not making much sense to me. If I report transitions this way, wouldn't I still have "1 task1 instance in Waiting state", from the previous measurement, when summing the counter values? Say, at this exact point in time, I want to query: "how many tasks of type "task1" in Wouldn't it make more sense to represent that "a task left
By decrementing with the This would of course require the use of a Am I missing something obvious? It would make sense to me if the |
Rephrasing my comment open-telemetry/opentelemetry-collector-contrib#27617 (comment) here One other example on why the "encoding enum as numeric value" formulation is problematic... Many metrics backends will store pre-aggregated time-based rollups to reduce storage and increase performance for queries over longer time intervals. The "enum-as-numeric-value" formulation completely breaks semantics for these rollups. Consider these two scenarios of an hour's worth of Formulation 1 - encoding phase as metric value:
Formulation 2 - encoding phase as dimension:
With
At query time, most frontends will apply an
This gives the false impression that the pod was in Contrast this with the suggested "value-as-attribute" patterh
With this encoding, the query on the rolled-up data with
|
@sirianni The trouble with using the value-as-attribute model is that it causes excessive numbers of time series to be unnecessarily maintained for what is really one datapoint. For 5 states, you're sending 4 unnecessary datapoints constantly, and usually repeating all their common attributes for each datapoint. This is a huge waste of bandwidth and storage especially if one state is the norm most of the time. For example if I want a metric for postgres To avoid having to send all the useless 0s the metric emitter must know the scrape interval and keep track of recent states, so it can send 0-valued tombstone datapoints for recent-past states for at least a couple of scrape intervals or until the expiry of the target tsdb's staleness threshold. That's ugly to push down onto every metric emitter. What is IMO needed here is the concept of a non-cardinal attribute - a data-only attribute that is ignored for purposes of determining time-series uniqueness. So in your example above, if
and the tsdb would know that all This is a bit like a database "covering index" where additional attributes are encoded in the index storage, but not usable for search criteria. Either this, or true string-valued datapoints are IMO badly needed. It's amazing how difficult it is to represent simple enumerations of states in metrics. |
The approach used to represent enum metric here is similar to one discussed in [1, 2]. It allows to support a new status later, if required. For example, one can visualize it with hack like [3]. 1. prometheus/client_python#416 2. open-telemetry/opentelemetry-specification#1711 3. https://stackoverflow.com/a/75761900/11646599 Part of tarantool/grafana-dashboard#224
The approach used to represent enum metric here is similar to one discussed in [1, 2]. It allows to support a new status later, if required. For example, one can visualize it with hack like [3]. 1. prometheus/client_python#416 2. open-telemetry/opentelemetry-specification#1711 3. https://stackoverflow.com/a/75761900/11646599 Part of tarantool/grafana-dashboard#224
The approach used to represent enum metric here is similar to one discussed in [1, 2]. It allows to support a new status later, if required. For example, one can visualize it with hack like [3]. 1. prometheus/client_python#416 2. open-telemetry/opentelemetry-specification#1711 3. https://stackoverflow.com/a/75761900/11646599 Part of tarantool/grafana-dashboard#224
The approach used to represent enum metric here is similar to one discussed in [1, 2]. It allows to support a new status later, if required. For example, one can visualize it with hack like [3]. 1. prometheus/client_python#416 2. open-telemetry/opentelemetry-specification#1711 3. https://stackoverflow.com/a/75761900/11646599 Part of tarantool/grafana-dashboard#224
The approach used to represent enum metric here is similar to one discussed in [1, 2]. It allows to support a new status later, if required. For example, one can visualize it with hack like [3]. 1. prometheus/client_python#416 2. open-telemetry/opentelemetry-specification#1711 3. https://stackoverflow.com/a/75761900/11646599 Part of tarantool/grafana-dashboard#224
The approach used to represent enum metric here is similar to one discussed in [1, 2]. It allows to support a new status later, if required. For example, one can visualize it with hack like [3]. 1. prometheus/client_python#416 2. open-telemetry/opentelemetry-specification#1711 3. https://stackoverflow.com/a/75761900/11646599 Part of tarantool/grafana-dashboard#224
The approach used to represent enum metric here is similar to one discussed in [1, 2]. It allows to support a new status later, if required. For example, one can visualize it with hack like [3]. 1. prometheus/client_python#416 2. open-telemetry/opentelemetry-specification#1711 3. https://stackoverflow.com/a/75761900/11646599 Part of tarantool/grafana-dashboard#224
possibly related https://github.com/open-telemetry/semantic-conventions/blob/main/docs/hardware/common.md#metric-hwstatus:
|
Even though StateSet is defined in OpenMetrics, none of the Prometheus SDKs (at least that I'm aware of) implements it. Prometheus itself is not capable of parsing StateSet, even if content-type negotiation results in OpenMetrics format. If I had to guess a reason for this, it's because there's little motivation for it besides saying "We implement all OpenMetrics features". In Prometheus, Gauges can easily represent what StateSets are meant for and they are represented similarly to this comment: #1711 (comment) I can definitely agree it's not the most efficient way to handle this use case, Prometheus is open for discussions prometheus/prometheus#9139 :) |
How are users expected to encode 'enumerated' values in OpenTelemetry? Could be a gauge w/ string value?
We need to figure this out.
The text was updated successfully, but these errors were encountered: