Semantic conventions for telemetry pipeline monitoring #238

Conversation
The proposed metric names are:

`otel.{station}.received`: Inclusive count of items entering the pipeline at a station.
Apologies if this has been discussed before. Having wildcards in the name seems like a bad precedent to set, and makes constructing dashboards, etc. more difficult. It seems OK to have two versions of this, one for SDK and one for collector, but labels seem like a much better solution if this can actually be any string value.
For the semantic-conventions repo, I would list three standard station names and give guidance on how to create more station names that follow the pattern. I don't think this will happen often, but I think it should be clear that automatic pipeline monitoring will break if ever the same metric name counts items at more than one location in a pipeline.
(This uniqueness requirement is also the reason why I am not proposing to count the number of items that enter and exit each processor -- because then a chain of processors would require per-processor metrics for automatic monitoring to work. For this reason, it makes sense (as proposed here IMO) to count only the items dropped by processors.)
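To make the naming question concrete, here is a minimal Go sketch (not part of the proposal) contrasting the two shapes under discussion: station-specific metric names versus one metric with a station attribute. The names `otel.collector.received`, `otel.pipeline.received`, and the `otel.station` attribute key are invented for illustration only.

```go
package main

import (
	"context"

	"go.opentelemetry.io/otel"
	"go.opentelemetry.io/otel/attribute"
	"go.opentelemetry.io/otel/metric"
)

func main() {
	ctx := context.Background()
	meter := otel.Meter("pipeline-monitoring-example")

	// Shape A (as proposed): the station appears in the metric name, so each
	// station gets its own instrument.
	sdkReceived, _ := meter.Int64Counter("otel.sdk.received")
	colReceived, _ := meter.Int64Counter("otel.collector.received") // hypothetical name
	sdkReceived.Add(ctx, 100, metric.WithAttributes(attribute.String("otel.signal", "traces")))
	colReceived.Add(ctx, 100, metric.WithAttributes(attribute.String("otel.signal", "traces")))

	// Shape B (reviewer's alternative): a single instrument, with the station
	// carried as an attribute instead of a wildcard in the name.
	received, _ := meter.Int64Counter("otel.pipeline.received") // hypothetical name
	received.Add(ctx, 100, metric.WithAttributes(
		attribute.String("otel.signal", "traces"),
		attribute.String("otel.station", "sdk"), // hypothetical attribute key
	))
}
```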
When there is more than one component of a given type active in a
pipeline having the same `domain` and `signal` attributes, the `name`
should include additional information to disambiguate the multiple
instances using the syntax `<type>/<instance>`. For example, if there
In the collector, this is not necessarily enough to differentiate all components in a pipeline, because we may have different components with the same type, e.g. an `otlp` receiver and an `otlp` exporter in the same pipeline.
Should this be addressed here, or, if the problem is unique to the collector, should the collector adopt a strategy to incorporate the class into the type?
Good question. Since I've heavily updated this document and plan to present it in tomorrow's Collector SIG, I'm going to leave it open. As designed, it should be clear that `received` metrics are from receivers and `exported` metrics are from exporters. I want to say that everything else is a processor, but I realize this is a thorny question.
PipelineLossRate = LastStageTotal{success=false} / FirstStageTotal{*}
```

Since total loss can be calculated with only a single timneseries per
Suggested change: "timneseries" → "timeseries".
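As a worked illustration of the quoted formula (numbers invented): if 10,000 items entered the first station and the last station recorded 250 items with `success=false`, the end-to-end loss rate is 250 / 10,000 = 2.5%. A trivial Go sketch of the same arithmetic:

```go
// Hypothetical numbers only, illustrating the PipelineLossRate formula above.
package main

import "fmt"

func main() {
	firstStageTotal := 10000.0 // all items counted entering the first station
	lastStageFailed := 250.0   // items counted with success=false at the last station

	lossRate := lastStageFailed / firstStageTotal
	fmt.Printf("pipeline loss rate: %.1f%%\n", lossRate*100) // 2.5%
}
```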
#### Practice of error suppression

There is a accepted practice in the OpenTelemetry Collector of
Suggested change: "There is a accepted practice" → "There is an accepted practice".
@jack-berg @carlosalberto can we mark this as triaged with priority p1?
the pipeline should match the number of items successfully received,
otherwise the system is capable of reporting the combined losses to
the user.
This is a vital paragraph for understanding why, by default, there are different levels of detail for each station. Putting an end-to-end example here, as you did with https://github.com/jmacd/oteps/blob/jmacd/drops/text/metrics/0238-pipeline-monitoring.md#pipeline-monitoring-diagram, would work great (albeit with numbers/metrics instead of charts).
To summarize my other comments and provide a suggestion:
- Per-component metrics are valuable and, if anything, should be enhanced.
- The notion of diffing values between stations does not seem to allow sufficient accuracy in describing the way data may actually flow through a collector, because it assumes linearity of data flow.

What seems like a better solution would be to add metrics which directly describe the aggregate behavior of the processors in each of the collector's pipelines (see the sketch after this list):
1. The total number of items flowing into the first processor.
2. The total number of items added by the processors.
3. The total number of items dropped by the processors.
4. (Implied by 1-3) The total number of items flowing out from the last processor.
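A minimal sketch of the identity implied by these four aggregate metrics, with invented names and numbers (not part of the proposal):

```go
// Hypothetical illustration only: the conservation relation implied by the
// four aggregate processor metrics suggested above.
package main

import "fmt"

func main() {
	itemsIn := int64(1000)     // 1. items flowing into the first processor
	itemsAdded := int64(50)    // 2. items added by the processors
	itemsDropped := int64(200) // 3. items dropped by the processors

	// 4. (implied) items flowing out from the last processor
	itemsOut := itemsIn + itemsAdded - itemsDropped
	fmt.Println("items out of last processor:", itemsOut) // prints 850
}
```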
integrity of the pipeline. Stations allow data to enter a pipeline
only through receiver components. Stations are never responsible for
Stations allow data to enter a pipeline only through receiver components.
Generally this makes sense but I have to question whether this models the reality of what the collector may do. Specifically, the notion of aggregation comes to mind.
Suppose we have 10 data points which a processor will aggregate into 1, according to our data model. Should this be communicated as 9 data points dropped? In my opinion this is an inaccurate characterization of a valid operation. A more accurate description might require a notion of adding 1 while dropping 10.
Additionally, I think there may be some valid cases where a processor would add data into a stream. For example, here is a proposed component which would augment a data stream. Perhaps more debate is necessary there, but it seems reasonable to me that in some cases we may generate additional items and insert them into the data stream, when they are naturally complementary.
A simpler case of adding data to a stream would involve computed metrics. Say we have a metric w/ "free" and "used" data points, and we wish to generate a "% utilized" metric. I think this could reasonably be done in a processor as well.
#### Collector perspective

Collector counters are exclusive. Like for SDKs, items that enter a
processor are counted in one of three ways and to compute a meaningful
ratio requires all three timeseries. If the processor is a sampler,
for example, the effective sampling rate is computed as
`(accepted+refused)/(accepted+refused+dropped)`.

While the collector defines and emits metrics sufficient for
monitoring the individual pipeline component--taken as a whole, there
is substantial redundancy in having so many exclusive counters. For
example, when a collector pipeline features no processors, the
receiver's `refused` count is expected to equal the exporter's
`send_failed` count.

When there are several processors, it is primarily the number of
dropped items that we are interested in counting. When there are
multiple sequential processors in a pipeline, however, counting the
total number of items at each stage in a multi-processor pipeline
leads to over-counting in aggregate. For example, if you combine
`accepted` and `refused` for two adjacent processors, then remove the
metric attribute which distinguishes them, the resulting sum will be
twice the number of items processed by the pipeline.
In my opinion this describes too limited a view of how the collector is expected to function. I'm concerned about establishing a model which cannot describe the reality of what the collector does.

> For example, when a collector pipeline features no processors, the receiver's `refused` count is expected to equal the exporter's `send_failed` count.

Is this true when a pipeline contains multiple exporters, each of which can fail independently? Similarly, what about when a receiver is used in multiple pipelines?

> For example, if you combine `accepted` and `refused` for two adjacent processors, then remove the metric attribute which distinguishes them, the resulting sum will be twice the number of items processed by the pipeline.

My understanding is that this is how our data model is intended to work. If we had a similar situation, for example "request" counts from a set of switches, we could use the same aggregation mechanism to understand the total number of requests processed by the switches, but we would not simultaneously expect this total to account for packets which flowed through multiple switches.

This is not to say that there is no way to represent net behavior across a linear sequence of processors, but I do not buy that there is a problem with counts for each processor.
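To make the over-counting concern concrete, here is a hypothetical numeric sketch (all figures invented): a sampler whose pass rate follows the `(accepted+refused)/(accepted+refused+dropped)` ratio quoted above, followed by a second processor, where summing `accepted+refused` across both processors after removing the distinguishing attribute exceeds the number of items that entered the pipeline.

```go
// Hypothetical numbers illustrating the per-processor ratio and the
// over-counting that results from summing counts across adjacent processors.
package main

import "fmt"

func main() {
	// Processor A (a sampler): sees 1000 items, drops 400, passes 600 on.
	aAccepted, aRefused, aDropped := int64(600), int64(0), int64(400)
	// Processor B (next in the chain): sees the 600 items A passed on.
	bAccepted, bRefused := int64(600), int64(0)

	// Per-processor ratio from the quoted text: A's effective sampling rate.
	aRate := float64(aAccepted+aRefused) / float64(aAccepted+aRefused+aDropped)
	fmt.Printf("processor A sampling rate: %.0f%%\n", aRate*100) // 60%

	// Summing accepted+refused across both processors, then removing the
	// attribute that distinguishes them, counts many items twice:
	sum := (aAccepted + aRefused) + (bAccepted + bRefused)
	fmt.Println("aggregate accepted+refused:", sum) // 1200, though only 1000 items entered
}
```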
Because of station integrity, we can make the following assertions:

1. Data that enters a station is eventually exported or dropped.
2. No other outcomes are possible.
As noted elsewhere, I think we need to account for other possibilities:
- Duplicated data (fanned out to multiple exporters within a pipeline, or from a receiver to multiple pipelines).
- Added data.
- Aggregated data (dropped and added)
- Data which is exported successfully by a subset of exporters, and rejected by a different subset.
metric detail is configured, to avoid redundancy. For simple
pipelines, the number of items exported equals the number of items
received minus the number of items dropped, and for simple pipelines
it is sufficient to observe only successes and failures by receiver as
well as items dropped by processors.
I'm not convinced the concept of a "simple" pipeline is worthy of a special model. In my opinion, we should find a solution which works more generally.
only through receiver components. Stations are never responsible for
dropping data, because only processor components drop data. Stations
SDK processors do not drop data, and processors aren't responsible for passing data to the next processor in a pipeline. Instead, higher-order SDK code ensures that each registered processor is called. As the spec puts it, "Each processor registered on TracerProvider is a start of pipeline that consist of span processor and optional exporter".
The architectural differences between SDK and collector processors may impact your design. For example, further on you state:
- Data that enters a station is eventually exported or dropped.
- No other outcomes are possible.
For SDKs, it's perfectly valid for a processor to do nothing besides extract baggage and add it to the span. No filtering, no exporting.
instances using the syntax `<type>/<instance>`. For example, if there
were two `batch` processors in a collection pipeline (e.g., one for
error spans and one for non-error spans) they might use the names
`batch/error` and `batch/noerror`.
In the SDK, if two batch processors are present, they're only differentiable by the associated exporter. You can't configure the SDK to send some spans to one batch processor, and others to another. So maybe the name would have to be something like `batch/otlp`, `batch/zipkin`, etc.
| `otel.name` | Type, name, or "type/name" of the component | Normal (Opt-out) | `probabilitysampler`, `batch`, `otlp/grpc` |
| `otel.success` | Boolean: item considered success? | Normal (Opt-out) | `true`, `false` |
| `otel.reason` | Explanation of success/failures. | Detailed (Opt-in) | `ok`, `timeout`, `permission_denied`, `resource_exhausted` |
| `otel.scope` | Name of instrumentation. | Detailed (Opt-in) | `opentelemetry.io/library` |
Is this intended to be different from the scope name? It seems like it's the same. If so, then we can't opt in or out of it. Maybe the prometheus exporter has options for that, but the scope isn't an optional part of the data model.
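For illustration, a minimal Go sketch of how the attribute keys in this table might appear on a counter increment; the instrument name `otel.sdk.exported` is taken from the proposal's tables, while the specific values are invented:

```go
package main

import (
	"context"

	"go.opentelemetry.io/otel"
	"go.opentelemetry.io/otel/attribute"
	"go.opentelemetry.io/otel/metric"
)

func main() {
	ctx := context.Background()
	meter := otel.Meter("pipeline-monitoring-example")

	// Instrument name from the proposal's tables; attribute keys from the
	// table above; values here are invented for illustration.
	exported, _ := meter.Int64Counter("otel.sdk.exported")
	exported.Add(ctx, 128, metric.WithAttributes(
		attribute.String("otel.signal", "traces"),
		attribute.String("otel.name", "otlp/grpc"), // Normal (Opt-out)
		attribute.Bool("otel.success", false),      // Normal (Opt-out)
		attribute.String("otel.reason", "timeout"), // Detailed (Opt-in)
	))
}
```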
| Attributes | Meaning | Level of detail (Optional) | Examples |
|----------------|---------------------------------------------|----------------------------|------------------------------------------------------------|
| `otel.signal` | Name of the telemetry signal | Basic (Opt-out) | `traces`, `logs`, `metrics` |
| `otel.name` | Type, name, or "type/name" of the component | Normal (Opt-out) | `probabilitysampler`, `batch`, `otlp/grpc` |
`otel.component.name`?
|---------------------|------------------|
| `otel.sdk.received` | Basic |
| `otel.sdk.dropped` | Normal |
| `otel.sdk.exported` | Detailed |
I'm struggling to see what metrics I would expect to get out of an SDK with different requirement levels. Would you be able to include an example with a typical SDK configuration? I'm imagining an SDK with logs, metrics, and traces enabled, each configured to export data via an OTLP exporter. What metrics and series do I see at Basic, Normal, and Detailed?
way than `otel.success`, with recommended values specified below.
- `otel.signal` (string): This is the name of the signal (e.g., "logs",
"metrics", "traces")
- `otel.name` (string): Name of the component in a pipeline.
Suggested change: `otel.name` → `otel.component`.
- `consumed`: Indicates a normal, synchronous request success case.
The item was consumed by the next stage of the pipeline, which
returned success.
- `unsampled`: Indicates a successful drop case, due to sampling.
When the Filter processor filters out data, should it count it as `unsampled`? The name doesn't sound right in this case.
EDIT: How about `discarded` instead of `unsampled`?
- `otelsdk.producer.items`: count of successful and failed items of
telemetry produced, by signal type, by an OpenTelemetry SDK.
- `otelcol.receiver.items`: count of successful and failed items of
If I have a connector configured, I assume I get both the `otelcol.exporter.items` and `otelcol.receiver.items` metrics emitted for the connector, right?

Let's say I configured the Count connector on a traces pipeline, as described in the example in the component's README. The count connector then accepts traces on the `traces/in` pipeline and creates metrics on the `metrics/out` pipeline.

I imagine the `otelcol.exporter.items` metric for the count connector would count the incoming spans on the trace pipeline. What would be the `otel.outcome` for those correctly consumed spans? Would it be `consumed` or rather `unsampled`? These spans aren't shipped anywhere by the component; they are "swallowed" by the connector if I understand correctly.

I imagine the `otelcol.receiver.items` metric for the count connector would count the metrics created on the metrics pipeline, with the `otel.outcome` set to `consumed`.
Is the consume operation synchronous? I think the traces/in pipeline will wait until the metrics/out pipeline finishes the consume operation, so the outcome for traces/in will depend on the outcome for metrics/out. If metrics/out fails w/ a retryable status, maybe the producer will retry.
Since the count connector can produce more or fewer metric data points than arriving spans, I do not expect the item counts to match between the exporter and receiver, but I think the outcomes could match for synchronous operations. If the operation is asynchronous, the rules discussed in this proposal would apply -- the `traces/in` might see `consumed` while the `metrics/out` sees some sort of failure.
I don't see any problems, per se, just that the monitoring equations for connectors don't apply. I can't assume that `items_in == items_dropped + items_out`.
what would ordinarily count as failure. This behavior makes automatic
component health status reporting more difficult than necessary.

One goal if this proposal is that Collector component health could be
Suggested change: "One goal if this proposal" → "One goal of this proposal".
which adds to the confusion -- it is not standard practice for
receivers to retry in the OpenTelemetry collector, that is the duty of
exporters in our current practice. So, the memory limiter component,
to be consistent, should count "failure drops" to indicate that the
next stage of the pipeline did not see the data.
It depends on the type of receiver. The push-based receivers respond with a retriable error code. Some pull-based receivers retry themselves, for example, the filelog receiver.
This is a work-in-progress. I will re-open it when it is ready for re-review.
(I am incorporating all the feedback received so far. Thank you, reviewers!)
Changes the treatment of [PartialSuccess](https://opentelemetry.io/docs/specs/otlp/#partial-success), making them successful and logging a warning instead of returning an error to the caller. These responses are meant to convey successful receipt of valid data which could not be accepted for other reasons, specifically to cover situations where the OpenTelemetry SDK and Collector have done nothing wrong and to avoid retries. While the existing OTLP exporter returns a permanent error (which also avoids retries), it makes the situation look like a total failure when in fact it is more nuanced. As discussed in the tracking issue, it is a lot of work to propagate these "partial" successes backwards in a pipeline, so the appropriate simple way to handle these items is to return success. In this PR, we log a warning. In a future PR, (IMO) as discussed in open-telemetry/oteps#238, we should count the spans/metrics/logs that are rejected in this way using a dedicated outcome label.

**Link to tracking Issue:** Part of #9243

**Testing:** Tests for the "partial success" warning have been added.

**Documentation:** PartialSuccess behavior was not documented. Given the level of detail in the README, it feels appropriate to continue not documenting; otherwise lots of new details would need to be added.

Co-authored-by: Alex Boten <[email protected]>
@kristinapathak will resume this effort, thank you!
This is a continuation of open-telemetry/semantic-conventions#184, which developed into a longer discussion than originally planned.
This text includes results from auditing the OpenTelemetry Collector and makes recommendations consistent with existing practice that extend our ability to configure basic, normal, and detailed-level metrics about OpenTelemetry data pipelines.