This repository has been archived by the owner on Dec 6, 2024. It is now read-only.

WIP: pipeline monitoring otep #259

Closed

kristinapathak

A continuation of @jmacd's work on #238 and #249

The goal of this OTEP is to define a semantic convention for metrics that provide information on the flow of data through a pipeline, providing insights both between and within segments of telemetry pipelines.

WIP.

Main focuses currently:

  1. Measuring loss in the pipeline: providing examples that show how failures show up in the proposed metric instruments.
  2. Example scenarios and adding descriptions to diagrams.

Comment on lines +20 to +22
- `otelcol_outgoing_items`: Exported, dropped, and discarded items (Collector)
- `otelcol_incoming_items`: Received and inserted data items (Collector)
- `otelsdk_outgoing_items`: Exported, dropped, and discarded items (SDK)
Author

I'm pretty agnostic as to whether these should use periods or underscores. I kept underscores for now but please let me know if I should change them.

Author

@TylerHelmuth, you left feedback on Josh's PR about this. Please let me know your preference.


### Retries

*WIP: add details*
Author

@kristinapathak, May 29, 2024

[Screenshot 2024-05-29 at 3:28:17 PM]

My draft on this so far: what outcomes look like when retrying three times and getting resource exhausted from the gateway on every attempt. (Red is a synchronous agent pipeline; green is asynchronous.)

Contributor

I think it would be nice if there were some kind of conservation property we could maintain in the presence of retries, where we have N errors and need to reduce them to a single success/failure status. It seems to me that something similar will also apply to the fanout-consumer logic.

Should we have a separate metric for the extra fanout factor which is N-1 in both of these cases? This number will be needed somewhere to understand pipeline conservation through fanout and retries, I think.
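
One way to picture the conservation idea is the minimal sketch below. It assumes each item reports only the outcome of its final attempt and that the extra N-1 attempts are tallied separately. The function name `summarize_retries` and the `success` outcome value are hypothetical and are not taken from the OTEP text.

```python
from collections import Counter


def summarize_retries(attempts_per_item):
    """Reduce per-attempt outcomes to one outcome per item.

    attempts_per_item holds one list per item, with that item's attempt
    outcomes in the order they happened. Only the last attempt decides the
    item's outcome; every earlier attempt counts as an extra attempt (N-1).
    """
    final_outcomes = Counter()
    extra_attempts = 0
    for attempts in attempts_per_item:
        if not attempts:
            raise ValueError("each item needs at least one attempt")
        final_outcomes[attempts[-1]] += 1
        extra_attempts += len(attempts) - 1
    return final_outcomes, extra_attempts


# Three items: one exhausted on all three attempts, one that succeeded on
# its second attempt, and one that succeeded immediately.
items = [
    ["exhausted", "exhausted", "exhausted"],
    ["exhausted", "success"],
    ["success"],
]
final_outcomes, extra_attempts = summarize_retries(items)
```

With this split, the per-item outcome counts always sum to the number of incoming items, and the separate `extra_attempts` value carries the retry overhead.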


### Recommended conventional attributes

- `otel.error` (boolean): This is true or false depending on whether the
Author

@TylerHelmuth, you recommended adding `pipeline` to the `otel.` prefix here (i.e., `otel.pipeline.error`). Do you mean for all of the attributes below? I'm not sure how these are all attributes of an otel pipeline.

- `otel.error` (boolean): This is true or false depending on whether the
outcome is considered a failure or a success. See the chart below.
- `otel.outcome` (string): This describes the outcome in a more specific
way than `otel.error`.

@kristinapathak As mentioned at the SIG, I am interested in adding a similar attribute to the otelcol_exporter_send_failed_* metrics as part of open-telemetry/opentelemetry-collector#10158

I would be happy to use outcome if that is thought best.

My one concern is that I see that attribute is used on another metric we use in our org, specifically the Micrometer generated http.server.requests metric, see here for the ENUM values. I see that this is not defined for the http semconv, but I just thought it worth noting here for reference.

Author

I can see how outcome will be used by a variety of metrics. My hope is that with the otel prefix (i.e., otel.outcome) there is no conflict with the attribute name.

Do the outcome values defined here work to solve open-telemetry/opentelemetry-collector#10157 or is more detail needed? If others are also in favor of this attribute, my hope is that you can update your PR to match this. 🙂

Member

@djaglowski left a comment

Overall this proposal strikes me as more intuitive than previous iterations but I still have a few questions regarding collector pipelines.

Comment on lines +47 to +48
the OpenTelemetry Collector as Collector pipelines. A Collector can contain
multiple Collector pipelines which can contain multiple segments. Each segment
Member

> A Collector can contain multiple Collector pipelines which can contain multiple segments.

I'm unclear what this is saying. Is it saying that a single Collector pipeline contains multiple segments?

If so, how are those segments defined? For example, in the following pipeline, what are the segments?

receivers: [r1, r2]
processors: [p1, p2]
exporters: [e1, e2]

Author

The segments would be defined as:

  • r1, r2, p1, p2, e1
  • r1, r2, p1, p2, e2
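
The rule implied by that answer (one segment per exporter, each containing every receiver and processor of the pipeline) can be sketched as follows. The helper name `pipeline_segments` is hypothetical and only illustrates the enumeration.

```python
def pipeline_segments(receivers, processors, exporters):
    """Return one segment per exporter.

    Each segment lists all receivers, then all processors, then the single
    exporter that closes the segment.
    """
    return [list(receivers) + list(processors) + [exporter] for exporter in exporters]


segments = pipeline_segments(["r1", "r2"], ["p1", "p2"], ["e1", "e2"])
```

Note that r1, r2, p1, and p2 each appear in both segments, so a component can belong to more than one segment.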

Member

If I understand correctly, each collector pipeline contains a segment per exporter, where each of these segments contains all the receivers and all the processors of the pipeline, in addition to the exporter.

Can we update the language in this section to state this more clearly? Currently, it reads as "a receiver, zero or more processors, and an exporter" which doesn't appear accurate.

Can we also state "Components can be a part of multiple segments" before describing the relationship between segments and collector pipelines, since it's a prerequisite to understanding?

Comment on lines +150 to +159
exporters. If the pipeline is synchronous, the outcome for the incoming item is
recorded based on the rules in the below order:

1. If there is a permanent error, that is used as the outcome. If there are
multiple permanent errors, choose them in the following order:
`rejected`, `deferred:rejected`, `unknown`, `deferred:unknown`.
2. If there is a transient error, that is used as the outcome. If there are
multiple transient errors, choose them in the following order:
`dropped`, `deferred:dropped`, `timeout`, `deferred:timeout`, `exhausted`,
`deferred:exhausted`, `retryable`, `deferred:retryable`.
Member

Since these rules are described as applying to synchronous pipelines, should they include deferred outcomes?

Author

Oops, that's a good point! Deferred outcomes don't belong here.
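
With the deferred outcomes removed, the ordering in the quoted rules could be sketched roughly like this. The quoted excerpt stops after rule 2, so the fallback to a `success` value when no error is present is an assumption, as are the names `combine_outcomes`, `PERMANENT`, and `TRANSIENT`.

```python
# Precedence taken from the quoted rules, minus the deferred outcomes.
PERMANENT = ["rejected", "unknown"]
TRANSIENT = ["dropped", "timeout", "exhausted", "retryable"]
PRIORITY = PERMANENT + TRANSIENT


def combine_outcomes(exporter_outcomes, success="success"):
    """Pick a single outcome for an item sent to several exporters.

    Permanent errors win over transient errors, and within each group the
    earlier entry in the list wins. The success fallback is an assumption.
    """
    outcomes = set(exporter_outcomes)
    if not outcomes:
        raise ValueError("at least one exporter outcome is required")
    unexpected = outcomes - set(PRIORITY) - {success}
    if unexpected:
        raise ValueError(f"unrecognized outcomes: {sorted(unexpected)}")
    for outcome in PRIORITY:
        if outcome in outcomes:
            return outcome
    return success


combined = combine_outcomes(["timeout", "rejected", "success"])
```

Here `combined` is `rejected`, because a permanent error takes precedence over the transient `timeout`.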


Additional examples of these outcomes can be found in the Appendix.

### Collector Pipelines With Multiple Exporter Components
Member

There are some other arrangements of components which I'm trying to fit to the proposed model. Some of them may be worth including in this document as well. Here's how I'm understanding them:

  1. A single collector pipeline with multiple exporters. As already noted, this includes a fanout point. The document describes how to aggregate outcomes from multiple exporters, effectively ensuring that Incoming(Segment) == Outgoing(Segment).
  2. A single collector pipeline with multiple receivers. This is fairly straightforward and common. The incoming items are just summed together. Probably doesn't require a dedicated section but we could include it for symmetry.
  3. A single receiver shared by multiple collector pipelines. This is the other type of fanout point in the collector. From here there are some similar considerations to the first case, but instead of fanning out to multiple exporters, we fan out to entire pipelines. What is the relationship between an item arriving at such a receiver and the outcomes of passing it to multiple pipelines? For example, one pipeline may successfully export the item while the other encounters an error which propagates back to the receiver. Does this count as "origin:received" for both pipelines, with each pipeline then showing a different outcome?
  4. A single exporter shared by multiple collector pipelines. This is the other type of merge point in the collector. In this case, I think synchronous outcomes can resolve back to a specific pipeline, but async outcomes may not be related to any specific pipeline. For example, if 2 pipelines each send 10 items to an exporter, which then batches all 20 items together into a single export request, the outcome may be "20 deferred:rejected", but it would be incorrect for either pipeline to incorporate that count directly. Is there any way to handle this? Otherwise, maybe this is just a caveat for interpreting deferred outcomes.
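
For cases 1 and 2, the conservation check Incoming(Segment) == Outgoing(Segment) amounts to comparing two sums. The sketch below assumes counts have already been collected per receiver and per combined outcome; the function name, the counts, and the `success` outcome value are all hypothetical. It does not address the shared-receiver or shared-exporter cases.

```python
def is_conserved(incoming_by_receiver, outgoing_by_outcome):
    """Check that every incoming item is accounted for by exactly one outcome."""
    return sum(incoming_by_receiver.values()) == sum(outgoing_by_outcome.values())


# Two receivers feed the segment; each item ends with one combined outcome.
incoming = {"r1": 10, "r2": 5}
outgoing = {"success": 12, "rejected": 3}
conserved = is_conserved(incoming, outgoing)
```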

Contributor

I wonder about adding to alternatives considered:

The collector's logic for fanoutconsumer uses multierr, building on Go's Unwrap() []error idiom to enclose all the errors. In my opinion, it would be nice to see fanoutconsumer manage the logic of deciding how to transform multiple errors into a single error, so that we could specify that fanoutconsumer dictates the N-to-1 problem and the observability mechanism just follows whatever it decides.

Contributor

@jmacd left a comment

This looks great, thank you @kristinapathak.

I think I would approve it if all the "WIP" sections were fleshed out, but they're minor, and I think this document stands in sufficient detail for an implementation to be prototyped. Probably the next step is to prototype these metrics in the collector and an SDK.

Comment on lines +347 to +348
*WIP: Figure this out. This is a bit subjective. What does an end user expect
when calculating total items dropped in failure?*
Contributor

To me, the idea of "total" signifies starting from the beginning of the pipeline at the SDKs: take the sum of all the SDK-inserted items and compare it against some point later in the pipeline to measure how many original items were lost. This could mean looking at a gateway collector's exporter counts and comparing them to the SDK-inserted counts, for example.
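
As a rough illustration of that comparison, the sketch below subtracts a gateway's successfully exported count from the sum of SDK-inserted counts. It assumes both counts cover the same time window and that nothing inserts items between the SDKs and the gateway. The service names, numbers, and the function name `items_lost` are hypothetical.

```python
def items_lost(sdk_inserted_counts, gateway_exported_count):
    """Return how many SDK-inserted items never left the gateway successfully."""
    total_inserted = sum(sdk_inserted_counts.values())
    return total_inserted - gateway_exported_count


# Two SDKs inserted 150 items in total; the gateway exported 140 of them.
lost = items_lost({"service-a": 100, "service-b": 50}, 140)
```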


codeboten added a commit to open-telemetry/opentelemetry-collector that referenced this pull request Sep 13, 2024
…ms (#11144)

This updates the existing metric points that were recently added to use
signal as an attribute instead of separating the metric name. It follows
the suggestions in [otep
259](open-telemetry/oteps#259) for the metric
and attribute names.

Putting this in draft to get some feedback from @djaglowski before
moving forward with this change

---------

Signed-off-by: Alex Boten <[email protected]>
@jmacd
Contributor

jmacd commented Oct 10, 2024

Apologies to @kristinapathak, I think we should close this.

See open-telemetry/opentelemetry-collector#11311
open-telemetry/opentelemetry-collector#11406

@jmacd jmacd closed this Oct 10, 2024
@kristinapathak kristinapathak deleted the pipeline-monitoring branch October 10, 2024 21:41
@tsloughter
Member

@jmacd those only cover the collector, right? Is the plan to open a separate PR that focuses only on the SDK metrics?
