RFC - Pipeline Component Telemetry #11406
Merged: mx-psi merged 18 commits into open-telemetry:main from djaglowski:component-telemetry-rfc on Nov 27, 2024 (+218 −0)
Commits (18):
- 639453f RFC - Auto-instrumentation of pipeline components (djaglowski)
- 2a486ff Update docs/rfcs/component-universal-telemetry.md (djaglowski)
- d350043 Feedback (djaglowski)
- 9a2a925 Feedback (djaglowski)
- 9d0cc6d Broaden scope and convert to evolving consensus (djaglowski)
- 14b0ba1 Update names to consumed and produced (djaglowski)
- 9cc449a Change proposed metric names to use '.' instead of '_' (djaglowski)
- f03ce85 Separate metrics by component kind (djaglowski)
- 3a91135 Add profiles as attribute value (djaglowski)
- ecda695 Update docs/rfcs/component-universal-telemetry.md (djaglowski)
- e87f245 Update docs/rfcs/component-universal-telemetry.md (djaglowski)
- 9645cf0 Change 'otel.output.signal' to 'otel.signal.output' (djaglowski)
- eda7699 Change 'otel.*' to 'otelcol.*' (djaglowski)
- 5d54078 Update docs/rfcs/component-universal-telemetry.md (djaglowski)
- 95bafe9 Change unit "items" to "item" (djaglowski)
- a7a15e5 Add section about instrementation scope (djaglowski)
- 02584b0 Fix markdown link check (djaglowski)
- cb72f2a Merge branch 'main' into component-telemetry-rfc (djaglowski)
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,218 @@ | ||
| # Pipeline Component Telemetry | ||
|
|
||
| ## Motivation and Scope | ||
|
|
||
| The collector should be observable and this must naturally include observability of its pipeline components. Pipeline components | ||
| are those components of the collector which directly interact with data, specifically receivers, processors, exporters, and connectors. | ||
|
|
||
| It is understood that each _type_ (`filelog`, `batch`, etc.) of component may emit telemetry describing its internal workings, | ||
| and that these internally derived signals may vary greatly based on the concerns and maturity of each component. Naturally | ||
| though, there is much we can do to normalize the telemetry emitted from and about pipeline components. | ||
|
|
||
| Two major challenges in pursuit of broadly normalized telemetry are (1) consistent attributes, and (2) automatic capture. | ||
|
|
||
| This RFC represents an evolving consensus about the desired end state of component telemetry. It does _not_ claim | ||
| to describe the final state of all component telemetry, but rather seeks to document some specific aspects. It proposes a set of | ||
| attributes which are both necessary and sufficient to identify components and their instances. It also articulates one specific | ||
| mechanism by which some telemetry can be automatically captured. Finally, it describes some specific metrics and logs which should | ||
| be automatically captured for each kind of pipeline component. | ||
|
|
||
| ## Goals | ||
|
|
||
| 1. Define attributes that are (A) specific enough to describe individual component [_instances_](https://github.com/open-telemetry/opentelemetry-collector/issues/10534) | ||
| and (B) consistent enough for correlation across signals. | ||
| 2. Articulate a mechanism which enables us to _automatically_ capture telemetry from _all pipeline components_. | ||
| 3. Define specific metrics for each kind of pipeline component. | ||
| 4. Define specific logs for all kinds of pipeline component. | ||
|
|
||
| ## Attributes | ||
|
|
||
| All signals should use the following attributes: | ||
|
|
||
| ### Receivers | ||
|
|
||
| - `otelcol.component.kind`: `receiver` | ||
| - `otelcol.component.id`: The component ID | ||
| - `otelcol.signal`: `logs`, `metrics`, `traces`, `profiles` | ||
|
|
||
| ### Processors | ||
|
|
||
| - `otelcol.component.kind`: `processor` | ||
| - `otelcol.component.id`: The component ID | ||
| - `otelcol.pipeline.id`: The pipeline ID | ||
| - `otelcol.signal`: `logs`, `metrics`, `traces`, `profiles` | ||
|
|
||
| ### Exporters | ||
|
|
||
| - `otelcol.component.kind`: `exporter` | ||
| - `otelcol.component.id`: The component ID | ||
| - `otelcol.signal`: `logs`, `metrics`, `traces`, `profiles` | ||
|
|
||
| ### Connectors | ||
|
|
||
| - `otelcol.component.kind`: `connector` | ||
| - `otelcol.component.id`: The component ID | ||
| - `otelcol.signal`: `logs`, `metrics`, `traces` | ||
| - `otelcol.signal.output`: `logs`, `metrics`, `traces`, `profiles` | ||
|
|
||
| Note: The `otelcol.signal`, `otelcol.signal.output`, or `otelcol.pipeline.id` attributes may be omitted if the corresponding component instances | ||
| are unified by the component implementation. For example, the `otlp` receiver is a singleton, so its telemetry is not specific to a signal. | ||
| Similarly, the `memory_limiter` processor is a singleton, so its telemetry is not specific to a pipeline. | ||
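As an illustration of the attribute sets above, the following minimal sketch builds the attributes for one component instance. It is written in Python purely for brevity (the collector itself is written in Go). The helper name `component_attributes`, its parameters, and the example pipeline ID are hypothetical and not part of any collector API. Leaving an optional argument unset models the singleton case described in the note.

```python
def component_attributes(kind, component_id, signal=None,
                         pipeline_id=None, output_signal=None):
    """Build the identifying attributes for one component instance.

    Hypothetical helper for illustration only; not a collector API.
    """
    attrs = {
        "otelcol.component.kind": kind,
        "otelcol.component.id": component_id,
    }
    # Optional attributes are omitted when the implementation unifies
    # instances (for example, a receiver shared across signals).
    if signal is not None:
        attrs["otelcol.signal"] = signal
    # Only processors are scoped to a pipeline.
    if kind == "processor" and pipeline_id is not None:
        attrs["otelcol.pipeline.id"] = pipeline_id
    # Only connectors carry a separate output signal.
    if kind == "connector" and output_signal is not None:
        attrs["otelcol.signal.output"] = output_signal
    return attrs
```

For example, `component_attributes("processor", "batch", signal="logs", pipeline_id="logs/main")` yields all four processor attributes, while `component_attributes("receiver", "otlp")` yields only the kind and ID.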
|
|
||
| ## Auto-Instrumentation Mechanism | ||
|
|
||
| The mechanism of telemetry capture should be _external_ to components. Specifically, we should observe telemetry at each point where a | ||
| component passes data to another component, and, at each point where a component consumes data from another component. In terms of the | ||
| component graph, every _edge_ in the graph will have two layers of instrumentation - one for the producing component and one for the | ||
| consuming component. Importantly, each layer generates telemetry ascribed to a single component instance, so by having two layers per | ||
| edge we can describe both sides of each handoff independently. | ||
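The two-layer idea can be sketched as follows. This is a simplified illustration in Python (the collector is written in Go), not the actual implementation: `observe` and `instrument_edge` are hypothetical names, a plain `Counter` stands in for real metric instruments, and `len(batch)` stands in for an item count.

```python
from collections import Counter


def observe(consume, metric_name, component_id, counters):
    """Wrap `consume` so that each call is counted under one component."""
    def wrapped(batch):
        try:
            consume(batch)
        except Exception:
            # The wrapped call returned an error: count the items as failed
            # and let the error propagate to the caller unchanged.
            counters[(metric_name, component_id, "failure")] += len(batch)
            raise
        counters[(metric_name, component_id, "success")] += len(batch)
    return wrapped


def instrument_edge(producer, consumer, consume, counters):
    """Apply both layers to one edge of the component graph.

    `producer` and `consumer` are (kind, component_id) pairs, and
    `consume` is the consuming component's own consume function.
    """
    # Inner layer: telemetry ascribed to the consuming component.
    consumed = observe(consume, f"otelcol.{consumer[0]}.consumed.items",
                       consumer[1], counters)
    # Outer layer: telemetry ascribed to the producing component.
    return observe(consumed, f"otelcol.{producer[0]}.produced.items",
                   producer[1], counters)
```

Because each layer records under a single component, a failed handoff shows up once as a produced failure for the upstream component and once as a consumed failure for the downstream component.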
|
|
||
| Telemetry captured by this mechanism should be associated with an instrumentation scope corresponding to the package which implements | ||
| the mechanism. Currently, that package is `service/internal/graph`, but this may change in the future. Notably, this telemetry is not | ||
| ascribed to individual component packages, both because the instrumentation scope is intended to describe the origin of the telemetry, | ||
| and because no mechanism is presently identified which would allow us to determine the characteristics of a component-specific scope. | ||
|
|
||
| ### Instrumentation Scope | ||
|
|
||
| All telemetry described in this RFC should include a scope name which corresponds to the package which implements the telemetry. If the | ||
| package is internal, then the scope name should be that of the module which contains the package. For example, | ||
| `go.opentelemetry.io/collector/service` should be used instead of `go.opentelemetry.io/collector/service/internal/graph`. | ||
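The scope naming rule can be illustrated with a small sketch. It is written in Python for brevity and makes a simplifying assumption that the RFC does not state: the containing module is taken to be the path directly above the first `internal` segment. The function name `scope_name` and the `example.com` paths are hypothetical.

```python
def scope_name(package_path):
    """Return the scope name for a Go-style package path.

    Assumes, for illustration only, that the containing module is the
    path directly above the first `internal` segment.
    """
    segments = package_path.split("/")
    if "internal" in segments:
        # Internal package: fall back to the enclosing module path.
        segments = segments[:segments.index("internal")]
    return "/".join(segments)
```

Under that assumption, `scope_name("example.com/mymodule/internal/graph")` returns `example.com/mymodule`, while a non-internal package path is returned unchanged.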
|
|
||
| ### Auto-Instrumented Metrics | ||
|
|
||
| There are two straightforward measurements that can be made on any pdata: | ||
|
|
||
| 1. A count of "items" (spans, data points, or log records). These are low cost but broadly useful, so they should be enabled by default. | ||
| 2. A measure of size, based on [ProtoMarshaler.Sizer()](https://github.com/open-telemetry/opentelemetry-collector/blob/9907ba50df0d5853c34d2962cf21da42e15a560d/pdata/ptrace/pb.go#l11). | ||
| These may be high cost to compute, so by default they should be disabled (and not calculated). This default setting may change in the future if it is demonstrated that the cost is generally acceptable. | ||
|
|
||
| The location of these measurements can be described in terms of whether the data is "consumed" or "produced", from the perspective of the | ||
| component to which the telemetry is attributed. Metrics which contain the term "produced" describe data which is emitted from the component, | ||
| while metrics which contain the term "consumed" describe data which is received by the component. | ||
|
|
||
| For both metrics, an `outcome` attribute with possible values `success` and `failure` should be automatically recorded, corresponding to | ||
| whether or not the corresponding function call returned an error. Specifically, consumed measurements will be recorded with `outcome` as | ||
| `failure` when a call from the previous component to the `ConsumeX` function returns an error, and `success` otherwise. Likewise, produced | ||
| measurements will be recorded with `outcome` as `failure` when a call to the next consumer's `ConsumeX` function returns an error, and | ||
| `success` otherwise. | ||
|
|
||
| ```yaml | ||
| otelcol.receiver.produced.items: | ||
| enabled: true | ||
| description: Number of items emitted from the receiver. | ||
| unit: "{item}" | ||
| sum: | ||
| value_type: int | ||
| monotonic: true | ||
| otelcol.processor.consumed.items: | ||
| enabled: true | ||
| description: Number of items passed to the processor. | ||
| unit: "{item}" | ||
| sum: | ||
| value_type: int | ||
| monotonic: true | ||
| otelcol.processor.produced.items: | ||
| enabled: true | ||
| description: Number of items emitted from the processor. | ||
| unit: "{item}" | ||
| sum: | ||
| value_type: int | ||
| monotonic: true | ||
| otelcol.connector.consumed.items: | ||
| enabled: true | ||
| description: Number of items passed to the connector. | ||
| unit: "{item}" | ||
| sum: | ||
| value_type: int | ||
| monotonic: true | ||
| otelcol.connector.produced.items: | ||
| enabled: true | ||
| description: Number of items emitted from the connector. | ||
| unit: "{item}" | ||
| sum: | ||
| value_type: int | ||
| monotonic: true | ||
| otelcol.exporter.consumed.items: | ||
| enabled: true | ||
| description: Number of items passed to the exporter. | ||
| unit: "{item}" | ||
| sum: | ||
| value_type: int | ||
| monotonic: true | ||
|
|
||
| otelcol.receiver.produced.size: | ||
| enabled: false | ||
| description: Size of items emitted from the receiver. | ||
| unit: "By" | ||
| sum: | ||
| value_type: int | ||
| monotonic: true | ||
| otelcol.processor.consumed.size: | ||
| enabled: false | ||
| description: Size of items passed to the processor. | ||
| unit: "By" | ||
| sum: | ||
| value_type: int | ||
| monotonic: true | ||
| otelcol.processor.produced.size: | ||
| enabled: false | ||
| description: Size of items emitted from the processor. | ||
| unit: "By" | ||
| sum: | ||
| value_type: int | ||
| monotonic: true | ||
| otelcol.connector.consumed.size: | ||
| enabled: false | ||
| description: Size of items passed to the connector. | ||
| unit: "By" | ||
| sum: | ||
| value_type: int | ||
| monotonic: true | ||
| otelcol.connector.produced.size: | ||
| enabled: false | ||
| description: Size of items emitted from the connector. | ||
| unit: "By" | ||
| sum: | ||
| value_type: int | ||
| monotonic: true | ||
| otelcol.exporter.consumed.size: | ||
| enabled: false | ||
| description: Size of items passed to the exporter. | ||
| unit: "By" | ||
| sum: | ||
| value_type: int | ||
| monotonic: true | ||
| ``` | ||
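The metric names above follow a regular pattern: `otelcol.<kind>.<consumed|produced>.<items|size>`, where receivers only produce and exporters only consume. A minimal sketch that enumerates them, in Python for brevity (`DIRECTIONS` and `metric_names` are hypothetical names used only for illustration):

```python
# Directions that apply to each kind of pipeline component.
DIRECTIONS = {
    "receiver": ["produced"],
    "processor": ["consumed", "produced"],
    "connector": ["consumed", "produced"],
    "exporter": ["consumed"],
}


def metric_names(measurement):
    """List the metric names for one measurement ("items" or "size")."""
    return [
        f"otelcol.{kind}.{direction}.{measurement}"
        for kind, directions in DIRECTIONS.items()
        for direction in directions
    ]
```

Calling `metric_names("items")` and `metric_names("size")` gives six names each, matching the twelve metrics defined above.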
|
|
||
| ### Auto-Instrumented Logs | ||
|
|
||
| Metrics provide most of the observability we need, but there are some gaps which logs can fill. Although metrics describe the overall | ||
| item counts, it is helpful in some cases to record more granular events. For example, if a produced batch of 10,000 spans results in an error but | ||
| 100 batches of 100 spans succeed, this may be a matter of batch size that can be detected by analyzing logs, while the corresponding metric | ||
| reports only that a 50% success rate is observed. | ||
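The arithmetic behind this example can be worked through in a short sketch (Python, for brevity). The item-level view matches what the metric reports, while the per-batch view is the kind of detail that individual log records would expose. The names `success_rate`, `outcomes_by_batch_size`, and `batches` are hypothetical.

```python
from collections import defaultdict

# One failed batch of 10,000 spans, then 100 successful batches of 100 spans.
batches = [(10_000, False)] + [(100, True)] * 100


def success_rate(batches):
    """Item-level success rate, as an item-count metric would report it."""
    total = sum(count for count, _ in batches)
    succeeded = sum(count for count, ok in batches if ok)
    return succeeded / total


def outcomes_by_batch_size(batches):
    """Per-batch outcomes grouped by batch size, as logs would allow."""
    by_size = defaultdict(lambda: {"success": 0, "failure": 0})
    for count, ok in batches:
        by_size[count]["success" if ok else "failure"] += 1
    return dict(by_size)
```

Here 10,000 of 20,000 spans succeed, so the item-level rate is 0.5, while grouping by batch size shows that every failure comes from the single 10,000-span batch.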
|
|
||
| For security and performance reasons, it would not be appropriate to log the contents of telemetry. | ||
|
|
||
| It's very easy for logs to become too noisy. Even if errors are occurring frequently in the data pipeline, only the errors that are not | ||
| handled automatically will be of interest to most users. | ||
|
|
||
| With the above considerations, this proposal includes only that we add a DEBUG log for each individual outcome. This should be sufficient for | ||
| detailed troubleshooting but does not impact users otherwise. | ||
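A minimal sketch of such a log, using Python's standard `logging` module for brevity. The RFC does not define a message format, so the logger name, the function `log_outcome`, and the message layout are all assumptions made for illustration. Only the component ID, item count, and outcome are recorded, never the telemetry contents.

```python
import logging

# Hypothetical logger name, chosen only for this example.
logger = logging.getLogger("pipeline.telemetry")


def log_outcome(component_id, item_count, error=None):
    """Emit one DEBUG record describing the outcome of a single handoff."""
    outcome = "failure" if error is not None else "success"
    # Deliberately logs counts and outcome only, not the data itself.
    logger.debug("%s: %d items, outcome=%s", component_id, item_count, outcome)
    return outcome
```

Because the record is emitted at DEBUG level, it is suppressed under typical default log levels and becomes visible only when a user opts in for troubleshooting.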
|
|
||
| In the future, it may be helpful to define triggers for reporting repeated failures at a higher severity level, for example N failures in | ||
| a row, or a moving average success rate. For now, the criteria and necessary configurability are unclear, so this is mentioned only as an example | ||
| of future possibilities. | ||
|
|
||
| ### Auto-Instrumented Spans | ||
|
|
||
| It is not clear that any spans can be captured automatically with the proposed mechanism. We have the ability to insert instrumentation both | ||
| before and after processors and connectors. However, we generally cannot assume a 1:1 relationship between consumed and produced data. | ||
|
|
||
| ## Additional Context | ||
|
|
||
| This proposal pulls from a number of issues and PRs: | ||
|
|
||
| - [Demonstrate graph-based metrics](https://github.com/open-telemetry/opentelemetry-collector/pull/11311) | ||
| - [Attributes for component instancing](https://github.com/open-telemetry/opentelemetry-collector/issues/11179) | ||
| - [Simple processor metrics](https://github.com/open-telemetry/opentelemetry-collector/issues/10708) | ||
| - [Component instancing is complicated](https://github.com/open-telemetry/opentelemetry-collector/issues/10534) | ||