[OTEP] Telemetry Policy #4738
Merged
lmolkova
merged 30 commits into
open-telemetry:main
from
jsuereth:wip-telemetry-policy-otep
Jun 30, 2026
+1,029
−1
Changes from 4 commits
e70377e
Create OTEP based on kubecon discussions on policy control.
jsuereth da34640
Update oteps/9999-telemetry-policy.md
jsuereth 6396605
Add more justification.
jsuereth 4a092aa
Rename to PR number.
jsuereth 4063e20
update with more details from kubecon
jaronoff97 1e8697b
Add merging spec
menderico 26645d4
Add some todos
jaronoff97 92383dd
Add post-merge conflict resolution and example
menderico 930a151
update w/ 80 line breaks, better organization, tradeoffs, etc.
jaronoff97 87eec2d
update with more work
jaronoff97 26a98ea
chlog entry
jaronoff97 cc255e6
fix chlog
jaronoff97 40b05b3
Merge branch 'main' into wip-telemetry-policy-otep
jaronoff97 6c3e8de
update policy spec from feedback
jaronoff97 1b88a16
Merge branch 'main' into wip-telemetry-policy-otep
jaronoff97 10c4d96
spcheck
jaronoff97 79d5956
Merge branch 'wip-telemetry-policy-otep' of github.com:jsuereth/opent…
jaronoff97 715b1f6
Update oteps/4738-telemetry-policy.md
jaronoff97 f7d013e
Update oteps/4738-telemetry-policy.md
jaronoff97 6cf0565
Merge branch 'main' into wip-telemetry-policy-otep
jaronoff97 2897f8b
Merge branch 'main' into wip-telemetry-policy-otep
jaronoff97 53c8a3b
Merge branch 'main' into wip-telemetry-policy-otep
jaronoff97 88fb961
Merge branch 'main' into wip-telemetry-policy-otep
jaronoff97 d6bbdb6
update minor language
jaronoff97 3603e24
fix lint
jaronoff97 d6618ee
better language
jaronoff97 8eee91c
Merge branch 'main' into wip-telemetry-policy-otep
jsuereth a41ed6b
fix some links
jaronoff97 428e34f
Merge branch 'wip-telemetry-policy-otep' of github.com:jsuereth/opent…
jaronoff97 2553612
Merge branch 'main' into wip-telemetry-policy-otep
jsuereth
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,230 @@ | ||
| # Telemetry Policies | ||
|
|
||
| Defines a new concept for OpenTelemetry: Telemetry Policy. | ||
|
|
||
| ## Motivation | ||
|
|
||
| OpenTelemetry provides a robust, standards-based instrumentation solution. | ||
| This includes many great components, e.g. | ||
|
|
||
| - Declarative configuration | ||
| - Control Protocol via OpAMP | ||
| - X-language extension points in the SDK (samplers, processors, views) | ||
| - Telemetry-Plane controls via the OpenTelemetry collector. | ||
|
|
||
| However, OpenTelemetry still struggles to provide true "remote control" | ||
| capabilities that are implementation agnostic. When using OpAMP with an | ||
| OpenTelemetry collector, the "controlling server" of OpAMP needs to understand | ||
| the configuration layout of an OpenTelemetry collector. If a user asked the | ||
| server to "filter out all attributes starting with `x.`", the server would | ||
| need to understand/parse the OpenTelemetry collector configuration. If the | ||
| controlling server was also managing an OpenTelemetry SDK, then it would need | ||
| a *second* implementation of the "filter attribute" feature for the SDK vs. | ||
| the Collector. Additionally, as the OpenTelemetry collector allows custom | ||
| configuration file formats, there is no way for a "controlling server" to | ||
| operate with an OpenTelemetry Collector distribution without understanding all | ||
| possible implementations it may need to talk to. | ||
|
|
||
| Additionally, existing remote-control capabilities in OpenTelemetry are not | ||
| "guaranteed" to be usable due to specification language. For example, today | ||
| one can use the Jaeger Remote Sampler specified for OpenTelemetry SDKs and the | ||
| Jaeger remote sampler extension in the OpenTelemetry collector to dynamically | ||
| control the sampling of spans in SDKs. However, file-based configuration does | ||
| not require dynamic reloading of configuration. This means attempting to | ||
| provide a solution like Jaeger-remote-sampler with just OpAMP + file-based | ||
| config is impossible today. | ||
|
|
||
| However, we believe there is a way to achieve our goals without changing | ||
| the direction of OpAMP or file-based configuration. Instead we can break apart | ||
| the notion of "Configuration" from "Policy", providing a new capability in | ||
| OpenTelemetry. | ||
|
|
||
| ## Explanation | ||
|
|
||
| We define a new concept called a `Telemetry Policy`. A Policy is an | ||
| intent-based specification from a user of OpenTelemetry. | ||
|
|
||
| - **Typed**: A policy self identifies its "type". Policies of different types | ||
| cannot be merged, but policies of the same type MUST be merged together. | ||
| - **Clearly specified behavior**: A policy type enforces a specific behavior for | ||
| a clear use case, e.g. trace sampling, metric aggregation, attribute | ||
| filtering. | ||
| - **Implementation Agnostic**: I can use the exact same policy in the collector | ||
| or an SDK or any other component supporting OpenTelemetry's ecosystem. | ||
| - **Standalone**: I don't need to understand how a pipeline is configured to define | ||
| policy. | ||
| - **Dynamic**: We expect policies to be defined and driven outside the lifecycle | ||
| of a single collector or SDK. This means the SDK behavior needs the ability | ||
| to change post-instantiation. | ||
| - **Idempotent**: I can give a policy to multiple components in a | ||
| telemetry-plane safely. E.g. if both an SDK and collector obtain an | ||
| attribute-filter policy, the filtering would only take effect once. | ||
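
The typed, mergeable, and idempotent properties above can be illustrated with a minimal sketch. Everything named here is an assumption made for illustration: `AttributeFilterPolicy`, `merge`, and `apply` are hypothetical, and the merge rule (union of prefixes) is only one possible choice, since the OTEP leaves the merge algorithm as a TODO.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AttributeFilterPolicy:
    """Hypothetical policy: drop attributes whose key starts with a prefix."""
    type: str
    prefixes: frozenset


def merge(a, b):
    """Merge two policies of the same type (assumed rule: union of prefixes)."""
    if a.type != b.type:
        # Policies of different types cannot be merged.
        raise ValueError(f"cannot merge {a.type!r} with {b.type!r}")
    return AttributeFilterPolicy(a.type, a.prefixes | b.prefixes)


def apply(policy, attributes):
    """Return a copy of attributes without keys matching any prefix."""
    return {
        key: value
        for key, value in attributes.items()
        if not any(key.startswith(prefix) for prefix in policy.prefixes)
    }


sdk_policy = AttributeFilterPolicy("attribute-filter", frozenset({"x."}))
extra_policy = AttributeFilterPolicy("attribute-filter", frozenset({"internal."}))
merged = merge(sdk_policy, extra_policy)

span_attributes = {"x.debug": "1", "internal.id": "7", "http.method": "GET"}
once = apply(merged, span_attributes)  # e.g. enforced in the SDK
twice = apply(merged, once)            # e.g. enforced again in the collector
```

Applying the merged policy a second time leaves the attributes unchanged, which is the sense in which handing the same policy to both an SDK and a collector is safe.
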
|
|
||
| Every policy is defined with the following: | ||
|
|
||
| - A `type` denoting the use case for the policy | ||
| - A JSON schema denoting what a valid definition of the policy entails. | ||
| - TODO - A merge algorithm, denoting how multiple policies can be merged | ||
| together in a component to create desired behavior. | ||
| - TODO - A specification denoting the behavior the policy enforces. | ||
| - TODO - *implicitly* a policy has a target resource / signal it is aimed at. | ||
| This will be used to route policies to destinations. | ||
|
|
||
| Example policy types include: | ||
| - `trace-sampling`: define how traces are sampled | ||
| - `metric-rate`: define sampling period for metrics | ||
| - `log-filter`: define how logs are sampled/filtered | ||
| - `attribute-redaction`: define attributes which need redaction/removal. | ||
| - `metric-aggregation`: define how metrics should be aggregated (i.e. views). | ||
| - `exemplar-sampling`: define how exemplars are sampled | ||
|
|
||
| TODO - more examples? | ||
|
|
||
| TODO - Remaining high level pieces: | ||
|
|
||
| - SDK Components | ||
|   - `PolicyProvider` | ||
|     - Can "push" policies into the provider. | ||
|     - Provides "observable" access to policies (e.g. notify on change) | ||
|   - Extension Points | ||
|     - `PolicySampler`: Pulls relevant `trace-sampling` policies from | ||
|       PolicyProvider, and uses them. | ||
|     - `PolicyLogProcessor`: Pulls relevant `log-filter` policies from | ||
|       PolicyProvider and uses them. | ||
|     - `PolicyPeriodicMetricReader`: Pulls relevant `metric-rate` policies | ||
|       from PolicyProvider and uses them to export metrics. | ||
|     - TODO: SDK-wide attribute processors | ||
|     - TODO: SDK-view policies | ||
| - Collector Components | ||
|   - `PolicyProcessor` | ||
|     - Pulls configured policies that can be enforced as a processor. | ||
|     - E.g. `log-filter`, `attribute-redaction` | ||
|   - TODO - others? | ||
| - OpAMP Interaction | ||
|   - Policy = custom extension | ||
|   - Can we safely "roll back" a policy if it caused a breakage? | ||
| - Configuration Interaction: We always expect "policy-aware" components to be configured; policies are ignorant of pipelines. | ||
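
One way the "push" and "observable" behavior of a `PolicyProvider` could fit together with a `PolicySampler` is sketched below. This is not a proposed API: the method names `push` and `subscribe`, the `ratio` field, and the "lowest ratio wins" rule are all assumptions made for illustration.

```python
from collections import defaultdict
from typing import Callable, Dict, List


class PolicyProvider:
    """Hypothetical provider: stores pushed policies by type, notifies listeners."""

    def __init__(self) -> None:
        self._policies: Dict[str, List[dict]] = defaultdict(list)
        self._listeners: Dict[str, List[Callable[[List[dict]], None]]] = defaultdict(list)

    def push(self, policy_type: str, policy: dict) -> None:
        """Store a policy and notify every listener for that policy type."""
        self._policies[policy_type].append(policy)
        for listener in self._listeners[policy_type]:
            listener(list(self._policies[policy_type]))

    def subscribe(self, policy_type: str, listener: Callable[[List[dict]], None]) -> None:
        """Register a listener and hand it the current policies immediately."""
        self._listeners[policy_type].append(listener)
        listener(list(self._policies[policy_type]))


class PolicySampler:
    """Hypothetical sampler that tracks the latest `trace-sampling` policies."""

    def __init__(self, provider: PolicyProvider) -> None:
        self.ratio = 1.0
        provider.subscribe("trace-sampling", self._on_change)

    def _on_change(self, policies: List[dict]) -> None:
        # Assumed rule: the most restrictive (lowest) ratio wins.
        self.ratio = min((p["ratio"] for p in policies), default=1.0)


provider = PolicyProvider()
sampler = PolicySampler(provider)
provider.push("trace-sampling", {"ratio": 0.25})
provider.push("trace-sampling", {"ratio": 0.5})
```

The sampler never reads pipeline configuration; it only reacts to policies of its own type, so its behavior can change after instantiation without a restart.
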
|
|
||
|
|
||
| ## Internal details | ||
|
|
||
| TODO - write | ||
|
|
||
| From a technical perspective, how do you propose accomplishing the proposal? In particular, please explain: | ||
|
|
||
| * How the change would impact and interact with existing functionality | ||
| * Likely error modes (and how to handle them) | ||
| * Corner cases (and how to handle them) | ||
|
|
||
| While you do not need to prescribe a particular implementation - indeed, OTEPs should be about **behaviour**, not implementation! - it may be useful to provide at least one suggestion as to how the proposal *could* be implemented. This helps reassure reviewers that implementation is at least possible, and often inspires them to think more deeply about trade-offs, alternatives, etc. | ||
|
|
||
| ## Trade-offs and mitigations | ||
|
|
||
| TODO - write | ||
|
|
||
| What are some (known!) drawbacks? What are some ways that they might be mitigated? | ||
|
|
||
| Note that mitigations do not need to be complete *solutions*, and that they do not need to be accomplished directly through your proposal. A suggested mitigation may even warrant its own OTEP! | ||
|
|
||
| ## Prior art and alternatives | ||
|
|
||
| TODO - discuss https://github.com/open-telemetry/opentelemetry-specification/pull/4672 | ||
|
|
||
| ### Declarative Config + OpAMP as sole control for telemetry | ||
|
|
||
| The declarative config + OpAMP could be used to send any config to any | ||
| component in OpenTelemetry. Here, we would leverage OpAMP configuration passing | ||
| and the open-extension and definitions of Declarative Config to pass the whole | ||
| behavior of an SDK or Collector from an OpAMP "controlling server" down to a | ||
| component and have them dynamically reload behavior. | ||
|
|
||
| What this solution doesn't do is answer how to understand what config can be | ||
| sent to what component, and how to drive control / policy independent of | ||
| implementation or pipeline set-up. For example, imagine a simple collector | ||
| configuration: | ||
|
|
||
| ```yaml | ||
| receivers: | ||
|   otlp: | ||
|   prometheus: | ||
|     # ... config ... | ||
| processors: | ||
|   batch: | ||
|   memory_limiter: | ||
|   transform/drop_attribute: | ||
|     # config to drop an attribute | ||
| exporters: | ||
|   otlp: | ||
| pipelines: | ||
|   metrics/critical: | ||
|     receivers: [otlp] | ||
|     processors: [batch, transform/drop_attribute] | ||
|     exporters: [otlp] | ||
|   metrics/all: | ||
|     receivers: [prometheus] | ||
|     processors: [memory_limiter] | ||
|     exporters: [otlp] | ||
| ``` | ||
|
|
||
| Here, we have two pipelines with intended purposes and tuned configurations. | ||
| One which will *not* drop metrics when memory limits are reached and another | ||
| that will. Now - if we want to drop a particular metric from being reported, | ||
| which pipeline do we modify? Should we construct a new processor for that | ||
| purpose? Should we always do so? | ||
|
|
||
| Now imagine we *also* have an SDK we're controlling with declarative config. If | ||
| we want to control metric inclusion in that SDK, we'd need to generate a | ||
| completely different looking configuration file, as follows: | ||
|
|
||
| ```yaml | ||
| file_format: '1.0-rc.1' | ||
| # ... other config ... | ||
| meter_provider: | ||
|   readers: | ||
|     - my_custom_metric_filtering_reader: | ||
|         my_filter_config: # defines what to filter | ||
|         wrapped: | ||
|           periodic: | ||
|             exporter: | ||
|               otlp_http: | ||
|                 endpoint: ${OTEL_EXPORTER_OTLP_ENDPOINT:-http://localhost:4318}/v1/metrics | ||
| ``` | ||
|
|
||
| Here, I've created a custom component in Java to allow filtering which metrics are read. | ||
| However, to insert / use this component I need to do all of the following: | ||
|
|
||
| - Know that this component exists in the Java SDK | ||
| - Know how to wire it into any existing metric export pipeline (e.g. my reader | ||
| wraps another reader that has the real export config). | ||
| Note: This likely means I need to understand the rest of the exporter | ||
| configuration or be able to parse it. | ||
|
|
||
| This is not ideal for a few reasons: | ||
|
|
||
| - Anyone designing a server that can control telemetry flow MUST have a deep | ||
| understanding of all components it could control and their implementations. | ||
| - We don't have a "safe" mechanism to declare what configuration is supported | ||
| or could be sent to a specific component (note: we can design one) | ||
| - The level of control we'd expose from our telemetry systems is *expansive* | ||
| and possibly dangerous. | ||
| - We cannot limit the impact of any remote configuration on the working of a | ||
| system. We cannot prevent changes that may take down a process. | ||
| - We cannot limit the execution overhead of configuration or exercise | ||
| fine-grained control over what changes would be allowed remotely. | ||
|
|
||
| ## Open questions | ||
|
|
||
| What are some questions that you know aren't resolved yet by the OTEP? These may be questions that could be answered through further discussion, implementation experiments, or anything else that the future may bring. | ||
|
|
||
| ## Prototypes | ||
|
|
||
| Link to any prototypes or proof-of-concept implementations that you have created. | ||
| This may include code, design documents, or anything else that demonstrates the | ||
| feasibility of your proposal. | ||
|
|
||
| Depending on the scope of the change, prototyping in multiple programming | ||
| languages might be required. | ||
|
|
||
| ## Future possibilities | ||
|
|
||
| What are some future changes that this proposal would enable? | ||