diff --git a/CHANGELOG.md b/CHANGELOG.md index b88a75ca1fb..e04b367a537 100644 --- a/CHANGELOG.md +++ b/CHANGELOG.md @@ -426,6 +426,8 @@ release. ### OTEPs +- Introduce Policies into the specification. ([#4288](https://github.com/open-telemetry/opentelemetry-specification/pull/4288)) + - Extend attributes to support complex values. ([#4485](https://github.com/open-telemetry/opentelemetry-specification/pull/4485)) diff --git a/oteps/4738-telemetry-policy.md b/oteps/4738-telemetry-policy.md new file mode 100644 index 00000000000..ac438a78ed0 --- /dev/null +++ b/oteps/4738-telemetry-policy.md @@ -0,0 +1,1024 @@ +# Telemetry Policies + +Defines a new concept for OpenTelemetry: Telemetry Policy. + +## Motivation + +OpenTelemetry provides declarative configuration, OpAMP for remote control, +cross-language SDK extension points, and the OpenTelemetry Collector for +telemetry processing. Controlling telemetry behavior at scale remains difficult. +The current model—configuration files that define processing pipelines—breaks +down in predictable ways. + +### Configurations grow organically + +Processing rules accumulate over time. A configuration that started as 50 lines +becomes thousands. Each line represents a problem that was solved, but the +context is lost. Line 847 exists for a reason, but that reason is not documented +in the configuration itself. + +Changes become risky because the configuration encodes institutional knowledge +that is not legible to new team members or external tools. + +### Configurations require global reasoning + +To safely change one part, you need to understand the whole. Data flows through +a DAG of components—what shape is it at this point? What came before? What +breaks if you modify this? The cognitive load grows with the config size until +changes become risky and reviews become superficial. + +When using OpAMP with an OpenTelemetry Collector, the controlling server needs +to understand the configuration layout of that specific collector. 
If a user +asks the server to "filter out all attributes starting with `x.`", the server +must understand and parse the collector configuration. If the same server also +manages an OpenTelemetry SDK, it needs two implementations of the attribute +filtering feature: one for the SDK and one for the Collector. Each component has +its own configuration format and semantics. + +### Configurations don't scale + +Dropping a hundred noisy log patterns is feasible. A thousand patterns degrade +performance. Ten thousand are impractical. The sequential processing model was +not designed for this level of specificity. Organizations compromise by dropping +broad categories, losing signal with the noise. + +### Remote control lacks guarantees + +Existing remote-control capabilities in OpenTelemetry are not guaranteed to be +usable. The Jaeger Remote Sampler works with OpenTelemetry SDKs and the +Collector's Jaeger remote sampler extension. However, file-based configuration +does not require dynamic reloading. Neither OpAMP nor file-based configuration +mandates that a recipient apply changes dynamically; an implementation can +conform to both specifications without supporting dynamic adaptation. Without a +component that explicitly requires dynamic behavior, there is no guarantee that +remote configuration changes take effect without a full restart. + +The OpenTelemetry Collector allows custom configuration file formats. A +controlling server cannot operate with an arbitrary Collector distribution +without understanding all possible configuration formats it may encounter. + +### A different model + +These problems can be addressed without changing the direction of OpAMP or +file-based configuration. The solution is to separate "configuration" from +"policy". + +Policies are independent rules. Each policy is atomic, self-contained, and +understandable in isolation. The execution model supports tens of thousands of +policies without degradation. 
A policy works the same way whether it runs in an +SDK, a Collector, or any other component that implements the specification. + +## Explanation + +We define a new concept called a `Telemetry Policy`. A Policy is an intent-based +specification from a user of OpenTelemetry. + +- **Typed**: A policy self-identifies its "type" through its target signal. In + the proto schema, this is enforced by the `oneof target` field — a policy + targets exactly one signal (e.g. log, metric, profile, trace). Policies of different + types cannot be merged, but policies of the same type MUST be merged together. +- **Clearly specified behavior**: A policy type enforces a specific behavior for + a clear use case, e.g. trace sampling, metric aggregation, attribute + filtering. +- **Implementation Agnostic**: I can use the exact same policy in the collector + or an SDK or any other component supporting OpenTelemetry's ecosystem. +- **Standalone**: I don't need to understand how a pipeline is configured to + define policy. +- **Dynamic**: We expect policies to be defined and driven outside the lifecycle + of a single collector or SDK. This means the SDK behavior needs the ability to + change post-instantiation. +- **Idempotent**: I can give a policy to multiple components in a + telemetry-plane safely. E.g. if both an SDK and collector obtain an + attribute-filter policy, it would only occur once. + +Every policy is defined with the following: + +- A `type` denoting the use case for the policy +- A schema denoting what a valid definition of the policy entails, describing + how servers should present the policy to customers. +- A specification denoting behavior the policy enforces + - A specification makes clear the protobuf structure + - The behavior that is expected for an implementation + - A set of examples and test cases to verify the behavior + +Policies MUST NOT: + +- Specify configuration relating to the underlying policy applier + implementation. 
+ - A policy cannot know where the policy is going to be run. +- Specify its transport methodology. +- Interfere with telemetry upon failure. + - Policies MUST be fail-open. +- Contain logical waterfalls. + - Each policy's application is distinct from one another and at this moment + MUST NOT depend on another running. This is in keeping with the idempotency + principle. + +Example policy types include: + +- `trace-sampling`: define how traces are sampled +- `metric-rate`: define sampling period for metrics +- `log-filter`: define how logs are sampled/filtered +- `attribute-redaction`: define attributes that need redaction/removal. +- `metric-aggregation`: define how metrics should be aggregated (i.e. views). +- `exemplar-sampling`: define how exemplars are sampled +- `attribute-filter`: define data that should be rejected based on attributes + +
+Example Policies + +**Cost Control — Drop debug logs** + +```json +{ + "id": "drop-debug-logs", + "name": "Drop debug and trace logs", + "log": { + "match": [ + { + "log_field": "severity_text", + "regex": "^(DEBUG|TRACE)$" + } + ], + "keep": "none" + } +} +``` + +**PCI Compliance — Redact credit card numbers** + +```json +{ + "id": "redact-ccs", + "name": "Redact credit card numbers", + "log": { + "match": [ + { + "log_attribute": ["ccn"], + "exists": true + } + ], + "transform": { + "redact": [ + { + "log_attribute": ["ccn"] + } + ] + } + } +} +``` + +**Trace Sampling — Sample database spans at 5%** + +```json +{ + "id": "sample-database-spans-5-percent", + "name": "Sample database spans at 5%", + "description": "Aggressively samples database spans which are typically high volume. Uses equalizing mode to balance sampling across different query types.", + "trace": { + "match": [ + { + "span_attribute": ["db.system"], + "exists": true + } + ], + "keep": { + "percentage": 5.0, + "mode": "equalizing", + "sampling_precision": 6 + } + } +} +``` + +
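To make the two log examples above concrete, here is a minimal Python sketch of how a policy implementation might apply them to a single record. The record layout (a dict with `severity_text` and `attributes`) and the helper names `lookup`, `matches`, and `apply_log_policy` are assumptions for illustration, not part of the proposal. Only the `regex` and `exists` matchers, `keep: "none"`, and `redact` are handled, and Python's `re` module stands in for whatever regex engine a real implementation uses.

```python
import re

DROP_DEBUG = {
    "id": "drop-debug-logs",
    "log": {
        "match": [{"log_field": "severity_text", "regex": "^(DEBUG|TRACE)$"}],
        "keep": "none",
    },
}
REDACT_CCS = {
    "id": "redact-ccs",
    "log": {
        "match": [{"log_attribute": ["ccn"], "exists": True}],
        "transform": {"redact": [{"log_attribute": ["ccn"]}]},
    },
}


def lookup(record, selector):
    """Return the value a matcher or redact target points at, or None if absent."""
    if "log_field" in selector:
        return record.get(selector["log_field"])
    value = record.get("attributes", {})
    for key in selector["log_attribute"]:
        if not isinstance(value, dict) or key not in value:
            return None
        value = value[key]
    return value


def matches(record, policy):
    """All matchers must pass (logical AND); only two match types are sketched."""
    for m in policy["log"]["match"]:
        value = lookup(record, m)
        if "exists" in m and (value is not None) != m["exists"]:
            return False
        if "regex" in m and (value is None or not re.search(m["regex"], str(value))):
            return False
    return True


def apply_log_policy(record, policy):
    """Return the (possibly modified) record, or None if the policy drops it."""
    if not matches(record, policy):
        return record
    if policy["log"].get("keep", "all") == "none":
        return None
    for target in policy["log"].get("transform", {}).get("redact", []):
        if lookup(record, target) is None:
            continue  # nothing to redact
        path = target["log_attribute"]
        attrs = record["attributes"]
        for key in path[:-1]:
            attrs = attrs[key]
        attrs[path[-1]] = target.get("replacement", "[REDACTED]")
    return record
```

A record that matches no policy is returned untouched, which is the behavior an SDK with no policies configured would also show.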
+ +## Policy Ecosystem + +Policies are designed to be straightforward objects with little to no logic tied +to them. Policies are also designed to be agnostic to the transport, +implementation, and data type. It is the goal of the ecosystem to support +policies in various ways. Policies MUST be additive and MUST NOT break existing +standards. It is therefore our goal to extend the ecosystem by recommending +implementations through the following architecture. + +The architectural decisions are meant to be flexible to allow users optionality +in their infrastructure. For example, a user may decide to run a multi-stage +policy architecture where the SDK, daemon collector, and gateway collector work +in tandem where the SDK and Daemons are given set policies while the gateway is +remotely managed. Another user may choose to solely remotely manage their SDKs. +As a result of this scalable architecture, it's recommended that policy provider +updates are asynchronous. An out-of-date policy (i.e. one updated in a policy +provider but not yet in the applier) should not be lethal to the functionality +of the system. + +
+Architecture Diagram + +```mermaid +--- +title: Policy Architecture +--- +flowchart TB + subgraph providers ["Policy Providers"] + direction TB + PP["«interface» Policy Provider"] + + File["File Provider"] + HTTP["HTTP Server Provider"] + OpAMP["OpAMP Server Provider"] + Custom["Custom Provider"] + + PP -.->|implements| File + PP -.->|implements| HTTP + PP -.->|implements| OpAMP + PP -.->|implements| Custom + end + + subgraph aggregator ["Policy Aggregator"] + PA["Policy Aggregator (Special Provider)"] + end + + subgraph implementation ["Policy Implementation"] + PI["Policy Implementation"] + PT["Supported Policy Types"] + PI --- PT + end + + subgraph policies ["Policies"] + P1["Policy 1"] + P2["Policy 2"] + P3["Policy N..."] + end + + %% Provider relationships + PP -.->|implements| PA + + %% Aggregator pulls from providers + File -->|policies| PA + HTTP -->|policies| PA + OpAMP -->|policies| PA + Custom -->|policies| PA + + %% Providers supply policies to implementation + File -->|supplies policies| PI + HTTP -->|supplies policies| PI + OpAMP -->|supplies policies| PI + Custom -->|supplies policies| PI + PA -->|supplies policies| PI + + %% Policies relationship + PP -->|provides| policies + PI -->|runs| policies + + %% Optional type info + PP -.->|"may supply supported policy types (optional)"| PI +``` + +
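The relationships in the diagram could be expressed roughly as follows. This is a sketch under stated assumptions: the `subscribe` method, the `StaticProvider` class, and the listener callback shape are hypothetical names chosen for illustration, since the proposal does not define a provider API. It shows only that an aggregator is itself a provider that re-publishes the latest snapshot from each upstream.

```python
from typing import Callable, Dict, List, Protocol

Policy = Dict[str, object]
Listener = Callable[[List[Policy]], None]


class PolicyProvider(Protocol):
    """Hypothetical provider interface: pushes full policy snapshots to listeners."""

    def subscribe(self, listener: Listener) -> None: ...


class StaticProvider:
    """In-memory stand-in for a file, HTTP, or OpAMP provider."""

    def __init__(self) -> None:
        self._listeners: List[Listener] = []
        self._policies: List[Policy] = []

    def subscribe(self, listener: Listener) -> None:
        self._listeners.append(listener)
        listener(list(self._policies))  # deliver the current snapshot immediately

    def publish(self, policies: List[Policy]) -> None:
        self._policies = list(policies)
        for listener in self._listeners:
            listener(list(self._policies))


class PolicyAggregator:
    """A provider that merges the latest snapshot from each upstream provider."""

    def __init__(self, upstreams: List[PolicyProvider]) -> None:
        self._snapshots: Dict[int, List[Policy]] = {}
        self._listeners: List[Listener] = []
        for index, upstream in enumerate(upstreams):
            upstream.subscribe(lambda policies, i=index: self._on_update(i, policies))

    def subscribe(self, listener: Listener) -> None:
        self._listeners.append(listener)
        listener(self.current())

    def current(self) -> List[Policy]:
        # Concatenate snapshots in upstream order so the output is stable.
        return [p for i in sorted(self._snapshots) for p in self._snapshots[i]]

    def _on_update(self, index: int, policies: List[Policy]) -> None:
        self._snapshots[index] = policies
        for listener in self._listeners:
            listener(self.current())
```

Because updates arrive through callbacks, a policy implementation that subscribes to the aggregator keeps working with its last snapshot when an upstream is slow or unavailable.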
+ +### Example Ecosystem Implementations + +The following observations and recommendations describe how the community may +integrate with this specification. + +#### OpenTelemetry SDKs + +An SDK's declarative configuration may be extended to support a list of policy +providers. An SDK with no policy providers configured behaves the same as +today—policies are fail-open. The simplest policy provider is the file provider. +The SDK reads this file at startup and optionally watches for changes. + +Policy providers push policies into the SDK, allowing the SDK to become a policy +implementation. An SDK may receive updates at any time, so it must support +reloading in its extension points. Sample SDK extension points: + +- `PolicySampler`: Pulls relevant `trace-sampling` policies from PolicyProvider. +- `PolicyLogProcessor`: Pulls relevant `log-filter` policies from + PolicyProvider. +- `PolicyPeriodicMetricReader`: Pulls relevant `metric-rate` policies from + PolicyProvider. + +#### OpenTelemetry Collector + +The Collector is a natural place to run policies. A policy processor may be +introduced to execute policies. The Collector should use the same declarative +configuration as the SDK for policy provider configuration. The Collector may +introduce an inline policy provider for default policies in addition to those +received from external providers. + +The Collector may also serve as a policy aggregator through a policy extension. +The extension pulls from multiple policy providers while other policy +implementations set the Collector as their policy provider. This pattern enables +a horizontally scalable architecture where all extensions eventually report the +same policies. + +#### OpAMP + +This specification makes no requirements on the transport layer for policy +providers. OpAMP may serve as a policy provider through custom messages. A +policy implementation with OpAMP support may use the OpAMP connection to +transport policies. 
This specification makes no recommendation on the custom +message format. + +#### Summary + +This specification makes no requirements on these groups. It is recommended that +they adhere to a consistent experience for users to enhance portability. +Coordination with other SIGs will ensure agreement on configuration. A follow-up +specification may recommend policy provider specifics such as an HTTP/gRPC +definition, which would serve as a basis for custom implementations like OpAMP. +See `Future Possibilities` for more. + +## Internal details + +### Typed Schema + +Below is a sample schema for a policy, defined in protobuf. We +make an effort to adhere to OpenTelemetry Semantic Conventions and previous +specifications. Note: these proto definitions are subject to change after this +OTEP is accepted. + +```proto +message Policy { + // Unique identifier for this policy + string id = 1; + + // Human-readable name + string name = 2; + + // Optional description + string description = 3; + + // Whether this policy is enabled + bool enabled = 4; + + // Timestamp when this policy was created (Unix epoch nanoseconds) + fixed64 created_at_unix_nano = 5; + + // Timestamp when this policy was last modified (Unix epoch nanoseconds) + fixed64 modified_at_unix_nano = 6; + + // Labels for metadata and routing + repeated opentelemetry.proto.common.v1.KeyValue labels = 7; + + // Target configuration. Exactly one must be set. + oneof target { + LogTarget log = 10; + MetricTarget metric = 11; + TraceTarget trace = 12; + ... + } +} +``` + +Every policy MUST have an ID and name. Each policy MAY specify associated labels +and metadata about its creation. Each policy MUST specify exactly one target +configuration to promote specificity for users when creating a policy. + +
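For policies exchanged as JSON, the requirements above could be checked with something like the following sketch. The function names and the `KNOWN_TARGETS` tuple are assumptions for illustration; the schema elides further targets, so the tuple lists only the three shown. Invalid policies are skipped rather than rejected as a batch, in keeping with the fail-open rule.

```python
KNOWN_TARGETS = ("log", "metric", "trace")  # assumption: the schema elides other targets


def validate_policy(policy: dict) -> list:
    """Return a list of human-readable problems; an empty list means valid."""
    problems = []
    for field in ("id", "name"):
        if not policy.get(field):
            problems.append(f"missing required field: {field}")
    targets = [t for t in KNOWN_TARGETS if t in policy]
    if len(targets) != 1:
        problems.append(f"expected exactly one target, found {len(targets)}")
    return problems


def load_policies(candidates: list) -> list:
    """Keep only valid policies; an invalid one is skipped, never fatal."""
    return [p for p in candidates if not validate_policy(p)]
```

In a protobuf-based implementation the `oneof` already prevents two targets from being set, so only the "no target" and missing-field cases would need an explicit check.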
+Target Proto Definitions + +**LogTarget** + +```proto +message LogTarget { + // At least one matcher is required + repeated LogMatcher match = 1; + + // Keep behavior: "all" (default), "none", or a sampling percentage + string keep = 2; + + // Optional transformations applied after keep + LogTransform transform = 3; +} + +message LogTransform { + repeated LogRemove remove = 1; + repeated LogRedact redact = 2; + repeated LogRename rename = 3; + repeated LogAdd add = 4; +} + +message LogRemove { + oneof field { + LogField log_field = 1; + AttributePath log_attribute = 2; + AttributePath resource_attribute = 3; + AttributePath scope_attribute = 4; + } +} + +message LogRedact { + oneof field { + LogField log_field = 1; + AttributePath log_attribute = 2; + AttributePath resource_attribute = 3; + AttributePath scope_attribute = 4; + } + string replacement = 10; // defaults to "[REDACTED]" +} + +message LogRename { + oneof from { + LogField log_field = 1; + AttributePath log_attribute = 2; + AttributePath resource_attribute = 3; + AttributePath scope_attribute = 4; + } + string to = 10; + bool upsert = 11; +} + +message LogAdd { + oneof field { + LogField log_field = 1; + AttributePath log_attribute = 2; + AttributePath resource_attribute = 3; + AttributePath scope_attribute = 4; + } + string value = 10; + bool upsert = 11; +} +``` + +**MetricTarget** + +```proto +message MetricTarget { + // At least one matcher is required + repeated MetricMatcher match = 1; + + // Whether to keep matching metrics + bool keep = 2; +} + +message MetricMatcher { + oneof field { + MetricField metric_field = 1; + AttributePath datapoint_attribute = 2; + AttributePath resource_attribute = 3; + AttributePath scope_attribute = 4; + MetricType metric_type = 5; + } + + oneof match { + string exact = 10; + string regex = 11; + bool exists = 12; + string starts_with = 13; + string ends_with = 14; + string contains = 15; + } + + bool negate = 20; + bool case_insensitive = 21; +} +``` + 
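The `match`, `negate`, and `case_insensitive` fields of a matcher such as `MetricMatcher` could be evaluated along these lines. This is a sketch, not the proposal's algorithm: the function name `match_value` is hypothetical, the matcher is assumed to arrive as a dict with the field value already extracted, and treating a missing value as a non-match before `negate` is applied is an assumption. The proposal requires RE2 syntax for regular expressions; Python's `re` is used here only as a stand-in.

```python
import re


def match_value(value, matcher: dict) -> bool:
    """Evaluate one matcher against an extracted field value (None if absent)."""
    ci = matcher.get("case_insensitive", False)
    if "exists" in matcher:
        result = (value is not None) == matcher["exists"]
    elif value is None:
        result = False  # assumption: a missing field never matches a string test
    elif "regex" in matcher:
        flags = re.IGNORECASE if ci else 0
        result = re.search(matcher["regex"], str(value), flags) is not None
    else:
        text = str(value).lower() if ci else str(value)
        ops = {
            "exact": lambda want: text == want,
            "starts_with": lambda want: text.startswith(want),
            "ends_with": lambda want: text.endswith(want),
            "contains": lambda want: want in text,
        }
        op = next((name for name in ops if name in matcher), None)
        if op is None:
            raise ValueError("matcher has no match type set")
        want = matcher[op].lower() if ci else matcher[op]
        result = ops[op](want)
    # negate inverts whatever the match type produced
    return result != matcher.get("negate", False)
```

For example, `match_value("http.server.duration", {"starts_with": "http."})` is true, and adding `"negate": True` to the same matcher makes it false.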
+**TraceTarget** + +```proto +message TraceTarget { + // At least one matcher is required + repeated TraceMatcher match = 1; + + // Probabilistic sampling configuration + TraceSamplingConfig keep = 2; +} + +message TraceMatcher { + oneof field { + TraceField trace_field = 1; + AttributePath span_attribute = 2; + AttributePath resource_attribute = 3; + AttributePath scope_attribute = 4; + SpanKind span_kind = 5; + SpanStatusCode span_status = 6; + string event_name = 7; + AttributePath event_attribute = 8; + string link_trace_id = 9; + } + + oneof match { + string exact = 10; + string regex = 11; + bool exists = 12; + string starts_with = 13; + string ends_with = 14; + string contains = 15; + } + + bool negate = 20; + bool case_insensitive = 21; +} + +message TraceSamplingConfig { + float percentage = 1; // 0-100 + string mode = 2; // "hash_seed", "proportional", or "equalizing" + int32 sampling_precision = 3; // hex digits for threshold encoding (1-14) + int32 hash_seed = 4; // hash seed for deterministic sampling + bool fail_closed = 5; // reject items on sampling errors +} +``` + +
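The `percentage` and `sampling_precision` fields can be illustrated with a threshold calculation. The sketch below assumes that `keep` follows the threshold scheme of OpenTelemetry's consistent probability sampling: 56 bits of randomness taken from the low bytes of the trace ID, and a rejection threshold written as up to 14 hex digits with trailing zeros dropped. The function names are hypothetical, precision is applied by simple truncation, and the `mode`, `hash_seed`, and `fail_closed` fields are not modeled.

```python
MAX_THRESHOLD = 1 << 56  # 56 bits of randomness, i.e. 14 hex digits


def threshold_hex(percentage: float, precision: int) -> str:
    """Encode the rejection threshold for a keep percentage as hex digits."""
    if not 0 < percentage <= 100:
        raise ValueError("percentage must be in (0, 100]")
    if not 1 <= precision <= 14:
        raise ValueError("sampling_precision must be in 1..14")
    threshold = round((1 - percentage / 100) * MAX_THRESHOLD)
    threshold = min(threshold, MAX_THRESHOLD - 1)  # keep within 14 hex digits
    digits = format(threshold, "014x")[:precision]  # truncate to the precision
    return digits.rstrip("0") or "0"  # trailing zeros carry no information


def should_sample(trace_id: bytes, threshold: str) -> bool:
    """Keep the span when the trace ID's low 56 bits are at or above the threshold."""
    randomness = int.from_bytes(trace_id[-7:], "big")
    return randomness >= int(threshold.ljust(14, "0"), 16)
```

Truncating the threshold rounds it down, so the effective sampling rate is never lower than the requested percentage; a higher `sampling_precision` narrows that gap.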
+Throughout the schema, we take advantage of `oneof` to prevent invalid +configuration (i.e. someone specifying type: trace and then a metric-only +configuration). + +#### Policy Matchers + +To optimize the performance of policies and adhere to the above requirements for +policies, each policy target configuration begins with setting a list of ANDed +matchers. The `LogMatcher` configuration below allows a user to easily target a +log or group of logs through any fields available to the log. A policy MUST +contain at least one matcher. Regular expressions MUST use RE2 syntax for +cross-implementation consistency. + +```proto +message LogMatcher { + // The field to match against. Exactly one must be set. + oneof field { + // Simple fields (body, severity_text, trace_id, span_id, etc.) + LogField log_field = 1; + + // Log record attribute by key or path + AttributePath log_attribute = 2; + + // Resource attribute by key or path + AttributePath resource_attribute = 3; + + // Scope attribute by key or path + AttributePath scope_attribute = 4; + } + + // Match type. Exactly one must be set. + oneof match { + // Exact string match + string exact = 10; + + // Regular expression match + string regex = 11; + + ... + } + + // If true, inverts the match result + bool negate = 20; + + // If true, applies case-insensitive matching to all match types + bool case_insensitive = 21; +} +``` + +### Policy Design + +Policies are not a general-purpose language that implementations interpret +dynamically. Each policy stage is a concrete, versioned capability that +implementations must explicitly support. This means new stages require +implementation updates — an implementation cannot execute a stage it does not +understand. + +#### Current Stages + +The specification currently defines two stages, executed in fixed order: + +1. **Keep** — Determines whether telemetry is retained, sampled, or dropped. 
All + matching policies contribute their `keep` values and the runtime applies the + most restrictive result. If telemetry is dropped or sampled out, processing + stops. Keep is supported for all signal types (logs, metrics, traces). Trace + keep supports probabilistic sampling with configurable modes (hash_seed, + proportional, equalizing) and W3C tracestate propagation. + +2. **Transform** — Modifies telemetry that survives the keep stage. Operations + execute in a fixed order: remove → redact → rename → add. Currently + transforms are defined for logs only. Within each operation type, if multiple + policies target the same field, the result is implementation-defined but MUST + be deterministic. + +#### Adding New Stages + +New policy stages (e.g., metric renaming, metric aggregation, span rollups) will +follow the same process as the current stages: defined in the specification, +validated through the conformance suite, and implemented across language +libraries. Because policies are not a general-purpose language, each new stage +requires: + +- A specification update defining the stage's schema, behavior, and merging + semantics. +- Conformance tests covering the new stage's behavior. +- Implementation updates in each language library. + +Implementations MAY support a subset of stages but MUST clearly document which +stages are unsupported. An implementation that encounters a policy with an +unsupported stage MUST follow fail-open behavior — the policy is skipped, not +the telemetry. + +This approach trades the flexibility of a general-purpose language for +predictability. Every stage has well-defined semantics, every implementation +agrees on behavior, and the conformance suite guarantees consistency. A new +stage ships when the specification, tests, and at least one implementation are +ready. + +### Runtime Requirements + +#### Evaluation + +Implementations MAY evaluate policies concurrently. 
The independence of policies +enables parallel matching without coordination. + +#### Error Handling + +Implementations MUST be fail-open: + +- If a policy fails to parse, it MUST be skipped. Other policies MUST continue + to execute. +- If a policy fails to evaluate (e.g., invalid regular expression at runtime), + the telemetry MUST pass through unmodified by that policy. +- Policy failures MUST NOT cause telemetry loss. + +Implementations SHOULD log policy evaluation errors for debugging. + +#### Disabled Policies + +Policies with `enabled: false` MUST NOT be evaluated. Implementations MUST treat +disabled policies as if they do not exist. + +### Merging policies + +Policy merging has two distinct concerns: how a provider **transports** policy +updates to a client, and how a runtime **resolves** overlapping policies at +evaluation time. We address each in turn. + +#### Transport-level sync + +Since the policy itself does not enforce a transport mechanism or format, the +sync mechanism is also not enforced by the policy. However, all transport +implementations SHOULD follow these principles: + +**Prefer full-set replacement over patching.** Transmitting the complete policy +set on each sync avoids traditional merge pitfalls — field ordering ambiguity, +partial update conflicts, and array operation incompatibilities. A provider +SHOULD send the full list of active policies and the client SHOULD atomically +replace its local set. Implementations SHOULD support a hash or version +identifier for change detection so that clients can skip processing when the +policy set has not changed. + +**Support incremental updates as an optimization, not a requirement.** Transport +protocols MAY support incremental diffs (e.g., add/remove individual policies by +ID) as a bandwidth optimization. When incremental updates are supported, the +protocol MUST also provide a mechanism for the client to request a full sync to +recover from drift or missed updates. 
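Full-set replacement with change detection, as described above, might look like the following on the client side. This is a minimal sketch: `PolicyStore`, `policy_set_hash`, and the choice of SHA-256 over canonical JSON are assumptions for illustration, since the proposal leaves the hash or version identifier to the transport.

```python
import hashlib
import json
import threading


def policy_set_hash(policies: list) -> str:
    """Order-independent digest of a policy set, keyed by policy ID."""
    canonical = json.dumps(sorted(policies, key=lambda p: p["id"]), sort_keys=True)
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


class PolicyStore:
    """Holds the active policy set and swaps it atomically on each sync."""

    def __init__(self) -> None:
        self._lock = threading.Lock()
        self._hash = policy_set_hash([])
        self._policies: tuple = ()

    def sync(self, policies: list) -> bool:
        """Replace the full set; return False when nothing changed."""
        new_hash = policy_set_hash(policies)
        with self._lock:
            if new_hash == self._hash:
                return False  # skip reprocessing an unchanged set
            self._policies = tuple(policies)
            self._hash = new_hash
            return True

    def snapshot(self) -> tuple:
        """Return the current set; readers never see a half-applied update."""
        with self._lock:
            return self._policies
```

An incremental protocol could be layered on top by applying the diff to a copy of the snapshot and passing the result to `sync`, with a full sync as the recovery path.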
+ +**Report policy status back to the provider.** Transport protocols SHOULD +provide a mechanism for clients to report per-policy status (match counts, +errors) back to the provider. This feedback loop enables providers to detect +misconfigured or ineffective policies. Status SHOULD be scoped to each provider +— a provider only receives status for the policies it supplies. Each provider is +responsible for ensuring its policies are not disruptive to the system. + +**Resolve duplicate policy IDs by provider priority.** When multiple providers +supply a policy with the same ID, the client must decide which one to keep. +Implementations SHOULD assign each provider a priority — for example, OpAMP (1), +HTTP (2), FILE (3), CUSTOM (user-defined) — where a lower number is higher +priority. When two policies share the same ID, the policy from the +higher-priority provider wins and the other is dropped. Where a policy from a +lower-priority provider cannot be merged consistently with the higher-priority +version, the lower-priority policy SHOULD be dropped in its entirety. + +The specific mechanism will depend on the `PolicyProvider` implementation: + +- A `FileProvider` reads the full policy set from disk (YAML, JSON, or proto + binary). Each read produces a complete snapshot; no patch semantics are + needed. +- An HTTP or gRPC provider SHOULD implement request/response sync with + hash-based change detection and support for client metadata (supported policy + stages, resource attributes). +- OpAMP providers can embed the policy set in an OpAMP custom-message or + agent-config payload, reusing OpAMP's existing change-detection mechanisms. + +#### Runtime conflict resolution + +Because policies are independent and self-contained, multiple policies may match +the same piece of telemetry. When this happens, the runtime must combine their +effects. 
Regardless of how an implementation structures its evaluation, the +following properties MUST hold: + +- **Commutativity.** The result of applying a set of matching policies MUST NOT + depend on the order in which they are processed. +- **Idempotency.** Applying the same policy twice MUST produce the same result + as applying it once. +- **Determinism.** Given the same set of matching policies and the same + telemetry, every instance MUST produce the same output. + +These properties ensure that policies can be distributed across agents and +collectors without coordination, and that the outcome is reproducible regardless +of processing order. + +As a concrete example, consider how a runtime might resolve conflicting `keep` +values. A naïve approach — last write wins — violates commutativity: + +```python +# Bad: result depends on processing order +def resolve_keep_naive(matching_policies): + result = "all" + for policy in matching_policies: + result = policy.keep # last one wins + return result +``` + +Instead, the runtime can apply a **commutative reduction** that always converges +to the same answer. For `keep`, a natural choice is "most restrictive wins": + +```python +def restrictiveness(keep): + """Returns a numeric rank for a keep value. Lower = more restrictive. 
+ + Ranking: + none → 0 (drop everything) + N/s → 1 (N per second, rate limited) + N/m → 2 (N per minute, rate limited) + N% → 3 (percentage sampling, ordered by percentage ascending) + all → 4 (keep everything) + """ + if keep.value == "none": + return (0,) + if keep.unit == "per_second": + return (1, keep.amount) + if keep.unit == "per_minute": + return (2, keep.amount) + if keep.unit == "percent": + return (3, keep.percentage) + if keep.value == "all": + return (4,) + +def most_restrictive(a, b): + """Commutative merge: returns whichever keep value is more restrictive.""" + return a if restrictiveness(a) <= restrictiveness(b) else b + +def resolve_keep(matching_policies): + """Resolve conflicting keep values across all matching policies. + + The result is independent of policy ordering because most_restrictive + is commutative: most_restrictive(a, b) == most_restrictive(b, a). + """ + result = Keep("all") + for policy in matching_policies: + if policy.keep is None: + continue + result = most_restrictive(result, Keep(policy.keep)) + return result +``` + +The same principle extends to any policy field where multiple policies may +contribute values. For each such field, the implementation should define a +commutative merge operation — for example, taking the minimum, taking the union, +or applying a deterministic priority order. Where no natural commutative +operation exists (e.g., two policies set different values for the same +attribute), implementations MUST process policies in a consistent order (e.g., +alphanumerically by policy ID) to ensure reproducible results across instances. + +## Trade-offs and mitigations + +This specification makes deliberate trade-offs in favor of simplicity and scale. + +**No user-defined ordering.** You cannot specify that policy A runs before +policy B. This is intentional—ordering creates dependencies, and dependencies +break the independence that makes policies scale. The trade-off is less +flexibility. 
If you need strict ordering, you need separate processing stages +outside the policy system. + +**No conditional logic.** Policies don't support if/else or branching. Each +policy is a simple predicate and action. Complex conditional logic belongs in +your application code, not your telemetry processing. This keeps policies easy +to understand and easy to generate. + +**No cross-policy references.** A policy cannot reference another policy's +output or depend on another policy having run. This limits composition but +ensures every policy is self-contained, allowing a user to run a policy anywhere +and verify its correctness. You can reason about each policy in isolation. + +These constraints exist because the primary goal is scale—tens of thousands of +policies executing efficiently. Every feature that adds complexity makes that +goal harder. The spec intentionally stays minimal. + +## Prior art and alternatives + +This section examines existing approaches to telemetry processing and control, +analyzing their strengths and limitations relative to the policy model proposed +here. + +### Pipeline Configurations + +Pipeline-based configurations are the dominant model for telemetry processing. +Tools like Vector, Fluent Bit, Logstash, and the OpenTelemetry Collector define +processing as a directed acyclic graph (DAG) of components. Data flows through +receivers, processors, and exporters in a defined sequence, with each component +transforming the data before passing it to the next. + +**Pros:** + +- Expressive and flexible: arbitrary transformations are possible at each stage. +- Well-understood model with extensive tooling and community knowledge. +- Supports complex routing, fan-out, and conditional logic. +- Mature implementations with production-proven reliability. + +**Cons:** + +- Requires global reasoning: changing one component may affect downstream + behavior. +- Configuration complexity grows with rule count; thousands of rules become + unmanageable. 
+- Sequential execution creates performance bottlenecks as rules multiply. +- Not portable: each tool has its own configuration format and semantics. +- Interdependencies make it difficult to effectively remotely modify policies. + +### OPA (Open Policy Agent) + +OPA provides a general-purpose policy engine using the Rego query language. +Originally designed for authorization and admission control in cloud native +environments, OPA can evaluate arbitrary policies against structured data. It is +widely used in Kubernetes admission control, API authorization, and +infrastructure policy enforcement. + +**Pros:** + +- Turing-complete policy language enables complex conditional logic. +- Decouples policy from enforcement: policies are data, not code. +- Strong ecosystem with tooling for testing, debugging, and distribution. +- Supports policy bundles for centralized management. + +**Cons:** + +- Rego has a steep learning curve; it is not intuitive for most engineers. +- General-purpose design means no telemetry-specific optimizations. +- Policies can have arbitrary logic, making behavior harder to predict. +- Evaluation overhead may be prohibitive for high-throughput telemetry streams. + +### Datadog Processing Pipelines (prior art) + +Datadog provides a UI-driven approach to log processing. Users define pipelines +containing processors that parse, enrich, filter, and transform logs. Each +processor has a filter (matcher) and an action. The UI abstracts the underlying +configuration, making it accessible to non-engineers. + +**Pros:** + +- User-friendly interface lowers the barrier to creating processing rules. +- Each processor is conceptually similar to a policy: matcher plus action. +- Integrated with Datadog's broader observability platform. +- Managed service eliminates operational burden. + +**Cons:** + +- Vendor lock-in: rules are specific to Datadog and not portable. +- Limited to Datadog's supported transformations and matchers. 
+- No programmatic API for bulk rule management at scale. +- Opaque execution model makes debugging difficult. + +### OpenTelemetry Collector Processors / OTTL + +The OpenTelemetry Collector includes processors for common telemetry +transformations: `filter` for dropping data, `attributes` for modifying +attributes, `transform` for OTTL-based transformations, and others. These are +configured in YAML as part of the collector pipeline. + +**Pros:** + +- Native to the OpenTelemetry ecosystem with strong community support. +- OTTL (OpenTelemetry Transformation Language) provides a structured + transformation syntax. +- Processors are composable within the pipeline model. +- Open source with transparent behavior. + +**Cons:** + +- Rules are embedded in pipeline configuration, not standalone. +- Adding rules requires understanding the full pipeline context. +- Not portable to SDKs or other runtimes without another implementation. +- No native support for dynamic updates without configuration reload. +- Scale is limited by the sequential processing model. +- No defined grammar for OTTL, making it impossible to run outside the + collector. + +### Declarative Config + OpAMP as sole control for telemetry + +Declarative config + OpAMP could be used to send any config to any component in +OpenTelemetry. Here, we would leverage OpAMP configuration passing and the +open-extension and definitions of Declarative Config to pass the whole behavior +of an SDK or Collector from an OpAMP "controlling server" down to a component +and have it dynamically reload that behavior. + +What this solution does not answer is how to determine which config can be +sent to which component, or how to drive control / policy independently of +implementation or pipeline setup. For example, imagine a simple collector +configuration: + +```yaml +receivers: + otlp: + prometheus: + # ... config ... 
+processors: + batch: + memory_limiter: + transform/drop_attribute: + # config to drop an attribute +exporters: + otlp: +service: + pipelines: + metrics/critical: + receivers: [otlp] + processors: [batch, transform/drop_attribute] + exporters: [otlp] + metrics/all: + receivers: [prometheus] + processors: [memory_limiter] + exporters: [otlp] +``` + +Here, we have two pipelines with intended purposes and tuned configurations: one +that will _not_ drop metrics when memory limits are reached, and another that +will. Now, if we want to stop a particular metric from being reported, which +pipeline do we modify? Should we construct a new processor for that purpose? +Should we always do so? + +Now imagine we _also_ have an SDK we're controlling with declarative config. If +we want to control metric inclusion in that SDK, we'd need to generate a +completely different-looking configuration file, as follows: + +```yaml +file_format: "1.0-rc.1" +# ... other config ... +meter_provider: + readers: + - my_custom_metric_filtering_reader: + my_filter_config: # defines what to filter + wrapped: + periodic: + exporter: + otlp_http: + endpoint: ${OTEL_EXPORTER_OTLP_ENDPOINT:-http://localhost:4318}/v1/metrics +``` + +Here, I've created a custom component in Java to allow filtering which metrics +are read. However, to insert or use this component, I need to: + +- Know that this component exists in the Java SDK. +- Know how to wire it into any existing metric export pipeline (e.g. my reader + wraps another reader that has the real export config). Note: This likely means + I need to understand the rest of the exporter configuration or be able to + parse it. + +This is not ideal for a few reasons: + +- Anyone designing a server that can control telemetry flow MUST have a deep + understanding of all components it could control and their implementations. 
+- We don't have a "safe" mechanism to declare what configuration is supported or + could be sent to a specific component (note: we can design one). +- The level of control we'd expose from our telemetry systems is _expansive_ and + possibly dangerous. + - We cannot limit the impact of any remote configuration on the operation of a + system. We cannot prevent changes that may take down a process. + - We cannot limit the execution overhead of configuration, nor exercise + fine-grained control over which changes are allowed remotely. + +### Summary + +This specification draws from all of these approaches but prioritizes +independence, portability, and scale over flexibility. Where pipeline +configurations offer maximum expressiveness, policies offer predictability. +Where OPA provides a general-purpose language, policies provide a minimal, +purpose-built model. Where vendor solutions lock users in, policies use +OpenTelemetry's data model for portability. + +## Open questions + +What are some questions that you know aren't resolved yet by the OTEP? These may +be questions that could be answered through further discussion, implementation +experiments, or anything else that the future may bring. + +## Prototypes + +- [usetero/policy](https://github.com/usetero/policy) + - The policy specification, defining the schema, matching behavior, merging + semantics, and conformance requirements. +- [usetero/policy-go](https://github.com/usetero/policy-go) + - Go implementation of the policy specification, designed for integration with + the OpenTelemetry Collector and other Go-based telemetry components. +- [usetero/policy-rs](https://github.com/usetero/policy-rs) + - Rust implementation of the policy specification, leveraging Hyperscan for + high-performance regular expression matching. +- [usetero/policy-zig](https://github.com/usetero/policy-zig) + - Zig implementation of the policy specification, targeting zero heap + allocations on the hot path and maximum portability. 
+- [usetero/policy-conformance](https://github.com/usetero/policy-conformance) + - Cross-language conformance test suite with 160+ tests covering filtering, + sampling, transformations, and consistent behavior across implementations. + +## Future possibilities + +What are some future changes that this proposal would enable?