gep-4768: add TelemetryPolicy API proposal#4872
Conversation
Added an initial proposal for the TelemetryPolicy API. This includes an example, the Go structs, and comparison with prior art.
|
[APPROVALNOTIFIER] This PR is NOT APPROVED This pull-request has been approved by: gkhom The full list of commands accepted by this bot can be found here. DetailsNeeds approval from an approver in each of these files:Approvers can indicate their approval by writing |
|
Hi @gkhom. Thanks for your PR. I'm waiting for a kubernetes-sigs member to verify that this patch is reasonable to test. If it is, they should reply with Regular contributors should join the org to skip this step. Once the patch is verified, the new status will be reflected by the I understand the commands that are listed here. DetailsInstructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. |
|
/ok-to-test |
|
|
||
| # 1. Tracing Configuration | ||
| tracing: | ||
| mode: "On" |
There was a problem hiding this comment.
the mode here is kind of a weird value. Usually we avoid to have bools or "wana be bools" (On/Off, True/False) and instead make them more explicit on Behavior (Disabled, PartiallyEnabled, FullyEnabled)
If we can only have On/Off, wouldn't it make more sense to say "if tracing is null, tracing should be disabled" instead?
There was a problem hiding this comment.
(btw this applies to every "Mode" defined here)
| # 3. Access Logs Configuration | ||
| accessLogs: | ||
| mode: "Off" # Explicitly disabled while keeping the configuration intact | ||
| matches: "response.code >= 500" # Conditional logging, CEL filtering for errors |
There was a problem hiding this comment.
honest question here: is CEL matching for logs something usually supported by OTEL exporters, or is this something specific from Istio that we are carrying?
|
|
||
| The following are the Go structs modeling the proposed specification. | ||
|
|
||
| ```Go |
There was a problem hiding this comment.
| ```Go | |
| ```go |
the case here makes difference for markdown rendering :)
| metav1.TypeMeta `json:",inline"` | ||
| metav1.ObjectMeta `json:"metadata,omitempty"` | ||
|
|
||
| Spec TelemetryPolicySpec `json:"spec"` |
There was a problem hiding this comment.
to avoid previous mistakes, can we make it explicit this is required?
| Spec TelemetryPolicySpec `json:"spec"` | |
| // Spec defines the desired state of TLSRoute. | |
| // +required | |
| Spec TelemetryPolicySpec `json:"spec"` |
There was a problem hiding this comment.
nit: you mean "the desired state of TelemetryPolicy" probably
|
If it helps with some of this API design, here is the ObservabilityPolicy API used by NGINX Gateway Fabric to configure the nginx-otel module. It attaches to a Route. NGINX supports setting span names and trace context as well, which I don't see in the proposed API. In addition, we configure the exporter settings at the Gateway level in our NginxProxy API, which includes things like Since we already support all of these fields, the new TelemetryPolicy would need to support them as well in some form before we would be able to migrate to it. |
| Mode TelemetryMode `json:"mode,omitempty"` | ||
|
|
||
| // The sampling rate to apply when the parent span decision is used. | ||
| SamplingRate *Fraction `json:"samplingRate,omitempty"` |
There was a problem hiding this comment.
Why is this sampling rate needed? If we're already inheriting from the parent, isn't that making the decision on whether or not to sample?
There was a problem hiding this comment.
Is the samplingRate in this context treated as a fallback strategy when no sampling decision has been propagated from the parent?
|
|
||
| const ( | ||
| // CustomAttributeTypeHeader extracts the value from an HTTP header. | ||
| CustomAttributeTypeHeader CustomAttributeType = "Header" |
There was a problem hiding this comment.
can we be explicit of what can be added or not? I would expect that we are careful with things like "Authorization" headers and other sensitive headers.
| const ( | ||
| // CustomAttributeTypeHeader extracts the value from an HTTP header. | ||
| CustomAttributeTypeHeader CustomAttributeType = "Header" | ||
| // CustomAttributeTypeMetadata extracts the value from proxy metadata or context. |
There was a problem hiding this comment.
so, Metadata is a bit of a confusing concept in this context. I am worried we are making it very "Envoy" centric (unless @sjberman tells me he gets the concept of "metadata" here out of Nginx)
As we are leaning towards OTEL, I think it would be good to be more explicit on what each attribute means (eg.: https://opentelemetry.io/docs/specs/semconv/registry/attributes/http/ ) and use some OTEL references here
There was a problem hiding this comment.
No concept of metadata in nginx.
| } | ||
|
|
||
| type TelemetryPolicySpec struct { | ||
| // Identifies the target gateways to which this policy attaches (GEP-713). |
There was a problem hiding this comment.
we must be specific here that at least one target is required
| // Identifies the target gateways to which this policy attaches (GEP-713). | ||
| TargetRefs []NamespacedPolicyTargetReference `json:"targetRefs"` | ||
|
|
||
| // Configuration for distributed tracing options. |
There was a problem hiding this comment.
The comment must be more complete. Think on users using this API. We must write here what is tracing, what Gateway users will get from it.
This applies for every comment on every struct, I would recommend taking a look into https://gateway-api.sigs.k8s.io/reference/api-spec/main/spec/#backendtlspolicy
Each API field must contain:
- What is the field
- What happens when a user configure that field (or in the absence of that field)
- Are there specific conflicting options?
- Anything else that may help users to use
Also, given this is a field documentation for implementations, you must consider:
- What an implementation must know when considering that field? (does setting it has any caveat? Does the implementation needs to consider some condition status?)
- What the permutation of possible values can cause that the implementation must care? (think on conformance tests)
Think on https://gateway-api.sigs.k8s.io/guides/api-design/. The documentation must be separated between user facing and implementation facing (when writing implementation/developers facing, use the tag <gateway:util:excludeFromCRD></gateway:util:excludeFromCRD>)
Also, when defining fields you must define what kind of support you expect for it. Eg.:
- core - means that the field is required to be recognized by the implementation for this feature to work. eg.: targetRefs
- extended - means that the feature/field should be implemented by implementations that claim its support, but are not mandatory for the whole feature. eg.: a telemetryPolicy may support tracing, or metrics, or accessLogs, but doesn't require all of them to work
- implementationSpecific - We don't do conformance test, and a user using it knows they will not have portability between implementations. We should avoid this kind of support, and use only if strictly required on new apis. An example is my question about the cel matching for logs, as I don't think every implementation supports it
There was a problem hiding this comment.
+1 defining support levels at this stage is very important. Also worth keeping in mind, everything in core and extended should be covered by conformance tests, otherwise it risks becoming implementation-specific anyway.
| tracing: | ||
| mode: "On" | ||
| provider: | ||
| endpoint: "otel-collector.monitoring.svc:4317" |
There was a problem hiding this comment.
should probably use backendRef style API to align with other APIs
There was a problem hiding this comment.
+1 otherwise we have to discuss again how to specify TLS, ... (as also mentioned in https://github.com/kubernetes-sigs/gateway-api/pull/4872/changes/BASE..084523a5d6aa69293ca1b9577fabdab8487ba0da#r3260666205)
| // Mode explicitly controls if metric generation is enabled. Valid values are "On" or "Off". | ||
| // +kubebuilder:validation:Enum=On;Off | ||
| // +kubebuilder:default=On | ||
| Mode TelemetryMode `json:"mode,omitempty"` |
There was a problem hiding this comment.
For all the modes, what happens if I do not set the field? is it explicitly OFF or can it be "default for the proxy"?
What if I do not attach a TelemetryPolicy at all?
| // --- Metrics Types --- | ||
|
|
||
| type MetricsConfig struct { | ||
| // Mode explicitly controls if metric generation is enabled. Valid values are "On" or "Off". |
There was a problem hiding this comment.
What does it mean to turn off metrics generation? It does not seem reasonable to entirely disable all metrics for an application
| Mode TelemetryMode `json:"mode,omitempty"` | ||
|
|
||
| // List of configurations to customize specific metric families. | ||
| Overrides []MetricOverride `json:"overrides,omitempty"` |
There was a problem hiding this comment.
What does this actually do? Is it customizing an existing metric? adding a new one?
| MetricAttributeTypeLiteral MetricAttributeType = "Literal" | ||
| ) | ||
|
|
||
| type MetricAttribute struct { |
There was a problem hiding this comment.
So we can add fields only right, not remove them?
| // A list of specific fields or headers to include in the logs. | ||
| Fields []string `json:"fields,omitempty"` | ||
|
|
||
| // A list of specific fields to include in the logs, specifying their source. | ||
| Fields []LogField `json:"fields,omitempty"` |
| // This is required if Type is "Literal". | ||
| LiteralValue *string `json:"literalValue,omitempty"` | ||
|
|
||
| // StandardValue specifies a standard log property (e.g., "RequestStartTime", "Duration"). |
There was a problem hiding this comment.
Are these defined? or implementation specific?
| } | ||
| ``` | ||
|
|
||
| ## Comparison with Prior Art |
There was a problem hiding this comment.
This API seems to basically just be directly Istio's API. It doesn't seem like we really explored prior art beyond that even though there was substantial feedback on the original agentic-networking exploring alternative approaches. Can we get some of that context here? Even if its 'ideas considered but rejected' its still useful
| } | ||
|
|
||
| type TracingProvider struct { | ||
| Endpoint string `json:"endpoint,omitempty"` |
There was a problem hiding this comment.
How about TLS options and custom headers, per #4775 (comment)
| LogFieldTypeStandard LogFieldType = "Standard" | ||
| ) | ||
|
|
||
| type LogField struct { |
There was a problem hiding this comment.
It's not clear to me what the fields LogField in the log represents, is it a JSON field for specifying additional key-value pairs?
How do you plan to support customization of nested JSON structures? These are commonly used in access logs, e.g., in the Airlock Microgateway we use the Elastic Common Schema (ECS).
Since access logs tend to be highly implementation-specific, IMO a format string or an extension point is probably the best approach here.
| Mode TelemetryMode `json:"mode,omitempty"` | ||
|
|
||
| // The sampling rate to apply when the parent span decision is used. | ||
| SamplingRate *Fraction `json:"samplingRate,omitempty"` |
There was a problem hiding this comment.
Is the samplingRate in this context treated as a fallback strategy when no sampling decision has been propagated from the parent?
| // Identifies the target gateways to which this policy attaches (GEP-713). | ||
| TargetRefs []NamespacedPolicyTargetReference `json:"targetRefs"` | ||
|
|
||
| // Configuration for distributed tracing options. |
There was a problem hiding this comment.
+1 defining support levels at this stage is very important. Also worth keeping in mind, everything in core and extended should be covered by conformance tests, otherwise it risks becoming implementation-specific anyway.
| } | ||
| ``` | ||
|
|
||
| ## Comparison with Prior Art |
There was a problem hiding this comment.
Could you also include the Airlock Microgateway Telemetry CR and the NGINX Tracing API as prior art?
| name: my-gateway | ||
|
|
||
| # 1. Tracing Configuration | ||
| tracing: |
There was a problem hiding this comment.
should enabling of tracing imply that 'traceparent' header propagation also happens?
What type of PR is this?
/kind gep
What this PR does / why we need it:
Added an initial proposal for the TelemetryPolicy API (gep-4768). This includes an example, the Go structs, and comparison with prior art. (See: #4768)
Does this PR introduce a user-facing change?: