gep-4768: add TelemetryPolicy API proposal #4872

Open
gkhom wants to merge 1 commit into kubernetes-sigs:main from gkhom:main

@gkhom (Contributor) commented May 17, 2026

What type of PR is this?

/kind gep

What this PR does / why we need it:

Added an initial proposal for the TelemetryPolicy API (gep-4768). This includes an example, the Go structs, and a comparison with prior art. (See: #4768)

Does this PR introduce a user-facing change?:

NONE

@k8s-ci-robot added the release-note-none, kind/gep, and do-not-merge/hold labels May 17, 2026
@k8s-ci-robot (Contributor)

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: gkhom
Once this PR has been reviewed and has the lgtm label, please assign youngnick for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot added the cncf-cla: yes, size/L, and needs-ok-to-test labels May 17, 2026
@k8s-ci-robot (Contributor)

Hi @gkhom. Thanks for your PR.

I'm waiting for a kubernetes-sigs member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work.

Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@LiorLieberman (Member)

/ok-to-test
/cc @rikatz
/cc

@k8s-ci-robot added the ok-to-test label May 18, 2026
@k8s-ci-robot requested a review from rikatz May 18, 2026 14:19
@k8s-ci-robot removed the needs-ok-to-test label May 18, 2026
Comment thread geps/gep-4768/index.md

# 1. Tracing Configuration
tracing:
  mode: "On"
Member

The mode here is kind of a weird value. We usually avoid bools or "wannabe bools" (On/Off, True/False) and instead make the values explicit about behavior (Disabled, PartiallyEnabled, FullyEnabled).

If we can only have On/Off, wouldn't it make more sense to say "if tracing is null, tracing should be disabled" instead?

Member

(btw this applies to every "Mode" defined here)
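
One way to read the suggestion above, as a minimal sketch only: drop the mode field and treat an absent tracing block as disabled. The names below (`TracingConfig`, its `SamplingPercent` field, the trimmed `TelemetryPolicySpec`, and `tracingEnabled`) are hypothetical and chosen for illustration; they are not the structs from the GEP.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// TracingConfig is a hypothetical stand-in for the proposal's tracing block.
type TracingConfig struct {
	// SamplingPercent is an illustrative field, not part of the proposal.
	SamplingPercent int `json:"samplingPercent,omitempty"`
}

// TelemetryPolicySpec carries only the field needed for this sketch.
type TelemetryPolicySpec struct {
	// Tracing is a pointer so that "absent" can be told apart from "empty".
	Tracing *TracingConfig `json:"tracing,omitempty"`
}

// tracingEnabled treats a missing tracing block as "disabled".
func tracingEnabled(spec TelemetryPolicySpec) bool {
	return spec.Tracing != nil
}

func main() {
	var off, on TelemetryPolicySpec
	// No tracing key at all: the pointer stays nil.
	if err := json.Unmarshal([]byte(`{}`), &off); err != nil {
		panic(err)
	}
	// An empty tracing object still counts as present.
	if err := json.Unmarshal([]byte(`{"tracing":{}}`), &on); err != nil {
		panic(err)
	}
	fmt.Println(tracingEnabled(off), tracingEnabled(on))
}
```

The trade-off of this shape is that a user can no longer keep a block in place while switching it off, which is what the `mode: "Off"` comment in the access-log example relies on.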

Comment thread geps/gep-4768/index.md
# 3. Access Logs Configuration
accessLogs:
  mode: "Off" # Explicitly disabled while keeping the configuration intact
  matches: "response.code >= 500" # Conditional logging, CEL filtering for errors
Member

Honest question here: is CEL matching for logs something usually supported by OTEL exporters, or is this something specific to Istio that we are carrying over?

Comment thread geps/gep-4768/index.md

The following are the Go structs modeling the proposed specification.

```Go
Member

Suggested change
```Go
```go

The case here makes a difference for markdown rendering :)

Comment thread geps/gep-4768/index.md
metav1.TypeMeta `json:",inline"`
metav1.ObjectMeta `json:"metadata,omitempty"`

Spec TelemetryPolicySpec `json:"spec"`
Member

to avoid previous mistakes, can we make it explicit this is required?

Suggested change
Spec TelemetryPolicySpec `json:"spec"`
// Spec defines the desired state of TLSRoute.
// +required
Spec TelemetryPolicySpec `json:"spec"`

@LiorLieberman (Member) May 19, 2026

nit: you mean "the desired state of TelemetryPolicy" probably

@sjberman (Contributor)

If it helps with some of this API design, here is the ObservabilityPolicy API used by NGINX Gateway Fabric to configure the nginx-otel module. It attaches to a Route.

NGINX supports setting span names and trace context as well, which I don't see in the proposed API.

In addition, we configure the exporter settings at the Gateway level in our NginxProxy API, which includes things like service.name attribute, export interval, batch size, and batch count.

Since we already support all of these fields, the new TelemetryPolicy would need to support them as well in some form before we would be able to migrate to it.

Comment thread geps/gep-4768/index.md
Mode TelemetryMode `json:"mode,omitempty"`

// The sampling rate to apply when the parent span decision is used.
SamplingRate *Fraction `json:"samplingRate,omitempty"`
Contributor

Why is this sampling rate needed? If we're already inheriting from the parent, isn't that making the decision on whether or not to sample?

Member

Is the samplingRate in this context treated as a fallback strategy when no sampling decision has been propagated from the parent?

Comment thread geps/gep-4768/index.md

const (
// CustomAttributeTypeHeader extracts the value from an HTTP header.
CustomAttributeTypeHeader CustomAttributeType = "Header"
Member

Can we be explicit about what can and cannot be added? I would expect us to be careful with things like "Authorization" headers and other sensitive headers.

Comment thread geps/gep-4768/index.md
const (
// CustomAttributeTypeHeader extracts the value from an HTTP header.
CustomAttributeTypeHeader CustomAttributeType = "Header"
// CustomAttributeTypeMetadata extracts the value from proxy metadata or context.
Member

so, Metadata is a bit of a confusing concept in this context. I am worried we are making it very "Envoy" centric (unless @sjberman tells me he gets the concept of "metadata" here out of Nginx)

As we are leaning towards OTEL, I think it would be good to be more explicit on what each attribute means (eg.: https://opentelemetry.io/docs/specs/semconv/registry/attributes/http/ ) and use some OTEL references here

Contributor

No concept of metadata in nginx.

Comment thread geps/gep-4768/index.md
}

type TelemetryPolicySpec struct {
// Identifies the target gateways to which this policy attaches (GEP-713).
Member

we must be specific here that at least one target is required

Comment thread geps/gep-4768/index.md
// Identifies the target gateways to which this policy attaches (GEP-713).
TargetRefs []NamespacedPolicyTargetReference `json:"targetRefs"`

// Configuration for distributed tracing options.
@rikatz (Member) May 18, 2026

The comment must be more complete. Think of the users of this API. We must write here what tracing is and what Gateway users will get from it.

This applies to every comment on every struct. I would recommend taking a look at https://gateway-api.sigs.k8s.io/reference/api-spec/main/spec/#backendtlspolicy

Each API field must contain:

  • What the field is
  • What happens when a user configures that field (or in the absence of that field)
  • Whether there are specific conflicting options
  • Anything else that may help users use it

Also, given this is field documentation for implementations, you must consider:

  • What must an implementation know when considering that field? (Does setting it have any caveats? Does the implementation need to consider some condition status?)
  • Which permutations of possible values must the implementation care about? (Think of conformance tests.)

See https://gateway-api.sigs.k8s.io/guides/api-design/. The documentation must be separated between user-facing and implementation-facing (when writing implementation/developer-facing text, use the tag <gateway:util:excludeFromCRD></gateway:util:excludeFromCRD>)

Also, when defining fields you must define what kind of support you expect for each. E.g.:

  • core - the field must be recognized by the implementation for this feature to work, e.g. targetRefs
  • extended - the feature/field should be implemented by implementations that claim its support, but is not mandatory for the whole feature, e.g. a TelemetryPolicy may support tracing, or metrics, or accessLogs, but doesn't require all of them to work
  • implementationSpecific - we don't do conformance tests, and a user using it knows they will not have portability between implementations. We should avoid this kind of support and use it only if strictly required on new APIs. An example is my question about the CEL matching for logs, as I don't think every implementation supports it

Member

+1 defining support levels at this stage is very important. Also worth keeping in mind, everything in core and extended should be covered by conformance tests, otherwise it risks becoming implementation-specific anyway.

Comment thread geps/gep-4768/index.md
tracing:
  mode: "On"
  provider:
    endpoint: "otel-collector.monitoring.svc:4317"
Contributor

should probably use backendRef style API to align with other APIs

Comment thread geps/gep-4768/index.md
// Mode explicitly controls if metric generation is enabled. Valid values are "On" or "Off".
// +kubebuilder:validation:Enum=On;Off
// +kubebuilder:default=On
Mode TelemetryMode `json:"mode,omitempty"`
Contributor

For all the modes, what happens if I do not set the field? is it explicitly OFF or can it be "default for the proxy"?

What if I do not attach a TelemetryPolicy at all?

Comment thread geps/gep-4768/index.md
// --- Metrics Types ---

type MetricsConfig struct {
// Mode explicitly controls if metric generation is enabled. Valid values are "On" or "Off".
Contributor

What does it mean to turn off metrics generation? It does not seem reasonable to entirely disable all metrics for an application

Comment thread geps/gep-4768/index.md
Mode TelemetryMode `json:"mode,omitempty"`

// List of configurations to customize specific metric families.
Overrides []MetricOverride `json:"overrides,omitempty"`
Contributor

What does this actually do? Is it customizing an existing metric? adding a new one?

Comment thread geps/gep-4768/index.md
MetricAttributeTypeLiteral MetricAttributeType = "Literal"
)

type MetricAttribute struct {
Contributor

So we can only add fields, right, not remove them?

Comment thread geps/gep-4768/index.md
Comment on lines +332 to +336
// A list of specific fields or headers to include in the logs.
Fields []string `json:"fields,omitempty"`

// A list of specific fields to include in the logs, specifying their source.
Fields []LogField `json:"fields,omitempty"`
Contributor

dupe field?

Comment thread geps/gep-4768/index.md
// This is required if Type is "Literal".
LiteralValue *string `json:"literalValue,omitempty"`

// StandardValue specifies a standard log property (e.g., "RequestStartTime", "Duration").
Contributor

Are these defined? or implementation specific?

Comment thread geps/gep-4768/index.md
}
```

## Comparison with Prior Art
Contributor

This API seems to basically be Istio's API directly. It doesn't seem like we really explored prior art beyond that, even though there was substantial feedback on the original agentic-networking proposal exploring alternative approaches. Can we get some of that context here? Even if it's "ideas considered but rejected", it's still useful.

Comment thread geps/gep-4768/index.md
}

type TracingProvider struct {
Endpoint string `json:"endpoint,omitempty"`
Contributor

How about TLS options and custom headers, per #4775 (comment)
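
To make the question concrete, here is a rough sketch of what a provider with TLS options and custom headers might look like. Only `TracingProvider` and its `Endpoint` field come from the proposal; `ProviderTLS`, `HeaderValue`, their fields, and the `validate` helper are assumptions made up for illustration and are not taken from the referenced comment.

```go
package main

import (
	"errors"
	"fmt"
	"strings"
)

// HeaderValue is a hypothetical name/value pair sent with every export request.
type HeaderValue struct {
	Name  string `json:"name"`
	Value string `json:"value"`
}

// ProviderTLS is a hypothetical TLS block; both field names are assumptions.
type ProviderTLS struct {
	// CACertificateRef names an object holding the CA bundle.
	CACertificateRef string `json:"caCertificateRef,omitempty"`
	// Hostname is the name to verify in the collector's certificate.
	Hostname string `json:"hostname,omitempty"`
}

// TracingProvider keeps the Endpoint field from the proposal and adds two
// illustrative optional fields.
type TracingProvider struct {
	Endpoint string        `json:"endpoint,omitempty"`
	TLS      *ProviderTLS  `json:"tls,omitempty"`
	Headers  []HeaderValue `json:"headers,omitempty"`
}

// validate rejects an empty endpoint, empty header names, and header names
// that repeat when compared case-insensitively.
func (p TracingProvider) validate() error {
	if p.Endpoint == "" {
		return errors.New("endpoint must not be empty")
	}
	seen := map[string]bool{}
	for _, h := range p.Headers {
		key := strings.ToLower(h.Name)
		if key == "" {
			return errors.New("header name must not be empty")
		}
		if seen[key] {
			return fmt.Errorf("duplicate header %q", h.Name)
		}
		seen[key] = true
	}
	return nil
}

func main() {
	p := TracingProvider{
		Endpoint: "otel-collector.monitoring.svc:4317",
		TLS:      &ProviderTLS{Hostname: "otel-collector.monitoring.svc"},
		Headers:  []HeaderValue{{Name: "x-tenant", Value: "team-a"}},
	}
	fmt.Println(p.validate())
}
```

In a real CRD, checks like these would more likely be expressed as validation markers or CEL rules than as Go code; the helper only shows which invariants might be worth stating.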

Comment thread geps/gep-4768/index.md
LogFieldTypeStandard LogFieldType = "Standard"
)

type LogField struct {
Member

It's not clear to me what a LogField entry represents in the log. Is it a JSON field for specifying additional key-value pairs?

How do you plan to support customization of nested JSON structures? These are commonly used in access logs, e.g., in the Airlock Microgateway we use the Elastic Common Schema (ECS).

Since access logs tend to be highly implementation-specific, IMO a format string or an extension point is probably the best approach here.
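
As an illustration of the nesting concern, the sketch below shows one possible convention: a flat list of fields whose dotted names are expanded into nested JSON objects. This is an assumption for discussion, not something the proposal defines. The `LogField` type here is a simplified, hypothetical version with just `Name` and `Value`, and `nest` is an invented helper.

```go
package main

import (
	"encoding/json"
	"fmt"
	"strings"
)

// LogField is a simplified, hypothetical version of the proposal's LogField.
type LogField struct {
	Name  string
	Value string
}

// nest expands dotted field names into nested maps. If a name collides with
// an earlier leaf (for example "http" followed by "http.method"), the leaf is
// replaced by a nested object.
func nest(fields []LogField) map[string]any {
	root := map[string]any{}
	for _, f := range fields {
		parts := strings.Split(f.Name, ".")
		cur := root
		// Walk or create one map per path segment except the last.
		for _, p := range parts[:len(parts)-1] {
			next, ok := cur[p].(map[string]any)
			if !ok {
				next = map[string]any{}
				cur[p] = next
			}
			cur = next
		}
		cur[parts[len(parts)-1]] = f.Value
	}
	return root
}

func main() {
	out, err := json.Marshal(nest([]LogField{
		{Name: "http.response.status_code", Value: "503"},
		{Name: "url.path", Value: "/checkout"},
	}))
	if err != nil {
		panic(err)
	}
	// Prints {"http":{"response":{"status_code":"503"}},"url":{"path":"/checkout"}}
	fmt.Println(string(out))
}
```

A convention like this covers simple nesting but not arrays, typed values, or per-implementation layouts, which is part of why a format string or extension point may still be the more practical option.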

Comment thread geps/gep-4768/index.md
}
```

## Comparison with Prior Art
Member

Could you also include the Airlock Microgateway Telemetry CR and the NGINX Tracing API as prior art?

Comment thread geps/gep-4768/index.md
name: my-gateway

# 1. Tracing Configuration
tracing:
Contributor

should enabling of tracing imply that 'traceparent' header propagation also happens?

Labels

cncf-cla: yes - Indicates the PR's author has signed the CNCF CLA.
do-not-merge/hold - Indicates that a PR should not merge because someone has issued a /hold command.
kind/gep - PRs related to Gateway Enhancement Proposal (GEP)
ok-to-test - Indicates a non-member PR verified by an org member that is safe to test.
release-note-none - Denotes a PR that doesn't merit a release note.
size/L - Denotes a PR that changes 100-499 lines, ignoring generated files.

8 participants