Skip to content
Open
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
385 changes: 384 additions & 1 deletion geps/gep-4768/index.md
Original file line number Diff line number Diff line change
Expand Up @@ -61,9 +61,392 @@ To mitigate the challenge of complex merging semantics, this GEP restricts confi
- **Access Logs**: Filtering for smart logging (e.g., only log 5xx errors or high latency), multi-protocol support, and log format customization (including field selection).
- **Export Configuration**: Supporting TLS connections to telemetry collectors and the ability to inject custom headers (e.g., `Authorization`) into telemetry requests.

## Request Flow
### Request Flow

* A platform operator creates a `TelemetryPolicy` resource targeting a `Gateway`.
* The Gateway API implementation reconciles this resource and configures the underlying data plane.
* The data plane extracts the specified signals and exports them to the telemetry infrastructure.

### The `TelemetryPolicy` Specification

We propose the `TelemetryPolicy` as a direct policy attachment in the `gateway.networking.k8s.io` API group. See [GEP-713](https://gateway-api.sigs.k8s.io/geps/gep-713/#classes-of-policies) for more information on direct attachment.

The following is an example that demonstrates the structure of the `TelemetryPolicy`.

```yaml
apiVersion: agentic.networking.x-k8s.io/v1alpha1
kind: TelemetryPolicy
metadata:
name: standard-telemetry
namespace: prod-ns
spec:
# GEP-713 Attachment
targetRefs:
- group: gateway.networking.k8s.io
kind: Gateway
name: my-gateway

# 1. Tracing Configuration
tracing:
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

should enabling of tracing imply that 'traceparent' header propagation also happens?

mode: "On"
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the mode here is kind of a weird value. Usually we avoid to have bools or "wana be bools" (On/Off, True/False) and instead make them more explicit on Behavior (Disabled, PartiallyEnabled, FullyEnabled)

If we can only have On/Off, wouldn't it make more sense to say "if tracing is null, tracing should be disabled" instead?

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(btw this applies to every "Mode" defined here)

provider:
endpoint: "otel-collector.monitoring.svc:4317"
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

should probably use backendRef style API to align with other APIs

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

samplingRate:
numerator: 5 # Represents 5/100 (5%) because denominator defaults to 100
parentBasedSampling:
mode: "On"
samplingRate:
numerator: 50 # Represents 50/100 (50%)
customAttributes:
- name: "env"
type: Literal
literalValue: "production"
- name: "mcp_task_name"
type: Metadata
metadataKey: "my.custom.filter.mcp_task_name"

# 2. Metrics Configuration
metrics:
mode: "On"
overrides:
- name: "example.com/http/request_count"
type: Counter
attributes: # Inject custom attributes/labels
- name: "x-model-id"
type: Header
headerName: "X-Model-Id"
- name: "mcp_task_name"
type: Metadata
metadataKey: "my.custom.filter.mcp_task_name"
- name: "environment"
type: Literal
literalValue: "production"

# 3. Access Logs Configuration
accessLogs:
mode: "Off" # Explicitly disabled while keeping the configuration intact
matches: "response.code >= 500" # Conditional logging, CEL filtering for errors
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

honest question here: is CEL matching for logs something usually supported by OTEL exporters, or is this something specific from Istio that we are carrying?

fields: # Configure specific fields to include, indicating their source
- name: "start_time"
type: Standard
standardValue: "RequestStartTime"
- name: "response_code"
type: Standard
standardValue: "ResponseCode"
- name: "x-token-usage"
type: Header
headerName: "X-Token-Usage"
- name: "mcp_task_name"
type: Metadata
metadataKey: "my.custom.filter.mcp_task_name"
```

#### Detailed Resource Description

The following are the Go structs modeling the proposed specification.

```Go
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
```Go
```go

the case here makes difference for markdown rendering :)

// TelemetryPolicy defines a direct policy attachment to configure
// observability signals for Gateways.
type TelemetryPolicy struct {
metav1.TypeMeta `json:",inline"`
metav1.ObjectMeta `json:"metadata,omitempty"`

Spec TelemetryPolicySpec `json:"spec"`
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

to avoid previous mistakes, can we make it explicit this is required?

Suggested change
Spec TelemetryPolicySpec `json:"spec"`
// Spec defines the desired state of TLSRoute.
// +required
Spec TelemetryPolicySpec `json:"spec"`

Copy link
Copy Markdown
Member

@LiorLieberman LiorLieberman May 19, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: you mean "the desired state of TelemetryPolicy" probably


// status defines the observed state of TelemetryPolicy.
// +optional
Status TelemetryPolicyStatus `json:"status,omitempty"`
}

type TelemetryPolicySpec struct {
// Identifies the target gateways to which this policy attaches (GEP-713).
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we must be specific here that at least one target is required

TargetRefs []NamespacedPolicyTargetReference `json:"targetRefs"`

// Configuration for distributed tracing options.
Copy link
Copy Markdown
Member

@rikatz rikatz May 18, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The comment must be more complete. Think on users using this API. We must write here what is tracing, what Gateway users will get from it.

This applies for every comment on every struct, I would recommend taking a look into https://gateway-api.sigs.k8s.io/reference/api-spec/main/spec/#backendtlspolicy

Each API field must contain:

  • What is the field
  • What happens when a user configure that field (or in the absence of that field)
  • Are there specific conflicting options?
  • Anything else that may help users to use

Also, given this is a field documentation for implementations, you must consider:

  • What an implementation must know when considering that field? (does setting it has any caveat? Does the implementation needs to consider some condition status?)
  • What the permutation of possible values can cause that the implementation must care? (think on conformance tests)

Think on https://gateway-api.sigs.k8s.io/guides/api-design/. The documentation must be separated between user facing and implementation facing (when writing implementation/developers facing, use the tag <gateway:util:excludeFromCRD></gateway:util:excludeFromCRD>)

Also, when defining fields you must define what kind of support you expect for it. Eg.:

  • core - means that the field is required to be recognized by the implementation for this feature to work. eg.: targetRefs
  • extended - means that the feature/field should be implemented by implementations that claim its support, but are not mandatory for the whole feature. eg.: a telemetryPolicy may support tracing, or metrics, or accessLogs, but doesn't require all of them to work
  • implementationSpecific - We don't do conformance test, and a user using it knows they will not have portability between implementations. We should avoid this kind of support, and use only if strictly required on new apis. An example is my question about the cel matching for logs, as I don't think every implementation supports it

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1 defining support levels at this stage is very important. Also worth keeping in mind, everything in core and extended should be covered by conformance tests, otherwise it risks becoming implementation-specific anyway.

Tracing *TracingConfig `json:"tracing,omitempty"`

// Configuration for metric generation and exports.
Metrics *MetricsConfig `json:"metrics,omitempty"`

// Configuration for access log generation.
AccessLogs *AccessLogsConfig `json:"accessLogs,omitempty"`
}

// TelemetryMode defines the enablement state of a telemetry signal.
type TelemetryMode string

const (
// TelemetryModeOn explicitly enables the telemetry signal.
TelemetryModeOn TelemetryMode = "On"
// TelemetryModeOff explicitly disables the telemetry signal.
TelemetryModeOff TelemetryMode = "Off"
)

// --- Tracing Types ---

type TracingConfig struct {
// Mode explicitly controls if tracing is enabled. Valid values are "On" or "Off".
// +kubebuilder:validation:Enum=On;Off
// +kubebuilder:default=On
Mode TelemetryMode `json:"mode,omitempty"`

// Specifies the tracing backend. Includes type (e.g., "OTLP") and endpoint.
Provider *TracingProvider `json:"provider,omitempty"`

// The base sampling probability for traces.
SamplingRate *Fraction `json:"samplingRate,omitempty"`

// Configures whether to respect the sampling decision of the parent span.
ParentBasedSampling *ParentBasedSampling `json:"parentBasedSampling,omitempty"`

// Allows appending custom tags/attributes to spans.
CustomAttributes []CustomAttribute `json:"customAttributes,omitempty"`
}

type TracingProvider struct {
Endpoint string `json:"endpoint,omitempty"`
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How about TLS options and custom headers, per #4775 (comment)

}

type Fraction struct {
Numerator int32 `json:"numerator"`

// +kubebuilder:default=100
// +kubebuilder:validation:Minimum=1
Denominator int32 `json:"denominator,omitempty"` // Allows e.g., 1 / 10000 for 0.01%
}

type ParentBasedSampling struct {
// Mode explicitly controls if parent-based sampling is enabled. Valid values are "On" or "Off".
// +kubebuilder:validation:Enum=On;Off
// +kubebuilder:default=On
Mode TelemetryMode `json:"mode,omitempty"`

// The sampling rate to apply when the parent span decision is used.
SamplingRate *Fraction `json:"samplingRate,omitempty"`
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why is this sampling rate needed? If we're already inheriting from the parent, isn't that making the decision on whether or not to sample?

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is the samplingRate in this context treated as a fallback strategy when no sampling decision has been propagated from the parent?

}

// CustomAttributeType defines the source of a trace attribute's value.
type CustomAttributeType string

const (
// CustomAttributeTypeHeader extracts the value from an HTTP header.
CustomAttributeTypeHeader CustomAttributeType = "Header"
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can we be explicit of what can be added or not? I would expect that we are careful with things like "Authorization" headers and other sensitive headers.

// CustomAttributeTypeMetadata extracts the value from proxy metadata or context.
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

so, Metadata is a bit of a confusing concept in this context. I am worried we are making it very "Envoy" centric (unless @sjberman tells me he gets the concept of "metadata" here out of Nginx)

As we are leaning towards OTEL, I think it would be good to be more explicit on what each attribute means (eg.: https://opentelemetry.io/docs/specs/semconv/registry/attributes/http/ ) and use some OTEL references here

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No concept of metadata in nginx.

CustomAttributeTypeMetadata CustomAttributeType = "Metadata"
// CustomAttributeTypeLiteral provides a static, user-defined string value.
CustomAttributeTypeLiteral CustomAttributeType = "Literal"
)

type CustomAttribute struct {
// Name is the key of the attribute as it will appear in the trace span.
Name string `json:"name"`

// Type specifies where the attribute value comes from.
// Valid values are "Header", "Metadata", or "Literal".
// +kubebuilder:validation:Enum=Header;Metadata;Literal
Type CustomAttributeType `json:"type"`

// HeaderName specifies the HTTP header to extract the value from.
// This is required if Type is "Header".
HeaderName *string `json:"headerName,omitempty"`

// MetadataKey specifies the proxy/context metadata key to extract the value from.
// This is required if Type is "Metadata".
MetadataKey *string `json:"metadataKey,omitempty"`

// LiteralValue specifies a static string value to attach.
// This is required if Type is "Literal".
LiteralValue *string `json:"literalValue,omitempty"`
}

// --- Metrics Types ---

type MetricsConfig struct {
// Mode explicitly controls if metric generation is enabled. Valid values are "On" or "Off".
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What does it mean to turn off metrics generation? It does not seem reasonable to entirely disable all metrics for an application

// +kubebuilder:validation:Enum=On;Off
// +kubebuilder:default=On
Mode TelemetryMode `json:"mode,omitempty"`
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For all the modes, what happens if I do not set the field? is it explicitly OFF or can it be "default for the proxy"?

What if I do not attach a TelemetryPolicy at all?


// List of configurations to customize specific metric families.
Overrides []MetricOverride `json:"overrides,omitempty"`
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What does this actually do? Is it customizing an existing metric? adding a new one?

}

type MetricOverride struct {
// The metric name to override (e.g., "http_requests_total" or "gateway.networking.k8s.io/http/request_count").
Name string `json:"name"`

// Type of the metric (e.g., "Counter", "Histogram").
Type string `json:"type,omitempty"`

// Defines custom attributes to attach to the metric.
// These are appended to the standard labels emitted by the proxy.
Attributes []MetricAttribute `json:"attributes,omitempty"`
}

// MetricAttributeType defines the source of a metric attribute's value.
type MetricAttributeType string

const (
// MetricAttributeTypeHeader extracts the value from an HTTP header.
MetricAttributeTypeHeader MetricAttributeType = "Header"
// MetricAttributeTypeMetadata extracts the value from proxy metadata or context.
MetricAttributeTypeMetadata MetricAttributeType = "Metadata"
// MetricAttributeTypeLiteral provides a static, user-defined string value.
MetricAttributeTypeLiteral MetricAttributeType = "Literal"
)

type MetricAttribute struct {
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So we can add fields only right, not remove them?

// Name is the key of the attribute as it will appear in the metric.
Name string `json:"name"`

// Type specifies where the attribute value comes from.
// Valid values are "Header", "Metadata", or "Literal".
// +kubebuilder:validation:Enum=Header;Metadata;Literal
Type MetricAttributeType `json:"type"`

// HeaderName specifies the HTTP header to extract the value from.
// This is required if Type is "Header".
HeaderName *string `json:"headerName,omitempty"`

// MetadataKey specifies the proxy/context metadata key to extract the value from.
// This is required if Type is "Metadata".
MetadataKey *string `json:"metadataKey,omitempty"`

// LiteralValue specifies a static string value to attach.
// This is required if Type is "Literal".
LiteralValue *string `json:"literalValue,omitempty"`
}

// --- Access Logs Types ---

type AccessLogsConfig struct {
// Mode explicitly controls if access logging is enabled. Valid values are "On" or "Off".
// +kubebuilder:validation:Enum=On;Off
// +kubebuilder:default=On
Mode TelemetryMode `json:"mode,omitempty"`

// CEL expression for advanced filtering (e.g., matching response codes, headers).
Matches string `json:"matches,omitempty"`

// A list of specific fields or headers to include in the logs.
Fields []string `json:"fields,omitempty"`

// A list of specific fields to include in the logs, specifying their source.
Fields []LogField `json:"fields,omitempty"`
Comment on lines +332 to +336
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

dupe field?

}

// LogFieldType defines the source of a log field's value.
type LogFieldType string

const (
// LogFieldTypeHeader extracts the value from an HTTP header.
LogFieldTypeHeader LogFieldType = "Header"
// LogFieldTypeMetadata extracts the value from proxy metadata or context.
LogFieldTypeMetadata LogFieldType = "Metadata"
// LogFieldTypeLiteral provides a static, user-defined string value.
LogFieldTypeLiteral LogFieldType = "Literal"
// LogFieldTypeStandard extracts a standard proxy log value (e.g., duration, start time).
LogFieldTypeStandard LogFieldType = "Standard"
)

type LogField struct {
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's not clear to me what the fields LogField in the log represents, is it a JSON field for specifying additional key-value pairs?

How do you plan to support customization of nested JSON structures? These are commonly used in access logs, e.g., in the Airlock Microgateway we use the Elastic Common Schema (ECS).

Since access logs tend to be highly implementation-specific, IMO a format string or an extension point is probably the best approach here.

// Name is the key/name of the field as it will appear in the access log output.
Name string `json:"name"`

// Type specifies where the field value comes from.
// Valid values are "Header", "Metadata", "Literal", or "Standard".
// +kubebuilder:validation:Enum=Header;Metadata;Literal;Standard
Type LogFieldType `json:"type"`

// HeaderName specifies the HTTP header to extract the value from.
// This is required if Type is "Header".
HeaderName *string `json:"headerName,omitempty"`

// MetadataKey specifies the proxy/context metadata key to extract the value from.
// This is required if Type is "Metadata".
MetadataKey *string `json:"metadataKey,omitempty"`

// LiteralValue specifies a static string value to attach.
// This is required if Type is "Literal".
LiteralValue *string `json:"literalValue,omitempty"`

// StandardValue specifies a standard log property (e.g., "RequestStartTime", "Duration").
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Are these defined? or implementation specific?

// This is required if Type is "Standard".
StandardValue *string `json:"standardValue,omitempty"`
}

// --- Policy Status ---

// TelemetryPolicyStatus defines the observed state of TelemetryPolicy.
type TelemetryPolicyStatus struct {
// For Policy Status API conventions, see:
// https://gateway-api.sigs.k8s.io/geps/gep-713/#the-status-stanza-of-policy-objects
//
// Ancestors is a list of ancestor resources (usually Backend) that are
// associated with the policy, and the status of the policy with respect to
// each ancestor. When this policy attaches to a parent, the controller that
// manages the parent and the ancestors MUST add an entry to this list when
// the controller first sees the policy and SHOULD update the entry as
// appropriate when the relevant ancestor is modified.
//
// Note that choosing the relevant ancestor is left to the Policy designers;
// an important part of Policy design is designing the right object level at
// which to namespace this status.
//
// Note also that implementations MUST ONLY populate ancestor status for
// the Ancestor resources they are responsible for. Implementations MUST
// use the ControllerName field to uniquely identify the entries in this list
// that they are responsible for.
//
// Note that to achieve this, the list of PolicyAncestorStatus structs
// MUST be treated as a map with a composite key, made up of the AncestorRef
// and ControllerName fields combined.
//
// A maximum of 16 ancestors will be represented in this list. An empty list
// means the Policy is not relevant for any ancestors.
//
// If this slice is full, implementations MUST NOT add further entries.
// Instead they MUST consider the policy unimplementable and signal that
// on any related resources such as the ancestor that would be referenced
// here.
//
// +required
// +listType=atomic
// +kubebuilder:validation:MaxItems=16
Ancestors []PolicyAncestorStatus `json:"ancestors"`
}
```

## Comparison with Prior Art
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This API seems to basically just be directly Istio's API. It doesn't seem like we really explored prior art beyond that even though there was substantial feedback on the original agentic-networking exploring alternative approaches. Can we get some of that context here? Even if its 'ideas considered but rejected' its still useful

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you also include the Airlock Microgateway Telemetry CR and the NGINX Tracing API as prior art?


### Istio

[Istio](https://istio.io/)'s `Telemetry` API is the most direct prior art that inspired this proposal. It allows configuring observability at the mesh, namespace, and workload level.

* **Metrics**: Istio allows users to enable/disable specific metrics, add custom dimensions, and configure providers.
* **Logs**: Istio supports access logging configurations with CEL-like expressions for advanced filtering.
* **Traces**: Istio supports probabilistic sampling, context propagation, and custom span tags.
* **Customization**: For advanced telemetry use-cases not natively covered by the `Telemetry` API, Istio users can fall back to using `EnvoyFilter` resources. While highly flexible, `EnvoyFilter` requires deep knowledge of Envoy's internal xDS API. This is tightly coupled to the data plane implementation and can be brittle across version upgrades.
* **Comparison**: The proposed `TelemetryPolicy` adapts Istio's powerful intent-based capabilities to the standardized Gateway API attachment model.

### Envoy Gateway

[Envoy Gateway](https://gateway.envoyproxy.io/) configures observability through two distinct custom resources: `EnvoyGateway` for the control plane and `EnvoyProxy` for the underlying data plane proxies.

* **Metrics**: Envoy Gateway allows configuring Prometheus and OpenTelemetry sinks for both the control plane (using `EnvoyGateway` CRD) and the data plane proxies (using the `EnvoyProxy` CRD).
* **Logs**: Proxy access logs are configured via the `EnvoyProxy` resource. It supports exporting to file, OTLP, or gRPC Access Log Service (ALS) sinks. It uses CEL expressions for smart filtering (e.g., matching specific headers), and allows applying log configurations at the Route or Listener level.
* **Tracing**: Tracing is configured in the `EnvoyProxy` resource. It allows configuring sampling and supports appending custom tags derived from literals, environment variables, or request headers.
* **Customization**: For advanced telemetry use-cases not covered natively, users can fall back to the `EnvoyPatchPolicy` API to mutate the underlying xDS configuration using JSON Patch semantics. This is similar to Istio's `EnvoyFilter`.
* **Comparison**: While Envoy Gateway provides a robust, native telemetry configuration, it is tightly coupled to infrastructure-oriented CRDs. The proposed `TelemetryPolicy` allows users to configure telemetry behaviors using a portable `targetRef` model, without binding their observability intent to an Envoy-specific schema.

### Kuadrant

[Kuadrant](https://kuadrant.io/) provides observability for API management features like rate limiting and authentication. It is configured through a mix of its own custom resources and the underlying gateway's APIs.

* **Metrics**: Kuadrant enables metrics via the `Kuadrant` CR. It also introduces its own `TelemetryPolicy` API (extensions.kuadrant.io/v1alpha1) to add custom dimensions to metrics.
* **Logs**: For proxy access logging, Kuadrant relies on the underlying gateway provider (e.g., Istio's Telemetry API). However, it configures request correlation across its own components (Authorino, Limitador, and Wasm-shim) by specifying HTTP header identifiers in the `Kuadrant` CR.
* **Tracing**: Tracing is configured centrally via the `Kuadrant` CR. It exports OpenTelemetry spans for both the control plane and data plane components. It supports global trace filtering levels to control the verbosity of exported spans.
* **Customization**: To make low-level, custom modifications to the data plane configuration that are not supported by Kuadrant's native APIs, users can bypass Kuadrant and directly use the underlying gateway's mechanisms.
* **Comparison**: While Kuadrant provides powerful, identity-aware telemetry (like token tracking per user), its configuration is fragmented across the `Kuadrant` CR, components specific CRDs, its custom extension `TelemetryPolicy`, and the underlying gateway's native APIs. The proposed `TelemetryPolicy` aims to unify these intent-based capabilities into a single, provider-agnostic resource.