diff --git a/geps/gep-4768/index.md b/geps/gep-4768/index.md index 644c312130..440661aee3 100644 --- a/geps/gep-4768/index.md +++ b/geps/gep-4768/index.md @@ -61,9 +61,392 @@ To mitigate the challenge of complex merging semantics, this GEP restricts confi - **Access Logs**: Filtering for smart logging (e.g., only log 5xx errors or high latency), multi-protocol support, and log format customization (including field selection). - **Export Configuration**: Supporting TLS connections to telemetry collectors and the ability to inject custom headers (e.g., `Authorization`) into telemetry requests. -## Request Flow +### Request Flow * A platform operator creates a `TelemetryPolicy` resource targeting a `Gateway`. * The Gateway API implementation reconciles this resource and configures the underlying data plane. * The data plane extracts the specified signals and exports them to the telemetry infrastructure. +### The `TelemetryPolicy` Specification + +We propose the `TelemetryPolicy` as a direct policy attachment in the `gateway.networking.k8s.io` API group. See [GEP-713](https://gateway-api.sigs.k8s.io/geps/gep-713/#classes-of-policies) for more information on direct attachment. + +The following is an example that demonstrates the structure of the `TelemetryPolicy`. + +```yaml +apiVersion: agentic.networking.x-k8s.io/v1alpha1 +kind: TelemetryPolicy +metadata: + name: standard-telemetry + namespace: prod-ns +spec: + # GEP-713 Attachment + targetRefs: + - group: gateway.networking.k8s.io + kind: Gateway + name: my-gateway + + # 1. Tracing Configuration + tracing: + mode: "On" + provider: + endpoint: "otel-collector.monitoring.svc:4317" + samplingRate: + numerator: 5 # Represents 5/100 (5%) because denominator defaults to 100 + parentBasedSampling: + mode: "On" + samplingRate: + numerator: 50 # Represents 50/100 (50%) + customAttributes: + - name: "env" + type: Literal + literalValue: "production" + - name: "mcp_task_name" + type: Metadata + metadataKey: "my.custom.filter.mcp_task_name" + + # 2. Metrics Configuration + metrics: + mode: "On" + overrides: + - name: "example.com/http/request_count" + type: Counter + attributes: # Inject custom attributes/labels + - name: "x-model-id" + type: Header + headerName: "X-Model-Id" + - name: "mcp_task_name" + type: Metadata + metadataKey: "my.custom.filter.mcp_task_name" + - name: "environment" + type: Literal + literalValue: "production" + + # 3. Access Logs Configuration + accessLogs: + mode: "Off" # Explicitly disabled while keeping the configuration intact + matches: "response.code >= 500" # Conditional logging, CEL filtering for errors + fields: # Configure specific fields to include, indicating their source + - name: "start_time" + type: Standard + standardValue: "RequestStartTime" + - name: "response_code" + type: Standard + standardValue: "ResponseCode" + - name: "x-token-usage" + type: Header + headerName: "X-Token-Usage" + - name: "mcp_task_name" + type: Metadata + metadataKey: "my.custom.filter.mcp_task_name" +``` + +#### Detailed Resource Description + +The following are the Go structs modeling the proposed specification. + +```Go +// TelemetryPolicy defines a direct policy attachment to configure +// observability signals for Gateways. +type TelemetryPolicy struct { + metav1.TypeMeta `json:",inline"` + metav1.ObjectMeta `json:"metadata,omitempty"` + + Spec TelemetryPolicySpec `json:"spec"` + + // status defines the observed state of TelemetryPolicy. + // +optional + Status TelemetryPolicyStatus `json:"status,omitempty"` +} + +type TelemetryPolicySpec struct { + // Identifies the target gateways to which this policy attaches (GEP-713). + TargetRefs []NamespacedPolicyTargetReference `json:"targetRefs"` + + // Configuration for distributed tracing options. + Tracing *TracingConfig `json:"tracing,omitempty"` + + // Configuration for metric generation and exports. + Metrics *MetricsConfig `json:"metrics,omitempty"` + + // Configuration for access log generation. + AccessLogs *AccessLogsConfig `json:"accessLogs,omitempty"` +} + +// TelemetryMode defines the enablement state of a telemetry signal. +type TelemetryMode string + +const ( + // TelemetryModeOn explicitly enables the telemetry signal. + TelemetryModeOn TelemetryMode = "On" + // TelemetryModeOff explicitly disables the telemetry signal. + TelemetryModeOff TelemetryMode = "Off" +) + +// --- Tracing Types --- + +type TracingConfig struct { + // Mode explicitly controls if tracing is enabled. Valid values are "On" or "Off". + // +kubebuilder:validation:Enum=On;Off + // +kubebuilder:default=On + Mode TelemetryMode `json:"mode,omitempty"` + + // Specifies the tracing backend. Includes type (e.g., "OTLP") and endpoint. + Provider *TracingProvider `json:"provider,omitempty"` + + // The base sampling probability for traces. + SamplingRate *Fraction `json:"samplingRate,omitempty"` + + // Configures whether to respect the sampling decision of the parent span. + ParentBasedSampling *ParentBasedSampling `json:"parentBasedSampling,omitempty"` + + // Allows appending custom tags/attributes to spans. + CustomAttributes []CustomAttribute `json:"customAttributes,omitempty"` +} + +type TracingProvider struct { + Endpoint string `json:"endpoint,omitempty"` +} + +type Fraction struct { + Numerator int32 `json:"numerator"` + + // +kubebuilder:default=100 + // +kubebuilder:validation:Minimum=1 + Denominator int32 `json:"denominator,omitempty"` // Allows e.g., 1 / 10000 for 0.01% +} + +type ParentBasedSampling struct { + // Mode explicitly controls if parent-based sampling is enabled. Valid values are "On" or "Off". + // +kubebuilder:validation:Enum=On;Off + // +kubebuilder:default=On + Mode TelemetryMode `json:"mode,omitempty"` + + // The sampling rate to apply when the parent span decision is used. + SamplingRate *Fraction `json:"samplingRate,omitempty"` +} + +// CustomAttributeType defines the source of a trace attribute's value. +type CustomAttributeType string + +const ( + // CustomAttributeTypeHeader extracts the value from an HTTP header. + CustomAttributeTypeHeader CustomAttributeType = "Header" + // CustomAttributeTypeMetadata extracts the value from proxy metadata or context. + CustomAttributeTypeMetadata CustomAttributeType = "Metadata" + // CustomAttributeTypeLiteral provides a static, user-defined string value. + CustomAttributeTypeLiteral CustomAttributeType = "Literal" +) + +type CustomAttribute struct { + // Name is the key of the attribute as it will appear in the trace span. + Name string `json:"name"` + + // Type specifies where the attribute value comes from. + // Valid values are "Header", "Metadata", or "Literal". + // +kubebuilder:validation:Enum=Header;Metadata;Literal + Type CustomAttributeType `json:"type"` + + // HeaderName specifies the HTTP header to extract the value from. + // This is required if Type is "Header". + HeaderName *string `json:"headerName,omitempty"` + + // MetadataKey specifies the proxy/context metadata key to extract the value from. + // This is required if Type is "Metadata". + MetadataKey *string `json:"metadataKey,omitempty"` + + // LiteralValue specifies a static string value to attach. + // This is required if Type is "Literal". + LiteralValue *string `json:"literalValue,omitempty"` +} + +// --- Metrics Types --- + +type MetricsConfig struct { + // Mode explicitly controls if metric generation is enabled. Valid values are "On" or "Off". + // +kubebuilder:validation:Enum=On;Off + // +kubebuilder:default=On + Mode TelemetryMode `json:"mode,omitempty"` + + // List of configurations to customize specific metric families. + Overrides []MetricOverride `json:"overrides,omitempty"` +} + +type MetricOverride struct { + // The metric name to override (e.g., "http_requests_total" or "gateway.networking.k8s.io/http/request_count"). + Name string `json:"name"` + + // Type of the metric (e.g., "Counter", "Histogram"). + Type string `json:"type,omitempty"` + + // Defines custom attributes to attach to the metric. + // These are appended to the standard labels emitted by the proxy. + Attributes []MetricAttribute `json:"attributes,omitempty"` +} + +// MetricAttributeType defines the source of a metric attribute's value. +type MetricAttributeType string + +const ( + // MetricAttributeTypeHeader extracts the value from an HTTP header. + MetricAttributeTypeHeader MetricAttributeType = "Header" + // MetricAttributeTypeMetadata extracts the value from proxy metadata or context. + MetricAttributeTypeMetadata MetricAttributeType = "Metadata" + // MetricAttributeTypeLiteral provides a static, user-defined string value. + MetricAttributeTypeLiteral MetricAttributeType = "Literal" +) + +type MetricAttribute struct { + // Name is the key of the attribute as it will appear in the metric. + Name string `json:"name"` + + // Type specifies where the attribute value comes from. + // Valid values are "Header", "Metadata", or "Literal". + // +kubebuilder:validation:Enum=Header;Metadata;Literal + Type MetricAttributeType `json:"type"` + + // HeaderName specifies the HTTP header to extract the value from. + // This is required if Type is "Header". + HeaderName *string `json:"headerName,omitempty"` + + // MetadataKey specifies the proxy/context metadata key to extract the value from. + // This is required if Type is "Metadata". + MetadataKey *string `json:"metadataKey,omitempty"` + + // LiteralValue specifies a static string value to attach. + // This is required if Type is "Literal". + LiteralValue *string `json:"literalValue,omitempty"` +} + +// --- Access Logs Types --- + +type AccessLogsConfig struct { + // Mode explicitly controls if access logging is enabled. Valid values are "On" or "Off". + // +kubebuilder:validation:Enum=On;Off + // +kubebuilder:default=On + Mode TelemetryMode `json:"mode,omitempty"` + + // CEL expression for advanced filtering (e.g., matching response codes, headers). + Matches string `json:"matches,omitempty"` + + // A list of specific fields or headers to include in the logs. + Fields []string `json:"fields,omitempty"` + + // A list of specific fields to include in the logs, specifying their source. + Fields []LogField `json:"fields,omitempty"` +} + +// LogFieldType defines the source of a log field's value. +type LogFieldType string + +const ( + // LogFieldTypeHeader extracts the value from an HTTP header. + LogFieldTypeHeader LogFieldType = "Header" + // LogFieldTypeMetadata extracts the value from proxy metadata or context. + LogFieldTypeMetadata LogFieldType = "Metadata" + // LogFieldTypeLiteral provides a static, user-defined string value. + LogFieldTypeLiteral LogFieldType = "Literal" + // LogFieldTypeStandard extracts a standard proxy log value (e.g., duration, start time). + LogFieldTypeStandard LogFieldType = "Standard" +) + +type LogField struct { + // Name is the key/name of the field as it will appear in the access log output. + Name string `json:"name"` + + // Type specifies where the field value comes from. + // Valid values are "Header", "Metadata", "Literal", or "Standard". + // +kubebuilder:validation:Enum=Header;Metadata;Literal;Standard + Type LogFieldType `json:"type"` + + // HeaderName specifies the HTTP header to extract the value from. + // This is required if Type is "Header". + HeaderName *string `json:"headerName,omitempty"` + + // MetadataKey specifies the proxy/context metadata key to extract the value from. + // This is required if Type is "Metadata". + MetadataKey *string `json:"metadataKey,omitempty"` + + // LiteralValue specifies a static string value to attach. + // This is required if Type is "Literal". + LiteralValue *string `json:"literalValue,omitempty"` + + // StandardValue specifies a standard log property (e.g., "RequestStartTime", "Duration"). + // This is required if Type is "Standard". + StandardValue *string `json:"standardValue,omitempty"` +} + +// --- Policy Status --- + +// TelemetryPolicyStatus defines the observed state of TelemetryPolicy. +type TelemetryPolicyStatus struct { + // For Policy Status API conventions, see: + // https://gateway-api.sigs.k8s.io/geps/gep-713/#the-status-stanza-of-policy-objects + // + // Ancestors is a list of ancestor resources (usually Backend) that are + // associated with the policy, and the status of the policy with respect to + // each ancestor. When this policy attaches to a parent, the controller that + // manages the parent and the ancestors MUST add an entry to this list when + // the controller first sees the policy and SHOULD update the entry as + // appropriate when the relevant ancestor is modified. + // + // Note that choosing the relevant ancestor is left to the Policy designers; + // an important part of Policy design is designing the right object level at + // which to namespace this status. + // + // Note also that implementations MUST ONLY populate ancestor status for + // the Ancestor resources they are responsible for. Implementations MUST + // use the ControllerName field to uniquely identify the entries in this list + // that they are responsible for. + // + // Note that to achieve this, the list of PolicyAncestorStatus structs + // MUST be treated as a map with a composite key, made up of the AncestorRef + // and ControllerName fields combined. + // + // A maximum of 16 ancestors will be represented in this list. An empty list + // means the Policy is not relevant for any ancestors. + // + // If this slice is full, implementations MUST NOT add further entries. + // Instead they MUST consider the policy unimplementable and signal that + // on any related resources such as the ancestor that would be referenced + // here. + // + // +required + // +listType=atomic + // +kubebuilder:validation:MaxItems=16 + Ancestors []PolicyAncestorStatus `json:"ancestors"` +} +``` + +## Comparison with Prior Art + +### Istio + +[Istio](https://istio.io/)'s `Telemetry` API is the most direct prior art that inspired this proposal. It allows configuring observability at the mesh, namespace, and workload level. + +* **Metrics**: Istio allows users to enable/disable specific metrics, add custom dimensions, and configure providers. +* **Logs**: Istio supports access logging configurations with CEL-like expressions for advanced filtering. +* **Traces**: Istio supports probabilistic sampling, context propagation, and custom span tags. +* **Customization**: For advanced telemetry use-cases not natively covered by the `Telemetry` API, Istio users can fall back to using `EnvoyFilter` resources. While highly flexible, `EnvoyFilter` requires deep knowledge of Envoy's internal xDS API. This is tightly coupled to the data plane implementation and can be brittle across version upgrades. +* **Comparison**: The proposed `TelemetryPolicy` adapts Istio's powerful intent-based capabilities to the standardized Gateway API attachment model. + +### Envoy Gateway + +[Envoy Gateway](https://gateway.envoyproxy.io/) configures observability through two distinct custom resources: `EnvoyGateway` for the control plane and `EnvoyProxy` for the underlying data plane proxies. + +* **Metrics**: Envoy Gateway allows configuring Prometheus and OpenTelemetry sinks for both the control plane (using `EnvoyGateway` CRD) and the data plane proxies (using the `EnvoyProxy` CRD). +* **Logs**: Proxy access logs are configured via the `EnvoyProxy` resource. It supports exporting to file, OTLP, or gRPC Access Log Service (ALS) sinks. It uses CEL expressions for smart filtering (e.g., matching specific headers), and allows applying log configurations at the Route or Listener level. +* **Tracing**: Tracing is configured in the `EnvoyProxy` resource. It allows configuring sampling and supports appending custom tags derived from literals, environment variables, or request headers. +* **Customization**: For advanced telemetry use-cases not covered natively, users can fall back to the `EnvoyPatchPolicy` API to mutate the underlying xDS configuration using JSON Patch semantics. This is similar to Istio's `EnvoyFilter`. +* **Comparison**: While Envoy Gateway provides a robust, native telemetry configuration, it is tightly coupled to infrastructure-oriented CRDs. The proposed `TelemetryPolicy` allows users to configure telemetry behaviors using a portable `targetRef` model, without binding their observability intent to an Envoy-specific schema. + +### Kuadrant + +[Kuadrant](https://kuadrant.io/) provides observability for API management features like rate limiting and authentication. It is configured through a mix of its own custom resources and the underlying gateway's APIs. + +* **Metrics**: Kuadrant enables metrics via the `Kuadrant` CR. It also introduces its own `TelemetryPolicy` API (extensions.kuadrant.io/v1alpha1) to add custom dimensions to metrics. +* **Logs**: For proxy access logging, Kuadrant relies on the underlying gateway provider (e.g., Istio's Telemetry API). However, it configures request correlation across its own components (Authorino, Limitador, and Wasm-shim) by specifying HTTP header identifiers in the `Kuadrant` CR. +* **Tracing**: Tracing is configured centrally via the `Kuadrant` CR. It exports OpenTelemetry spans for both the control plane and data plane components. It supports global trace filtering levels to control the verbosity of exported spans. +* **Customization**: To make low-level, custom modifications to the data plane configuration that are not supported by Kuadrant's native APIs, users can bypass Kuadrant and directly use the underlying gateway's mechanisms. +* **Comparison**: While Kuadrant provides powerful, identity-aware telemetry (like token tracking per user), its configuration is fragmented across the `Kuadrant` CR, components specific CRDs, its custom extension `TelemetryPolicy`, and the underlying gateway's native APIs. The proposed `TelemetryPolicy` aims to unify these intent-based capabilities into a single, provider-agnostic resource. +