Skip to content
Merged
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
67 changes: 67 additions & 0 deletions geps/gep-4768/index.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,67 @@
# GEP: Standardized Telemetry API

* Issue: #4768
* Status: Provisional

## TLDR

This proposal introduces a standardized, provider-agnostic Telemetry API to configure observability signals (metrics, access logs, and traces) for North/South (Gateway) traffic, addressing the fragmentation caused by vendor-specific CRDs.

## Goals
Comment thread
rikatz marked this conversation as resolved.

* Establish a standardized model to configure provider-agnostic telemetry (metrics, access logs, and traces) for Gateways.

## Non-Goals

1. Defining how the telemetry is exported (sinks/shippers) beyond specifying the provider endpoint and relevant connectivity parameters.
2. Replacing the underlying telemetry infrastructure (OTLP collectors, Prometheus, etc.).
3. Standardizing metrics; this proposal exclusively focuses on the telemetry configuration API.

## Introduction / Overview

This GEP proposes the addition of a standardized, provider-agnostic Telemetry API to the Gateway API project. The proposal aims to define a unified configuration model for the generation and propagation of telemetry signals (i.e., metrics, access logs, distributed traces) for North/South (Gateway) traffic.

The API focuses on providing a consistent way to express observability intent, such as sampling rates for tracing, metric customization, and log filtering, regardless of the underlying data plane implementation.

## Purpose (Why and Who)

### The Fragmentation of Observability

In the current Kubernetes landscape, the "Who, What, Where, and How Long" of network traffic is answered differently depending on the underlying proxy technology. While the Gateway API specification has unified how traffic is routed via `HTTPRoute` and `Gateway`, it has deferred the standardization of how that traffic is observed. This deferral has led to "Observability Lock-in". Platform Engineering teams are forced to learn and manage distinct APIs for each environment. A standardized telemetry API is necessary to decouple the intent of observability from the implementation. Without such standardization it is difficult for platform owners to:

1. Enforce consistent auditing and observability standards across different infrastructure providers.
2. Support emerging workloads like AI Agents, which elevate the criticality of observability due to their autonomous, non-deterministic nature and requirements for specialized signals.

### Who

- **Platform Operators**: Need to ensure uniform observability across all networking infrastructure.
- **Observability Teams**: Responsible for the governance of telemetry data. They need to define and enforce standardized schemas and collection policies across the entire organization.
- **Security/Auditing Teams**: Require a standardized audit trail for all traffic, an increasingly important need with the emergence of autonomous agent actions.
- **Application Developers**: Benefit from consistent metrics and traces for debugging without worrying about the underlying gateway technology.

## API
Comment thread
gkhom marked this conversation as resolved.

### Policy Attachment vs. Inline Configuration
Comment thread
gkhom marked this conversation as resolved.

A key area of discussion for this GEP is whether this should be a standalone Policy Attachment (e.g., `TelemetryPolicy`) or inline configuration within `Gateway` or `HTTPRoute` resources.

This proposal argues that the Policy Attachment model is the most effective approach to meet the stated goals, primarily for two reasons:
Comment thread
gkhom marked this conversation as resolved.

1. **Separation of Concerns**: It allows different personas to manage Gateway infrastructure independently from the configuration of telemetry signals. Telemetry is typically configured by platform, observability, or security engineers rather than application developers. This also implies that HTTPRoute is not the ideal resource to target for the initial API implementation.
2. **Uniformity**: It enables a single policy to be applied uniformly across a set of Gateways, eliminating the need to duplicate complex telemetry configurations across individual resources.

To mitigate the challenge of complex merging semantics, this GEP restricts configuration such that only a single `TelemetryPolicy` can target a specific `Gateway` at any given time. If multiple `TelemetryPolicy` resources target the same object, precedence is determined based on the creation timestamp. This will allow us to start with simple config and iterate based on feedback whether multiple TelemetryPolicies on the same target are needed.

### High-level Considerations:

- **Tracing**: Configuration for OTLP endpoints, sampling rates (probabilistic and parent-based), and custom resource/span attributes.
- **Metrics**: Ability to enable/disable specific metric families and customize dimensions (labels/attributes).
- **Access Logs**: Filtering for smart logging (e.g., only log 5xx errors or high latency), multi-protocol support, and log format customization (including field selection).
- **Export Configuration**: Supporting TLS connections to telemetry collectors and the ability to inject custom headers (e.g., `Authorization`) into telemetry requests.

Comment thread
gkhom marked this conversation as resolved.
## Request Flow

* A platform operator creates a `TelemetryPolicy` resource targeting a `Gateway`.
* The Gateway API implementation reconciles this resource and configures the underlying data plane.
* The data plane extracts the specified signals and exports them to the telemetry infrastructure.

11 changes: 11 additions & 0 deletions geps/gep-4768/metadata.yaml
Original file line number Diff line number Diff line change
@@ -0,0 +1,11 @@
apiVersion: internal.gateway.networking.k8s.io/v1alpha1
kind: GEPDetails
number: 4768
name: Standardized Telemetry API
status: Provisional
authors:
Comment thread
rikatz marked this conversation as resolved.
- gkhom
seeAlso:
- https://github.com/kubernetes-sigs/kube-agentic-networking/pull/69
- https://gateway-api.sigs.k8s.io/geps/gep-713/