gep-4768: Standardized Telemetry API (provisional) #4775
Merged
8 commits
fdd2502 gep-4768: Standardized Telemetry API (provisional) (gkhom)
39bb149 gep-4768: address initial feedback (gkhom)
11de018 Apply suggestions from code review (gkhom)
56dad37 gep-4768: mention custom resource attrs for tracing (gkhom)
a8b6045 gep-4768: addressing feedback (gkhom)
ec9ac0c Apply suggestions from code review (gkhom)
372055c gep-4768: update separation of concerns (gkhom)
ab250ce gep-4768: fix indentation metadata.yaml (gkhom)
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,67 @@ | ||
| # GEP: Standardized Telemetry API | ||
|
|
||
| * Issue: #4768 | ||
| * Status: Provisional | ||
|
|
||
| ## TLDR | ||
|
|
||
| This proposal introduces a standardized, provider-agnostic Telemetry API to configure observability signals (metrics, access logs, and traces) for North/South (Gateway) traffic, addressing the fragmentation caused by vendor-specific CRDs. | ||
|
|
||
| ## Goals | ||
|
|
||
| * Establish a standardized model to configure provider-agnostic telemetry (metrics, access logs, and traces) for Gateways. | ||
|
|
||
| ## Non-Goals | ||
|
|
||
| 1. Defining how the telemetry is exported (sinks/shippers) beyond specifying the provider endpoint and relevant connectivity parameters. | ||
| 2. Replacing the underlying telemetry infrastructure (OTLP collectors, Prometheus, etc.). | ||
| 3. Standardizing metrics; this proposal exclusively focuses on the telemetry configuration API. | ||
|
|
||
| ## Introduction / Overview | ||
|
|
||
| This GEP proposes the addition of a standardized, provider-agnostic Telemetry API to the Gateway API project. The proposal aims to define a unified configuration model for the generation and propagation of telemetry signals (i.e., metrics, access logs, distributed traces) for North/South (Gateway) traffic. | ||
|
|
||
| The API focuses on providing a consistent way to express observability intent, such as sampling rates for tracing, metric customization, and log filtering, regardless of the underlying data plane implementation. | ||
|
|
||
| ## Purpose (Why and Who) | ||
|
|
||
| ### The Fragmentation of Observability | ||
|
|
||
| In the current Kubernetes landscape, the "Who, What, Where, and How Long" of network traffic is answered differently depending on the underlying proxy technology. While the Gateway API specification has unified how traffic is routed via `HTTPRoute` and `Gateway`, it has deferred standardizing how that traffic is observed. This deferral has led to "Observability Lock-in": Platform Engineering teams are forced to learn and manage a distinct API for each environment. A standardized telemetry API is necessary to decouple the intent of observability from its implementation. Without such standardization, it is difficult for platform owners to: | ||
|
|
||
| 1. Enforce consistent auditing and observability standards across different infrastructure providers. | ||
| 2. Support emerging workloads like AI Agents, which elevate the criticality of observability due to their autonomous, non-deterministic nature and requirements for specialized signals. | ||
|
|
||
| ### Who | ||
|
|
||
| - **Platform Operators**: Need to ensure uniform observability across all networking infrastructure. | ||
| - **Observability Teams**: Responsible for the governance of telemetry data. They need to define and enforce standardized schemas and collection policies across the entire organization. | ||
| - **Security/Auditing Teams**: Require a standardized audit trail for all traffic, an increasingly important need with the emergence of autonomous agent actions. | ||
| - **Application Developers**: Benefit from consistent metrics and traces for debugging without worrying about the underlying gateway technology. | ||
|
|
||
| ## API | ||
|
|
||
|
|
||
| ### Policy Attachment vs. Inline Configuration | ||
|
|
||
|
|
||
| A key area of discussion for this GEP is whether this should be a standalone Policy Attachment (e.g., `TelemetryPolicy`) or inline configuration within `Gateway` or `HTTPRoute` resources. | ||
|
|
||
| This proposal argues that the Policy Attachment model is the most effective approach to meet the stated goals, primarily for two reasons: | ||
|
|
||
|
|
||
| 1. **Separation of Concerns**: It allows different personas to manage Gateway infrastructure independently from the configuration of telemetry signals. Telemetry is typically configured by platform, observability, or security engineers rather than by application developers. This also implies that `HTTPRoute` is not the ideal resource to target in the initial API implementation. | ||
| 2. **Uniformity**: It enables a single policy to be applied uniformly across a set of Gateways, eliminating the need to duplicate complex telemetry configurations across individual resources. | ||
|
|
||
| To avoid complex merging semantics, this GEP restricts configuration so that only a single `TelemetryPolicy` takes effect for a given `Gateway` at any time. If multiple `TelemetryPolicy` resources target the same object, precedence is determined by creation timestamp. This keeps the initial configuration model simple and leaves room to iterate, based on feedback, on whether multiple `TelemetryPolicy` resources per target are needed. | ||
|
|
||
| ### High-level Considerations | ||
|
|
||
| - **Tracing**: Configuration for OTLP endpoints, sampling rates (probabilistic and parent-based), and custom resource/span attributes. | ||
| - **Metrics**: Ability to enable/disable specific metric families and customize dimensions (labels/attributes). | ||
| - **Access Logs**: Filtering for smart logging (e.g., logging only 5xx errors or high-latency requests), multi-protocol support, and log format customization (including field selection). | ||
| - **Export Configuration**: Supporting TLS connections to telemetry collectors and the ability to inject custom headers (e.g., `Authorization`) into telemetry requests. | ||
|
|
||
|
|
||
| ## Request Flow | ||
|
|
||
| * A platform operator creates a `TelemetryPolicy` resource targeting a `Gateway`. | ||
| * The Gateway API implementation reconciles this resource and configures the underlying data plane. | ||
| * The data plane extracts the specified signals and exports them to the telemetry infrastructure. | ||
|
|
||
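The single-policy rule described under "Policy Attachment vs. Inline Configuration" can be illustrated with a minimal sketch. The GEP says only that precedence is "determined based on the creation timestamp"; it does not say which direction wins. The sketch below assumes the oldest policy wins and breaks ties by `namespace/name`. The `TelemetryPolicy` dataclass, its field names, and `effective_policy` are hypothetical stand-ins for illustration, not part of the proposed API.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Iterable, Optional


@dataclass(frozen=True)
class TelemetryPolicy:
    """Hypothetical, minimal stand-in for a TelemetryPolicy object."""
    namespace: str
    name: str
    target_gateway: str            # name of the targeted Gateway (illustrative field)
    creation_timestamp: datetime   # mirrors metadata.creationTimestamp


def effective_policy(
    policies: Iterable[TelemetryPolicy], gateway: str
) -> Optional[TelemetryPolicy]:
    """Return the single policy in effect for `gateway`, or None.

    ASSUMPTION: the oldest policy wins; ties are broken by namespace, then name.
    The GEP does not specify the direction of precedence.
    """
    candidates = [p for p in policies if p.target_gateway == gateway]
    if not candidates:
        return None
    return min(
        candidates,
        key=lambda p: (p.creation_timestamp, p.namespace, p.name),
    )


# Two policies compete for "prod-gw"; a third targets a different Gateway.
policies = [
    TelemetryPolicy("infra", "tracing-v2", "prod-gw",
                    datetime(2025, 3, 1, tzinfo=timezone.utc)),
    TelemetryPolicy("infra", "tracing-v1", "prod-gw",
                    datetime(2025, 1, 15, tzinfo=timezone.utc)),
    TelemetryPolicy("infra", "staging", "staging-gw",
                    datetime(2025, 2, 1, tzinfo=timezone.utc)),
]

winner = effective_policy(policies, "prod-gw")
print(winner.name)  # tracing-v1 (the older of the two prod-gw policies)
```

Under this assumption, an implementation would configure the data plane from the winning policy only, and could surface the losing policies as conflicted in their status rather than attempting to merge them.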
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,11 @@ | ||
| apiVersion: internal.gateway.networking.k8s.io/v1alpha1 | ||
| kind: GEPDetails | ||
| number: 4768 | ||
| name: Standardized Telemetry API | ||
| status: Provisional | ||
| authors: | ||
|
|
||
| - gkhom | ||
| seeAlso: | ||
| - https://github.com/kubernetes-sigs/kube-agentic-networking/pull/69 | ||
| - https://gateway-api.sigs.k8s.io/geps/gep-713/ | ||
|
|
||