gep-4768: Standardized Telemetry API (provisional) #4775
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,68 @@ | ||
| # GEP: Standardized Telemetry API | ||
|
|
||
| * Issue: #4768 | ||
| * Status: Provisional | ||
|
|
||
| ## TLDR | ||
|
|
||
| This proposal introduces a standardized, provider-agnostic Telemetry API to configure observability signals (metrics, access logs, and traces) for both North/South (Gateway) and East/West (Mesh) traffic, addressing the fragmentation caused by vendor-specific CRDs. | ||
|
|
||
| ## Goals | ||
|
|
||
| 1. Establish a standardized model to configure provider-agnostic telemetry (metrics, access logs, and traces) for both Gateway and Mesh. | ||
|
|
||
| 2. Enable separation of concerns between the persona managing networking infrastructure (Platform Team) and the persona governing telemetry signals (Observability/Security Team). | ||
|
Member
Why is this a goal? That is, is there an existing pain point where a current solution does not distinguish between the responsibilities of these personas with respect to telemetry?
Contributor
Author
My interpretation of the three roles and personas defined by the Gateway API is that they do not explicitly recognize that the persona responsible for gateway infrastructure might not be the same persona that is responsible for delivering consistent telemetry signals. So the reason to state it as a goal is to emphasize that this separation exists in (often larger) companies. If we do not want to account for this use case, I can remove this goal.
Member
I am +1 on removing this goal for now. We must consider the Gateway owner responsible for the telemetry policy at this moment. |
||
|
|
||
| ## Non-Goals | ||
|
|
||
| 1. Defining how the telemetry is exported (sinks/shippers) beyond specifying the provider endpoint. | ||
| 2. Replacing the underlying telemetry infrastructure (OTLP collectors, Prometheus, etc.). | ||
| 3. Standardizing metrics; this proposal exclusively focuses on the telemetry configuration API. | ||
|
|
||
| ## Introduction / Overview | ||
|
|
||
| This GEP proposes the addition of a standardized, provider-agnostic Telemetry API to the Gateway API project. The proposal aims to define a unified configuration model for the generation and propagation of telemetry signals (i.e., metrics, access logs, distributed traces) for both North/South (Gateway) and East/West (Mesh) traffic. | ||
|
|
||
| The API focuses on providing a consistent way to express observability intent, such as sampling rates for tracing, metric customization, and log filtering, regardless of the underlying data plane implementation. | ||
|
|
||
| ## Purpose (Why and Who) | ||
|
|
||
| ### The Fragmentation of Observability | ||
|
|
||
| In the current Kubernetes landscape, the "Who, What, Where, and How Long" of network traffic is answered differently depending on the underlying proxy technology. While the Gateway API specification has unified how traffic is routed via `HTTPRoute` and `Gateway`, it has deferred the standardization of how that traffic is observed. This deferral has led to "Observability Lock-in". Platform Engineering teams are forced to learn and manage distinct APIs for each environment. A standardized telemetry API is necessary to decouple the intent of observability from the implementation. Without such standardization it is difficult for platform owners to: | ||
|
|
||
| 1. Enforce consistent auditing standards across different infrastructure providers. | ||
| 2. Manage "Mesh" and "Gateway" observability with a single unified API. | ||
|
Member
As a proposal: I am happy to go with this GEP if we limit it for now to the Gateway attachment and do not consider the Mesh attachment. We can deal with Mesh attachment later as an extended feature of TelemetryPolicy. If we can focus on Gateway attachment only for now, this LGTM. |
||
| 3. Support emerging workloads like AI Agents, which elevate the criticality of observability due to their autonomous, non-deterministic nature and requirements for specialized signals. | ||
|
|
||
| ### Who | ||
|
|
||
| - **Platform Operators**: Need to ensure uniform observability across all networking infrastructure. | ||
| - **Observability Teams**: Responsible for the governance of telemetry data. They need to define and enforce standardized schemas and collection policies across the entire organization. | ||
| - **Security/Auditing Teams**: Require a standardized audit trail for all traffic, especially for autonomous agent actions. | ||
|
|
||
| - **Application Developers**: Benefit from consistent metrics and traces for debugging without worrying about the underlying mesh or gateway technology. | ||
|
|
||
| ## API | ||
|
|
||
|
|
||
| ### Policy Attachment vs. Inline Configuration | ||
|
|
||
|
|
||
| A key area of discussion for this GEP is whether this should be a standalone Policy Attachment (e.g., `TelemetryPolicy`) or inline configuration within `Gateway` and `Mesh` resources. | ||
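To make the two options concrete, here is a minimal sketch of each shape. It is illustrative only: the `TelemetryPolicy` API group/version, the `tracing.samplingPercent` field, and the inline `telemetry` field on `Gateway` are hypothetical names that this GEP has not defined. Only the `targetRefs` layout (group, kind, name) follows the existing policy attachment convention.

```yaml
# Option A: standalone policy attachment.
# The apiVersion and every field under `spec` except `targetRefs`
# are assumptions made for illustration.
apiVersion: gateway.networking.x-k8s.io/v1alpha1
kind: TelemetryPolicy
metadata:
  name: default-telemetry
  namespace: infra
spec:
  targetRefs:
    - group: gateway.networking.k8s.io
      kind: Gateway
      name: prod-gateway
  tracing:
    samplingPercent: 10   # hypothetical field
---
# Option B: inline configuration.
# `spec.telemetry` does not exist on Gateway today; it is shown
# only to contrast with Option A.
apiVersion: gateway.networking.k8s.io/v1
kind: Gateway
metadata:
  name: prod-gateway
  namespace: infra
spec:
  gatewayClassName: example-class
  listeners:
    - name: http
      protocol: HTTP
      port: 80
  telemetry:              # hypothetical field
    tracing:
      samplingPercent: 10
```

In Option A the telemetry settings live in their own object and can be reused by listing more targets. In Option B every `Gateway` carries its own copy.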
|
|
||
| This proposal argues that the Policy Attachment model is the most effective approach to meet the stated goals, primarily for two reasons: | ||
|
|
||
|
|
||
| 1. **Separation of Concerns**: It allows different personas to manage Gateway infrastructure (the Platform Team) independently from the configuration of telemetry signals (the Observability team). | ||
|
Member
In many organizations, the platform team is also the observability team. I think this distinction happens mostly in larger organizations. However, if we were to compare policy attachment with inline configuration in HTTPRoute (as a filter, for example), the distinction between app developers and platform/observability teams would make a stronger case, IMO.
Contributor
Author
Added my thoughts regarding this to the thread above: https://github.com/kubernetes-sigs/gateway-api/pull/4775/changes#r3171874793 |
||
| 2. **Fleet-Wide Uniformity**: It enables a single policy to be applied uniformly across a fleet of Gateways and Meshes, eliminating the need to duplicate complex telemetry configurations across individual resources. | ||
|
Member
I may be misunderstanding this: does this argue for a policy that is applied at the Mesh and Gateway level without distinction in targetRef (i.e., a targetRef that encompasses both meshes and gateways)?
|
||
|
|
||
| To avoid complex merging semantics, this GEP restricts configuration so that only a single `TelemetryPolicy` can be in effect for a given `Gateway` or `Mesh` at any time. If multiple `TelemetryPolicy` resources target the same object, precedence is determined by creation timestamp. | ||
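The conflict case can be sketched with two policies that target the same `Gateway`. The text does not say which direction the timestamp comparison goes. This sketch assumes the older policy wins, as is common in Gateway API conflict resolution, and that is an assumption rather than something this GEP states. The API group/version and the `tracing.samplingPercent` field are also hypothetical.

```yaml
# Both policies target the same Gateway. `creationTimestamp` is set by the
# API server and is shown here only to illustrate the ordering.
apiVersion: gateway.networking.x-k8s.io/v1alpha1   # assumed group/version
kind: TelemetryPolicy
metadata:
  name: telemetry-a
  namespace: infra
  creationTimestamp: "2025-01-10T09:00:00Z"   # older: assumed to take effect
spec:
  targetRefs:
    - group: gateway.networking.k8s.io
      kind: Gateway
      name: prod-gateway
  tracing:
    samplingPercent: 10   # hypothetical field
---
apiVersion: gateway.networking.x-k8s.io/v1alpha1   # assumed group/version
kind: TelemetryPolicy
metadata:
  name: telemetry-b
  namespace: infra
  creationTimestamp: "2025-02-01T12:00:00Z"   # newer: assumed to be ignored
spec:
  targetRefs:
    - group: gateway.networking.k8s.io
      kind: Gateway
      name: prod-gateway
  tracing:
    samplingPercent: 50   # would not be merged with telemetry-a
```

Under this assumption nothing from `telemetry-b` is merged into the effective configuration, which is what keeps the semantics simple.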
|
|
||
| ### High-level Considerations | ||
|
|
||
| - **Tracing**: Configuration for OTLP endpoints, sampling rates (probabilistic and parent-based), and custom span attributes. | ||
| - **Metrics**: Ability to enable/disable specific metric families and customize dimensions (labels/attributes). | ||
| - **Access Logs**: Filtering for smart logging (e.g., only log 5xx errors or high latency), multi-protocol support, and log format customization (including field selection). | ||
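As a rough illustration of how the three signal areas above could fit into one resource, the following sketch fills in each of them. Every field name under `spec` other than `targetRefs` is a hypothetical placeholder, as are the API group/version, the collector address, and the chosen values. The GEP is provisional and does not yet define a schema.

```yaml
apiVersion: gateway.networking.x-k8s.io/v1alpha1   # assumed group/version
kind: TelemetryPolicy
metadata:
  name: signals-example
  namespace: infra
spec:
  targetRefs:
    - group: gateway.networking.k8s.io
      kind: Gateway
      name: prod-gateway
  # Tracing: OTLP endpoint, sampling, custom span attributes.
  tracing:
    otlpEndpoint: otel-collector.observability.svc:4317   # example address
    sampling:
      parentBased: true     # honor the caller's sampling decision
      probability: "0.05"   # otherwise sample about 5% of requests
    customAttributes:
      - name: deployment.environment
        value: production
  # Metrics: enable or disable families and adjust dimensions.
  metrics:
    families:
      - name: http.server.request.duration
        enabled: true
        dimensions:
          - http.route
          - http.response.status_code
  # Access logs: filter what is logged and choose the format.
  accessLogs:
    filter:
      minStatusCode: 500    # only log 5xx responses
      minDuration: 1s       # or requests slower than one second
    format:
      type: JSON
      fields:
        - start_time
        - method
        - path
        - response_code
        - duration
```

Whether the two log filter conditions combine with AND or OR, and whether sampling is expressed as a fraction or a percentage, are the kind of details the eventual schema would need to settle.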
|
|
||
|
|
||
| ## Request Flow | ||
|
|
||
| * A platform operator creates a `TelemetryPolicy` resource targeting a `Gateway` or `Mesh`. | ||
| * The Gateway API implementation reconciles this resource and configures the underlying data plane. | ||
| * The data plane extracts the specified signals and exports them to the telemetry infrastructure. | ||
|
|
||
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,11 @@ | ||
| apiVersion: internal.gateway.networking.k8s.io/v1alpha1 | ||
| kind: GEPDetails | ||
| number: 4768 | ||
| name: Standardized Telemetry API | ||
| status: Provisional | ||
| authors: | ||
|
|
||
| - gkhom | ||
| seeAlso: | ||
| - https://github.com/kubernetes-sigs/kube-agentic-networking/pull/69 | ||
| - https://gateway-api.sigs.k8s.io/geps/gep-713/ | ||
|
|
||