gep-4768: Standardized Telemetry API (provisional)#4775
|
This was touched on a little bit by @rikatz, but one of the concerns we had when originally designing our own tracing policy was where to attach it. We chose the Route level instead of the Gateway level. Tracing can be an expensive operation, and in many cases it's used as a debugging tool for specific request flows, not necessarily the entire proxy (though sometimes it could be). For that reason, we wanted to be more granular so that the Policy could attach just to the desired Routes and not incur the overhead of potentially tracing every request across the entire Gateway. We did consider an Inherited Policy that could attach to either a Gateway or a Route, but we haven't implemented that. |
FWIW one downside of this approach is that any traffic that fails before a route can be matched would (presumably?) be dropped. May not be enough to warrant not doing this, just something to keep in mind. Gateway + Route policies can mitigate this a bit since you could have a default that is overridden |
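A minimal sketch of the "Gateway default, overridden per Route" idea mentioned above, assuming a hypothetical `TracingPolicy` shape and a hypothetical `effective_tracing` helper. Neither comes from the GEP or from any implementation; the point is only to show that a request which fails before route matching still falls back to the Gateway-level default.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass(frozen=True)
class TracingPolicy:
    """Hypothetical tracing settings; not the GEP's actual schema."""
    enabled: bool
    sampling_percent: float


def effective_tracing(
    gateway_default: Optional[TracingPolicy],
    route_policies: Dict[str, TracingPolicy],
    matched_route: Optional[str],
) -> Optional[TracingPolicy]:
    """Pick the policy that applies to a single request.

    matched_route is None when the request failed before any route matched.
    """
    # A Route-level policy, when present, overrides the Gateway default.
    if matched_route is not None and matched_route in route_policies:
        return route_policies[matched_route]
    # Unmatched requests, and routes without their own policy,
    # fall back to the Gateway-level default (which may be absent).
    return gateway_default
```

With a low-sampling Gateway default and a full-sampling policy on one Route, only that Route pays the full tracing cost, while early failures are still covered by the default.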
LiorLieberman
left a comment
Thanks for writing this, and +1 to the need to standardize it, @gkhom.
> ## Goals
>
> 1. Establish a standardized model to configure provider-agnostic telemetry (metrics, access logs, and traces) for both Gateway and Mesh.
> 2. Enable separation of concerns between the persona managing networking infrastructure (Platform Team) and the persona governing telemetry signals (Observability/Security Team).

Why is this a goal? I.e., is there an existing pain point where a current solution does not distinguish between the responsibilities of these personas with respect to telemetry?
My interpretation of the three roles and personas defined by the Gateway API is that they do not explicitly recognize that the persona responsible for gateway infrastructure might not be the same persona that is responsible for delivering consistent telemetry signals. So the reason to state it as a goal is to emphasize that this separation exists in (often larger) companies. If we do not want to account for this use case, I can remove this goal.
I am +1 on removing this goal for now. At this moment, we must consider the Gateway owner to be responsible for the telemetry policy.
> This proposal argues that the Policy Attachment model is the most effective approach to meet the stated goals, primarily for two reasons:
>
> 1. **Separation of Concerns**: It allows different personas to manage Gateway infrastructure (the Platform Team) independently from the configuration of telemetry signals (the Observability team).

In many organizations, the platform team is also the observability team. I think this distinction exists mostly in larger organizations.
However, if we were to compare policy attachment against, for example, an inline filter in HTTPRoute, the distinction between the app developer and the platform/observability teams would make a stronger case IMO.
Added my thoughts regarding this to the thread above: https://github.com/kubernetes-sigs/gateway-api/pull/4775/changes#r3171874793
> This proposal argues that the Policy Attachment model is the most effective approach to meet the stated goals, primarily for two reasons:
>
> 1. **Separation of Concerns**: It allows different personas to manage Gateway infrastructure (the Platform Team) independently from the configuration of telemetry signals (the Observability team).
> 2. **Fleet-Wide Uniformity**: It enables a single policy to be applied uniformly across a fleet of Gateways and Meshes, eliminating the need to duplicate complex telemetry configurations across individual resources.

I may be misunderstanding this: does this argue for a policy that is applied at the Mesh and Gateway level without distinction in targetRef (i.e., a targetRef that encompasses both meshes and gateways)?
Co-authored-by: Lior Lieberman <liorlib7+riskified@gmail.com>
> In the current Kubernetes landscape, the "Who, What, Where, and How Long" of network traffic is answered differently depending on the underlying proxy technology. While the Gateway API specification has unified how traffic is routed via `HTTPRoute` and `Gateway`, it has deferred the standardization of how that traffic is observed. This deferral has led to "Observability Lock-in". Platform Engineering teams are forced to learn and manage distinct APIs for each environment. A standardized telemetry API is necessary to decouple the intent of observability from the implementation. Without such standardization it is difficult for platform owners to:
>
> 1. Enforce consistent auditing standards across different infrastructure providers.
> 2. Manage "Mesh" and "Gateway" observability with a single unified API.

As a proposal: I am happy to go with this GEP if we limit it for now to Gateway attachment and do not consider Mesh attachment.
We can deal with Mesh attachment later as an extended feature of TelemetryPolicy. If we can focus on Gateway attachment only for now, this lgtm.
|
@gkhom I am ok with the proposal if we reduce its scope:
So, long story short, if we can reduce the scope of this proposal to TelemetryPolicy being a policy attached to a Gateway, I think we can move. @youngnick for your consideration as well |
|
Agreed, if we scope this to Gateway first, then I think we are good to go. We can come back and discuss expanding that scope once we have some more examples and can understand better what that would look like. |
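To make the reduced scope concrete, here is a minimal sketch of what "attaches to a Gateway only" could mean when checking a policy's target references. The `TargetRef` shape and the `rejected_targets` helper are assumptions for illustration and are not taken from the GEP; only the `gateway.networking.k8s.io` group and the `Gateway` kind are real Gateway API identifiers.

```python
from dataclasses import dataclass
from typing import List

GATEWAY_GROUP = "gateway.networking.k8s.io"


@dataclass(frozen=True)
class TargetRef:
    """Hypothetical, simplified target reference (group, kind, name)."""
    group: str
    kind: str
    name: str


def rejected_targets(target_refs: List[TargetRef]) -> List[TargetRef]:
    """Return the refs that fall outside the Gateway-only scope.

    Assumption: under the reduced scope, anything that is not a
    Gateway (for example a Route or a mesh resource) is not accepted.
    """
    return [
        ref
        for ref in target_refs
        if (ref.group, ref.kind) != (GATEWAY_GROUP, "Gateway")
    ]
```

An empty result means every target is a Gateway. Widening the scope later would amount to extending the accepted (group, kind) pairs.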
|
/ok-to-test |
Scoped the proposal down to Gateway only (no Mesh) and addressing the remaining feedback.
LiorLieberman
left a comment
Looks good to me and aligned with the feedback and the scope that was mentioned above and discussed in the OSS meetings.
Left a few small language nits.
Thanks @gkhom for iterating on this!
> This proposal argues that the Policy Attachment model is the most effective approach to meet the stated goals, primarily for two reasons:
>
> 1. **Separation of Concerns**: It allows different personas to manage Gateway infrastructure independently from the configuration of telemetry signals.

| 1. **Separation of Concerns**: It allows different personas to manage Gateway infrastructure independently from the configuration of telemetry signals. | |
| 1. **Separation of Concerns**: When we reason about telemetry, it is commonly not the app developer who activates or sets telemetry config. More commonly it is the platform engineer or, in somewhat larger organizations, the observability engineer or security engineer. Therefore, HTTPRoute is likely not the direction we want to start with. Similarly, it allows different personas to manage Gateway infrastructure independently from the configuration of telemetry signals. |
|
Also - can you fix the verify? (run |
|
/ok-to-test |
|
the gep lgtm, thanks. I will leave my approval and the "lgtm" for @LiorLieberman . Also we need the metadata.yml file to be fixed according to the linter error: |
|
/approve letting the lgtm for Lior |
|
[APPROVALNOTIFIER] This PR is APPROVED. This pull request has been approved by: gkhom, LiorLieberman, rikatz |
Co-authored-by: Lior Lieberman <liorlib7+riskified@gmail.com>
|
/lgtm |
|
/unhold |
What type of PR is this?
/kind gep
What this PR does / why we need it:
This GEP proposes a standardized, provider-agnostic Telemetry API to configure observability signals for Gateways and Meshes.
Which issue(s) this PR fixes:
Fixes #4768
Does this PR introduce a user-facing change?: