gep-4768: Standardized Telemetry API (provisional) #4775

Merged
k8s-ci-robot merged 8 commits into kubernetes-sigs:main from gkhom:main
May 7, 2026

@gkhom
Contributor

gkhom commented Apr 21, 2026

What type of PR is this?

/kind gep

What this PR does / why we need it:

This GEP proposes a standardized, provider-agnostic Telemetry API to configure observability signals for Gateways and Meshes.

Which issue(s) this PR fixes:

Fixes #4768

Does this PR introduce a user-facing change?:

NONE

@k8s-ci-robot k8s-ci-robot added release-note-none Denotes a PR that doesn't merit a release note. kind/gep PRs related to Gateway Enhancement Proposal(GEP) do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. labels Apr 21, 2026
@k8s-ci-robot
Contributor

Welcome @gkhom!

It looks like this is your first PR to kubernetes-sigs/gateway-api 🎉. Please refer to our pull request process documentation to help your PR have a smooth ride to approval.

You will be prompted by a bot to use commands during the review process. Do not be afraid to follow the prompts! It is okay to experiment. Here is the bot commands documentation.

You can also check if kubernetes-sigs/gateway-api has its own contribution guidelines.

You may want to refer to our testing guide if you run into trouble with your tests not passing.

If you are having difficulty getting your pull request seen, please follow the recommended escalation practices. Also, for tips and tricks in the contribution process you may want to read the Kubernetes contributor cheat sheet. We want to make sure your contribution gets all the attention it needs!

Thank you, and welcome to Kubernetes. 😃

@k8s-ci-robot k8s-ci-robot added size/M Denotes a PR that changes 30-99 lines, ignoring generated files. needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. labels Apr 21, 2026
@k8s-ci-robot
Contributor

Hi @gkhom. Thanks for your PR.

I'm waiting for a kubernetes-sigs member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work.

Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@k8s-ci-robot k8s-ci-robot added the cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. label Apr 21, 2026
@sjberman
Contributor

This was touched on a little bit by @rikatz, but one of the concerns we had when originally designing our own tracing policy was where to attach it. We chose the Route level instead of the Gateway level. Tracing can be an expensive operation, and in many cases it's used as a debugging tool for specific request flows, not necessarily the entire proxy (though sometimes it could be). For that reason, we wanted to be more granular so that the Policy could attach just to the desired Routes and not incur the overhead of potentially tracing every request across the entire Gateway. We did consider an Inherited Policy that could attach to either a Gateway or a Route, but we haven't implemented that.

@howardjohn
Contributor

> This was touched on a little bit by @rikatz, but one of the concerns we had when originally designing our own tracing policy was where to attach it. We chose the Route level instead of the Gateway level. Tracing can be an expensive operation, and in many cases it's used as a debugging tool for specific request flows, not necessarily the entire proxy (though sometimes it could be). For that reason, we wanted to be more granular so that the Policy could attach just to the desired Routes and not incur the overhead of potentially tracing every request across the entire Gateway. We did consider an Inherited Policy that could attach to either a Gateway or a Route, but we haven't implemented that.

FWIW, one downside of this approach is that any traffic that fails before a route can be matched would (presumably?) be dropped from the traces. This may not be enough to warrant not doing it, just something to keep in mind. Gateway + Route policies can mitigate this a bit, since you could have a Gateway-level default that a Route-level policy overrides.

Member

@LiorLieberman left a comment

Thanks for writing this, +1 to the need to standardize it @gkhom.

Comment thread geps/gep-4768/index.md Outdated
## Goals

1. Establish a standardized model to configure provider-agnostic telemetry (metrics, access logs, and traces) for both Gateway and Mesh.
2. Enable separation of concerns between the persona managing networking infrastructure (Platform Team) and the persona governing telemetry signals (Observability/Security Team).

Member

Why is this a goal? I.e., is there an existing pain point where a current solution does not distinguish between the responsibilities of these personas with respect to telemetry?

Contributor Author

My interpretation of the three roles and personas defined by the Gateway API is that they do not explicitly recognize that the persona responsible for gateway infrastructure might not be the same persona that is responsible for delivering consistent telemetry signals. So the reason to state it as a goal is to emphasize that this separation exists in (often larger) companies. If we do not want to account for this use case, I can remove this goal.

Member

I am +1 on removing this goal for now. At this moment, we must consider the Gateway owner to be the one responsible for the telemetry policy.

Comment thread geps/gep-4768/index.md Outdated

This proposal argues that the Policy Attachment model is the most effective approach to meet the stated goals, primarily for two reasons:

1. **Separation of Concerns**: It allows different personas to manage Gateway infrastructure (the Platform Team) independently from the configuration of telemetry signals (the Observability team).

Member

In many organizations, the platform team is also the observability team. I think this distinction exists mostly in larger organizations.

However, if we were to compare policy attachment with an inline HTTPRoute filter, for example, the distinction between app developers and platform/observability teams would make a stronger case IMO.

Comment thread geps/gep-4768/index.md Outdated
This proposal argues that the Policy Attachment model is the most effective approach to meet the stated goals, primarily for two reasons:

1. **Separation of Concerns**: It allows different personas to manage Gateway infrastructure (the Platform Team) independently from the configuration of telemetry signals (the Observability team).
2. **Fleet-Wide Uniformity**: It enables a single policy to be applied uniformly across a fleet of Gateways and Meshes, eliminating the need to duplicate complex telemetry configurations across individual resources.

Member

I may be misunderstanding this: does this argue for a policy that is applied at both the Mesh and Gateway level without distinction in targetRef (i.e., a targetRef that encompasses both Meshes and Gateways)?

gkhom and others added 2 commits April 30, 2026 19:34
Co-authored-by: Lior Lieberman <liorlib7+riskified@gmail.com>
Comment thread geps/gep-4768/index.md Outdated
In the current Kubernetes landscape, the "Who, What, Where, and How Long" of network traffic is answered differently depending on the underlying proxy technology. While the Gateway API specification has unified how traffic is routed via `HTTPRoute` and `Gateway`, it has deferred the standardization of how that traffic is observed. This deferral has led to "Observability Lock-in". Platform Engineering teams are forced to learn and manage distinct APIs for each environment. A standardized telemetry API is necessary to decouple the intent of observability from the implementation. Without such standardization it is difficult for platform owners to:

1. Enforce consistent auditing standards across different infrastructure providers.
2. Manage "Mesh" and "Gateway" observability with a single unified API.

Member

As a proposal: I am happy to go with this GEP if we limit it for now to Gateway attachment and do not consider Mesh attachment.

We can deal with Mesh attachment later as an extended feature of TelemetryPolicy. If we can focus on Gateway attachment only for now, this lgtm.

@rikatz
Member

rikatz commented May 4, 2026

@gkhom I am ok with the proposal if we reduce its scope:

  • Let's not consider a cluster-scoped policy for now, as this needs a different approach from a policy attachment perspective and may slow things down.
  • Let's make Mesh attachment its own feature and its own GEP once we move forward with TelemetryPolicy.

So, long story short, if we can reduce the scope of this proposal to TelemetryPolicy being a policy attached to a Gateway, I think we can move forward.

@youngnick for your consideration as well

@youngnick
Contributor

Agreed, if we scope this to Gateway first, then I think we are good to go. We can come back and discuss expanding that scope once we have some more examples and can understand better what that would look like.

@rikatz
Member

rikatz commented May 5, 2026

/ok-to-test

@k8s-ci-robot k8s-ci-robot added ok-to-test Indicates a non-member PR verified by an org member that is safe to test. and removed needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. labels May 5, 2026
Scoped the proposal down to Gateway only (no Mesh) and addressing the remaining feedback.

Member

@LiorLieberman left a comment

Looks good to me and aligned with the feedback and the scope that was mentioned above and discussed in the OSS meetings.

Left a few small language nits.

Thanks @gkhom for iterating on this!

Comment thread geps/gep-4768/index.md Outdated

This proposal argues that the Policy Attachment model is the most effective approach to meet the stated goals, primarily for two reasons:

1. **Separation of Concerns**: It allows different personas to manage Gateway infrastructure independently from the configuration of telemetry signals.

Member

Suggested change
1. **Separation of Concerns**: It allows different personas to manage Gateway infrastructure independently from the configuration of telemetry signals.
1. **Separation of Concerns**: When we reason about telemetry, it is commonly not the app developer who activates or sets telemetry config; more commonly it is the platform engineer or, in slightly larger organizations, the observability engineer or security engineer. Therefore, HTTPRoute is likely not the direction we want to start with. Similarly, it allows different personas to manage Gateway infrastructure independently from the configuration of telemetry signals.

@LiorLieberman
Member

Also - can you fix the verify? (run hack/verify-all and fix based on the output)

@rikatz
Member

rikatz commented May 7, 2026

/ok-to-test

@rikatz
Member

rikatz commented May 7, 2026

the gep lgtm, thanks.

I will leave my approval and the "lgtm" for @LiorLieberman.

Also, we need the metadata.yaml file to be fixed according to the linter error:

 ./geps/gep-4768/metadata.yaml
  9:2       error    wrong indentation: expected 2 but found 1  (indentation)

@rikatz
Member

rikatz commented May 7, 2026

/approve

leaving the lgtm for Lior

@k8s-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: gkhom, LiorLieberman, rikatz

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label May 7, 2026
@LiorLieberman
Member

/lgtm
Thanks!

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label May 7, 2026
@LiorLieberman
Member

/unhold

@k8s-ci-robot k8s-ci-robot removed the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label May 7, 2026
@k8s-ci-robot k8s-ci-robot merged commit 76c97f7 into kubernetes-sigs:main May 7, 2026
8 checks passed
Labels

approved Indicates a PR has been approved by an approver from all required OWNERS files. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. kind/gep PRs related to Gateway Enhancement Proposal(GEP) lgtm "Looks good to me", indicates that a PR is ready to be merged. ok-to-test Indicates a non-member PR verified by an org member that is safe to test. release-note-none Denotes a PR that doesn't merit a release note. size/M Denotes a PR that changes 30-99 lines, ignoring generated files.

Development

Successfully merging this pull request may close these issues.

EXP: Standardized Telemetry API

9 participants