
Re-entrant apiserver calls are not part of the same trace #103186

Open
dashpole opened this issue Jun 25, 2021 · 34 comments · May be fixed by #127053
Assignees
Labels
kind/bug Categorizes issue or PR as related to a bug. sig/api-machinery Categorizes an issue or PR as relevant to SIG API Machinery. sig/instrumentation Categorizes an issue or PR as relevant to SIG Instrumentation. triage/accepted Indicates an issue or PR is ready to be actively worked on.

Comments

@dashpole
Contributor

Part of enhancement: kubernetes/enhancements#647

This is a continuation of the discussion in #94942 (comment) about the risks of abuse that might be possible. To summarize, the otelhttp.WithPublicEndpoint() option causes spans generated by the API Server handler to start a new trace (with a link to the incoming trace), rather than be children of the incoming request.
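For context, a minimal sketch (not the actual apiserver wiring) of what that option changes: with WithPublicEndpoint set, the server span starts a new trace and only links to the caller-supplied context; without it, the server span would be a child of that context.

```go
package main

import (
	"net/http"

	"go.opentelemetry.io/contrib/instrumentation/net/http/otelhttp"
)

func main() {
	inner := http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		w.WriteHeader(http.StatusOK)
	})

	// Public endpoint: the incoming traceparent is recorded as a link on a
	// new root span rather than used as the parent.
	public := otelhttp.NewHandler(inner, "KubernetesAPI",
		otelhttp.WithPublicEndpoint())

	// Without the option, the server span would instead be a child of the
	// caller-supplied trace context.
	_ = otelhttp.NewHandler(inner, "KubernetesAPI")

	_ = http.ListenAndServe(":8080", public)
}
```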

Note: I think there is a bug in OpenTelemetry in which parent contexts are still used for sampling decisions: open-telemetry/opentelemetry-go#2031. I'll make sure that is fixed separately.

However, one of the use-cases outlined in the KEP is tracing re-entrant API calls. Using WithPublicEndpoint makes re-entrant calls difficult to visualize, because each call is now its own trace. We should probably keep the default behavior of using WithPublicEndpoint, but expose an option to make it non-public. But first, we should discuss and document the abuse implications for an actor with the ability to send requests to the Kubernetes API with a span context. The implications discussed thus far are:

  • If the actor has access to the trace ID from another trace, they can append spans to it. This wouldn't alter any existing spans in that trace, but their span would be added to it.
  • The actor can set sampled=true in their request to cause the apiserver to record spans for all of their requests. They could potentially overwhelm a tracing backend, or cause other spans to be dropped.

One unanswered question from the thread from @liggitt:

@lavalamp, you've thought about the abuse vectors more w.r.t. priority and fairness... do you have thoughts on this?

@dashpole dashpole added the kind/bug Categorizes issue or PR as related to a bug. label Jun 25, 2021
@k8s-ci-robot k8s-ci-robot added needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. labels Jun 25, 2021
@dashpole
Contributor Author

/sig instrumentation
/sig apimachinery

cc @logicalhan @lilic

@k8s-ci-robot
Contributor

@dashpole: The label(s) sig/apimachinery cannot be applied, because the repository doesn't have them.

In response to this:

/sig instrumentation
/sig apimachinery

cc @logicalhan @lilic

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@k8s-ci-robot k8s-ci-robot added sig/instrumentation Categorizes an issue or PR as relevant to SIG Instrumentation. and removed needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. labels Jun 25, 2021
@dashpole
Contributor Author

/sig api-machinery

@k8s-ci-robot k8s-ci-robot added the sig/api-machinery Categorizes an issue or PR as relevant to SIG API Machinery. label Jun 25, 2021
@logicalhan
Member

/assign

@lavalamp
Member

We should never be adding to user-controlled traces, IMO.

apiserver should add baggage to its trace headers. Re-entrant calls should repeat the baggage. Then the 2nd apiserver can use this to verify that the trace originated from some apiserver in the cluster. In that case, it should append to the trace. Otherwise, it should start a new one. This should not be an option for the user. There are multiple ways of constructing such a baggage value.
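One possible shape of that idea, as a hedged sketch: the member name, key handling, and plumbing below are hypothetical, not apiserver code. The first apiserver attaches an HMAC of the trace ID, keyed with a cluster-local secret, as baggage; a re-entrant call carries it forward, and the receiving apiserver continues the trace only if the HMAC verifies.

```go
package tracecheck

import (
	"context"
	"crypto/hmac"
	"crypto/sha256"
	"encoding/hex"

	"go.opentelemetry.io/otel/baggage"
	"go.opentelemetry.io/otel/trace"
)

const memberKey = "k8s-apiserver-trace-auth" // hypothetical member name

func sign(clusterKey []byte, traceID trace.TraceID) string {
	mac := hmac.New(sha256.New, clusterKey)
	mac.Write(traceID[:])
	return hex.EncodeToString(mac.Sum(nil))
}

// Attach would be called before making a re-entrant request: it adds an
// HMAC of the current trace ID as a baggage member.
func Attach(ctx context.Context, clusterKey []byte) context.Context {
	sc := trace.SpanContextFromContext(ctx)
	m, err := baggage.NewMember(memberKey, sign(clusterKey, sc.TraceID()))
	if err != nil {
		return ctx
	}
	bag, err := baggage.New(m)
	if err != nil {
		return ctx
	}
	return baggage.ContextWithBaggage(ctx, bag)
}

// Verify would be called by the receiving apiserver: continue the incoming
// trace only when the baggage proves it was started by an apiserver in the
// cluster; otherwise start a new trace.
func Verify(ctx context.Context, clusterKey []byte) bool {
	sc := trace.SpanContextFromContext(ctx)
	got := baggage.FromContext(ctx).Member(memberKey).Value()
	want := sign(clusterKey, sc.TraceID())
	return got != "" && hmac.Equal([]byte(got), []byte(want))
}
```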

If you have a strong use case for a user starting a trace, let me know, but I don't think it's going to outweigh the anti-abuse perspective.

@logicalhan
Member

If you have a strong use case for a user starting a trace, let me know, but I don't think it's going to outweigh the anti-abuse perspective.

I'd like to be able to run kubectl apply -f blah.yaml --trace but I am indifferent as to whether that actually attaches a trace directly or runs through some webhook which validates who I am and then attaches a trace to the request.

@lavalamp
Member

I don't see how webhooks could attach a trace.

Letting users request that the apiserver initiate a trace is different from attaching a request to an already-created trace, no?

If we let users request extra resource usage (trace allocation), then IMO we have to authz that somehow.

@logicalhan
Member

Maybe I'm phrasing this incorrectly. We can have some mechanism by which kubectl either attaches a trace header directly, or attaches some sort of proxy pseudo trace header that the apiserver can then, after authz-ing (via a webhook or whatever), use to decide whether to actually attach a trace to the request.

@lavalamp
Member

lavalamp commented Jun 25, 2021 via email

@logicalhan
Member

I don't think it works this way, but I definitely don't want apiserver dialing out to some user-controlled address to report on a trace.

Yeah that's not really what I was saying.

I guess my concern is just bounding the resource usage. A non-authz solution might look like a cap on the %age of requests that get traced, and user-requested traces count against that cap (and are ignored if it would put it over the cap).

You would probably require separate quotas; generally trace sampling is super super low (think 1 in a hundred thousand requests), but yeah, I'm a fan of quotas in general.
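As a hedged sketch of the cap idea, here is roughly what a sampler that honors caller-requested sampling only within a global budget could look like; the limiter values and composition are illustrative assumptions, not an existing Kubernetes or OpenTelemetry option.

```go
package sampling

import (
	sdktrace "go.opentelemetry.io/otel/sdk/trace"
	"go.opentelemetry.io/otel/trace"
	"golang.org/x/time/rate"
)

type cappedSampler struct {
	base    sdktrace.Sampler // e.g. sdktrace.TraceIDRatioBased(1e-5)
	limiter *rate.Limiter    // budget for caller-requested traces
}

func NewCappedSampler(base sdktrace.Sampler, perSecond float64, burst int) sdktrace.Sampler {
	return &cappedSampler{base: base, limiter: rate.NewLimiter(rate.Limit(perSecond), burst)}
}

func (s *cappedSampler) ShouldSample(p sdktrace.SamplingParameters) sdktrace.SamplingResult {
	psc := trace.SpanContextFromContext(p.ParentContext)
	// Caller set sampled=true: honor it only while within the budget.
	if psc.IsValid() && psc.IsSampled() {
		if s.limiter.Allow() {
			return sdktrace.SamplingResult{Decision: sdktrace.RecordAndSample, Tracestate: psc.TraceState()}
		}
		return sdktrace.SamplingResult{Decision: sdktrace.Drop, Tracestate: psc.TraceState()}
	}
	// Otherwise fall back to the normal (very low) sampling rate.
	return s.base.ShouldSample(p)
}

func (s *cappedSampler) Description() string { return "CappedCallerRequestedSampler" }
```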

@fedebongio
Contributor

/triage accepted

@k8s-ci-robot k8s-ci-robot added triage/accepted Indicates an issue or PR is ready to be actively worked on. and removed needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. labels Jun 29, 2021
@k8s-triage-robot

The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue or PR as fresh with /remove-lifecycle stale
  • Mark this issue or PR as rotten with /lifecycle rotten
  • Close this issue or PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Sep 27, 2021
@dashpole
Contributor Author

dashpole commented Oct 4, 2021

/remove-lifecycle stale

@k8s-ci-robot k8s-ci-robot removed the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Oct 4, 2021
@k8s-triage-robot

The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue or PR as fresh with /remove-lifecycle stale
  • Mark this issue or PR as rotten with /lifecycle rotten
  • Close this issue or PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Jan 2, 2022
@k8s-triage-robot

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue or PR as fresh with /remove-lifecycle rotten
  • Close this issue or PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle rotten

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Jun 1, 2022
@dashpole
Contributor Author

dashpole commented Jun 1, 2022

/remove-lifecycle stale

@k8s-ci-robot k8s-ci-robot removed the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Jun 1, 2022
@k8s-triage-robot

The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue or PR as fresh with /remove-lifecycle stale
  • Mark this issue or PR as rotten with /lifecycle rotten
  • Close this issue or PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Aug 30, 2022
@logicalhan
Member

/remove-lifecycle stale

@k8s-ci-robot k8s-ci-robot removed the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Sep 22, 2022
@k8s-triage-robot

The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue or PR as fresh with /remove-lifecycle stale
  • Mark this issue or PR as rotten with /lifecycle rotten
  • Close this issue or PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Dec 21, 2022
@k8s-triage-robot

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue or PR as fresh with /remove-lifecycle rotten
  • Close this issue or PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle rotten

@k8s-ci-robot k8s-ci-robot added lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. and removed lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. labels Jan 20, 2023
@dashpole
Contributor Author

/remove-lifecycle rotten

@k8s-ci-robot k8s-ci-robot removed the lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. label Jul 31, 2023
@dashpole
Contributor Author

cc @richabanker

@mterhar

mterhar commented Aug 3, 2023

Would an acceptable solution be to have an option for internal-api-endpoints? If I have a network boundary around my apiserver, then I trust incoming traceparent headers and would propagate them.

Including that as a command argument or as part of the TracingConfiguration seems reasonable. If someone suggests a path that would be accepted, I'll happily take on implementing it.
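As a very rough sketch of what that could look like in configuration (the field name and semantics below are hypothetical, not an accepted or implemented API):

```go
// TracingConfiguration sketch: existing fields such as Endpoint and
// SamplingRatePerMillion are elided here.
type TracingConfiguration struct {
	// TrustIncomingTraceContext, if true, would treat incoming traceparent
	// headers as trusted and parent server spans on them, instead of the
	// current WithPublicEndpoint behavior of starting a new, linked trace.
	// Only safe when a network boundary restricts who can reach the
	// apiserver. Hypothetical field, for illustration only.
	TrustIncomingTraceContext bool `json:"trustIncomingTraceContext,omitempty"`
}
```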

@dashpole
Contributor Author

dashpole commented Aug 3, 2023

That is a reasonable approach. We need to think about:

  • Should this only exist in the apiserver config? Or is it applicable to all kubernetes components, and belong in component-base?
  • Do we need per-http-server configuration, or is a single option enough? IIRC, there was more than one http server instrumented for apiserver tracing.

@dprotaso
Contributor

A related use case I have: I want to debug some issues that appear in tests when my webhooks are invoked - they return a 200 response, but the API server fails and returns an EOF / context timeout.

To debug this I would like the trace to start when the test makes the call to the API server. For this to be useful I would expect the API server to attach the trace context to outgoing requests.

If I could enable this cluster wide for testing/debugging purposes that would be great.
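For reference, a sketch of the client-side propagation this implies, using the general otelhttp pattern rather than the apiserver's actual webhook client wiring: wrapping the HTTP transport injects the current span context (a traceparent header) into outgoing requests.

```go
package main

import (
	"context"
	"net/http"

	"go.opentelemetry.io/contrib/instrumentation/net/http/otelhttp"
	"go.opentelemetry.io/otel"
	"go.opentelemetry.io/otel/propagation"
)

func main() {
	// Use the W3C trace-context propagator so traceparent headers are set.
	otel.SetTextMapPropagator(propagation.TraceContext{})

	client := &http.Client{Transport: otelhttp.NewTransport(http.DefaultTransport)}

	// Requests built from a context carrying a span get a traceparent header.
	// The URL is a placeholder, standing in for a webhook endpoint.
	req, _ := http.NewRequestWithContext(context.Background(),
		http.MethodPost, "https://webhook.example.com/validate", nil)
	_, _ = client.Do(req)
}
```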

@KubeKyrie
Contributor

That is a reasonable approach. We need to think about:

  • Should this only exist in the apiserver config? Or is it applicable to all kubernetes components, and belong in component-base?
  • Do we need per-http-server configuration, or is a single option enough? IIRC, there was more than one http server instrumented for apiserver tracing.

I think @mterhar proposed a good solution. Having it exist only in the apiserver config seems OK at this stage, and both a single option and per-http-server configuration would work; maybe a single option is more generic?

@dashpole
Contributor Author

I have another option I'd like to consider, since I would like to avoid an additional config parameter:

If we move the tracing filter to after the authentication filter, we can ensure only authenticated requests are traced. This has the downside of not capturing latency incurred by authentication, but will fix this issue. I plan to work on a fix for this in OpenTelemetry: open-telemetry/opentelemetry-go#4993. But moving the filter to after authentication will unblock us for now, without the bad UX implications of not respecting the incoming trace context.
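A minimal sketch of that ordering, using plain net/http middleware rather than the apiserver's real filter chain: authentication runs first, so unauthenticated requests are rejected before any span is started and cannot force sampling.

```go
package main

import (
	"net/http"

	"go.opentelemetry.io/contrib/instrumentation/net/http/otelhttp"
)

// withAuthentication is a stand-in for the apiserver's authentication filter.
func withAuthentication(next http.Handler) http.Handler {
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		if r.Header.Get("Authorization") == "" { // placeholder check
			http.Error(w, "Unauthorized", http.StatusUnauthorized)
			return
		}
		next.ServeHTTP(w, r)
	})
}

func main() {
	api := http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		w.WriteHeader(http.StatusOK)
	})

	// Tracing sits inside authentication: requests only reach the tracing
	// handler after the authentication filter has let them through.
	traced := otelhttp.NewHandler(api, "KubernetesAPI")
	handler := withAuthentication(traced)

	_ = http.ListenAndServe(":8080", handler)
}
```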

@liggitt
Member

liggitt commented Feb 28, 2024

If we move the tracing filter to after the authentication filter, we can ensure only authenticated requests are traced. This has the downside of not capturing latency incurred by authentication, but will fix this issue

does it? does that require treating all authenticated users as trustworthy and assuming they won't pollute each other's traces?

@dashpole
Contributor Author

does it? does that require treating all authenticated users as trustworthy and assuming they won't pollute each other's traces?

It assumes all authenticated users are not malicious (e.g. they aren't trying to DDoS the apiserver). Someone could, in theory, start traces with the same trace ID as another request in order to make that request's trace unusable. But they would need access to the trace ID in the header of the other request, which doesn't seem likely.

From https://www.w3.org/TR/trace-context-2/#security-considerations:

There are two types of potential security risks associated with this specification: information exposure and denial-of-service attacks against the vendor.

Information exposure applies to outgoing requests, so our primary concern is around denial-of-service, which is covered here: https://www.w3.org/TR/trace-context-2/#denial-of-service

When distributed tracing is enabled on a service with a public API and naively continues any trace with the sampled flag set, a malicious attacker could overwhelm an application with tracing overhead, forge trace-id collisions that make monitoring data unusable, or run up your tracing bill with your SaaS tracing vendor.

Tracing vendors and platforms should account for these situations and make sure that checks and balances are in place to protect denial of monitoring by malicious or badly authored callers.

One example of such protection may be different tracing behavior for authenticated and unauthenticated requests. Various rate limiters for data recording can also be implemented.

I am hoping to take the example approach, and do tracing only for authenticated requests. You mention "authenticated but untrusted" users here. We are able to restrict the ability to influence tracing based on the incoming request. Is there a way to tell if a request would fall into the "authenticated but untrusted" category?
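For reference, the "sampled flag" in the quoted spec text is the trailing flags field of the traceparent header (01 = sampled). A small illustrative sketch (not apiserver code) of a caller-supplied header being extracted into a remote, sampled span context:

```go
package main

import (
	"context"
	"fmt"
	"net/http"

	"go.opentelemetry.io/otel/propagation"
	"go.opentelemetry.io/otel/trace"
)

func main() {
	h := http.Header{}
	// Anyone who can reach the endpoint can set this header, including the
	// trailing 01 flag that requests sampling.
	h.Set("traceparent", "00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01")

	ctx := propagation.TraceContext{}.Extract(context.Background(), propagation.HeaderCarrier(h))
	sc := trace.SpanContextFromContext(ctx)
	fmt.Println(sc.TraceID(), sc.IsSampled()) // sampled=true, chosen by the caller
}
```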
