Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

FaaS: Change requirements regarding handling of AWS Lambda-provided SpanContext #272

Closed

Conversation

Aneurysm9
Copy link
Member

This change clarifies the requirements regarding the handling of the SpanContext vended by the AWS Lambda service to function invocations.

@Aneurysm9 Aneurysm9 requested review from a team August 18, 2023 19:12
@Oberon00
Copy link
Member

Please adapt the PR title, since this is more than "clarification"

Copy link
Contributor

@jsuereth jsuereth left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Originally, I assumed the Link vs. Parent debate was one where the community wishes to move towards linking as the default, and we're providing configuration to aide in the transition for existing x-ray users.

This does not appear to be that direction, nor is this just a clarification of wording. Would like to hear from others in the FAAS group before allowing this to be merged.

[start options](https://github.com/open-telemetry/opentelemetry-specification/tree/v1.22.0/specification/trace/api.md#specifying-links) with an associated `String`-typed
attribute with a key of `source` and a value of `x-ray-env`, in order to indicate the source of the linked span.

Instrumentation MUST provide a mechanism for the user to select the behavior that they require it to utilize. In the absence of an expressed user preference
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This sentence flips the previous default setting. Is that intended / agreed upon by the FAAS SiG?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is the current state of most Lambda instrumentation. It is preserving the functioning behavior for users who had functioning instrumentation while clarifying that it is permissible to have an alternate behavior. I feel that this is a clarification as the spec as written permits this behavior and that was not apparent to all in the discussion during the last FaaS SIG meeting.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@Aneurysm9 Java and the semconv docs are not inline with this, and we would be indeed flipping the current default in this repo.

As said before: having this as default as direct parent will work great with X-Ray but will break the other vendors.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As said before: having this as default as direct parent will work great with X-Ray but will break the other vendors.

I do not believe that this is a correct interpretation of the situation. Having the default behavior to be to use the value as a parent SpanContext will preserve the existing behavior for for current users in all languages but Java. It will not break any users. Users who require a different behavior will be able to select that behavior.

Java and the semconv docs are not inline with this, and we would be indeed flipping the current default in this repo.

Java implemented a spec change that was incomplete and acknowledged to require changes. This is the semconv repo and this change is fixing the "semconv docs". In any event, this is "experimental" specification and subject to change. Is there a special process for such changes I'm unaware of that I should be following?

attribute with a key of `source` and a value of `x-ray-env`, in order to indicate the source of the linked span.

Instrumentation MUST provide a mechanism for the user to select the behavior that they require it to utilize. In the absence of an expressed user preference
instrumentation SHOULD utilize the extracted `Context` as the parent context for the `Invocation Span`. Instrumentation MAY require that the user explicitly
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we need to allow explicit configuration as a requirement? I'd prefer if we could all agree on a default behavior here.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, explicit configuration is a requirement. The Lambda-provided context does different things for different people and they need a mechanism for communicating their intent to the instrumentation.

@Aneurysm9
Copy link
Member Author

Please adapt the PR title, since this is more than "clarification"

I believe, as noted above, that this is a clarification of the intent behind the initial attempt to resolve open-telemetry/opentelemetry-specification#3060.

If you disagree, please provide alternate titles that would comport with your understanding of this change.


**Note**: When instrumenting a Java AWS Lambda, instrumentation SHOULD first try to parse an OpenTelemetry `Context` out of the system property `com.amazonaws.xray.traceHeader` using the [AWS X-Ray Propagator](https://github.com/open-telemetry/opentelemetry-specification/blob/main/specification/context/api-propagators.md) before checking and attempting to parse the environment variable above.
* as the parent `Context`
* as a `SpanContext` associated with a [Span Link][] added to the `Invocation Span`'s
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If this SpanContext is used as a span link instead of the parent, it won't work with ParentBased samplers like the OTel default ParentBased(root=AlwaysOn).

Does AWS provide head sampling through this SpanContext?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Head sampling can be provided through this SpanContext. It is my understanding that that is irrelevant to the users who wish to have it associated as a link because they are not using X-Ray as their tracing backend and do not desire to have a trace including the Lambda-vended segments. I believe the expectation is that those users will obtain a SpanContext from another source, such as HTTP headers included in the invocation event, and would want to use that SpanContext for any sampling decisions.

@Oberon00
Copy link
Member

@Aneurysm9 Simplest adaption of PR title would be "FaaS: Clarify Change requirements regarding handling of AWS Lambda-provided SpanContext"

@Aneurysm9 Aneurysm9 changed the title FaaS: Clarify requirements regarding handling of AWS Lambda-provided SpanContext FaaS: Change requirements regarding handling of AWS Lambda-provided SpanContext Aug 21, 2023
Copy link
Contributor

@carlosalberto carlosalberto left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Temporarily blocking as these changes are actually breaking the current semconv documents, changing once again the primary role of the context in the AWS specific environment variable.

@trask
Copy link
Member

trask commented Aug 21, 2023

Temporarily blocking as these changes are actually breaking the current semconv documents, changing once again the primary role of the context in the AWS specific environment variable.

@carlosalberto I don't personally agree with this change, but I also don't agree with blocking it for this reason, and I think we should let discussion continue to play out here and in open-telemetry/opentelemetry-specification#263

@Aneurysm9 my personal objection to this change is that it prioritizes the default experience for one monitoring backend (X-Ray) at the expense of the default experience for other monitoring backends, which doesn't feel in line with OpenTelemetry's vision. (btw I personally support an opt-in behavior to support X-Ray in this case)

@tylerbenson
Copy link
Member

tylerbenson commented Aug 21, 2023

I agree with @trask's objection to this change. While I think there are better solutions I'm not opposed to adding a setting to prefer the x-ray env context over remote context as long as the default is to use the remote. By default, traces reported via OTLP should be complete, regardless of if x-ray is enabled or not.
The only case where I would support the inverse default is if AWS made it trivial to export those internal x-ray spans through an otlp/collector pipeline. I've made such a request but have not received any response.

@Aneurysm9
Copy link
Member Author

my personal objection to this change is that it prioritizes the default experience for one monitoring backend (X-Ray) at the expense of the default experience for other monitoring backends, which doesn't feel in line with OpenTelemetry's vision. (btw I personally support an opt-in behavior to support X-Ray in this case)

From the vision document for clarity:

With OpenTelemetry, we strive to provide a level playing field for all observability providers, avoid lock-in to any vendor, and interoperate with other OSS projects in the telemetry and observability ecosystem.

I believe that there is an expectation inherent in this objection that using the AWS Lambda vended spans as a parent context is only relevant to users of AWS X-Ray. This is not the case. It is possible to utilize the AWS X-Ray trace context propagation format without using AWS X-Ray as a trace backend. It is possible to do so in a way that would cause AWS Lambda to participate in that trace, vending segments to AWS X-Ray that can be obtained by the user and migrated to the trace backend of their choice.

There is nothing in this change that would result in any user being locked to any vendor. There is nothing in this change that would preclude interoperability with any other system.

The reasons I feel it is appropriate for the default behavior to be to utilize the AWS Lambda vended context as a parent are that 1) this is the current behavior for all AWS Lambda instrumentation with the exception of the Java instrumentation and 2) this is instrumentation of an AWS service and it is reasonable to utilize the service-provided context propagation mechanism as the default.

@tylerbenson
Copy link
Member

@Aneurysm9 the issue here isn't the propagation mechanism. X-Ray propagation is fine and I even submitted spec changes to formalize that further. Our issue from the beginning has always been the forked context that passes through AWS services, creating spans that aren't reported in the same manner (to the same backend) as the rest of the spans within the trace. Such behavior results in broken traces which is unacceptable.

@trask
Copy link
Member

trask commented Aug 21, 2023

I believe that there is an expectation inherent in this objection that using the AWS Lambda vended spans as a parent context is only relevant to users of AWS X-Ray. This is not the case. It is possible to utilize the AWS X-Ray trace context propagation format without using AWS X-Ray as a trace backend. It is possible to do so in a way that would cause AWS Lambda to participate in that trace, vending segments to AWS X-Ray that can be obtained by the user and migrated to the trace backend of their choice.

@Aneurysm9 I'd like to understand this better. Could you link to docs for how e.g. a Jaeger user would set this up? thx

@Aneurysm9
Copy link
Member Author

@Aneurysm9 the issue here isn't the propagation mechanism. X-Ray propagation is fine and I even submitted spec changes to formalize that further. Our issue from the beginning has always been the forked context that passes through AWS services, creating spans that aren't reported in the same manner (to the same backend) as the rest of the spans within the trace. Such behavior results in broken traces which is unacceptable.

Now I'm confused. Is the issue the propagation mechanism, or is it not the propagation mechanism? Are broken traces acceptable or not?

Is this situation (a distributed trace passes through a system that is not in control of the application owner, resulting in spans reported to a trace backend that is not in the control of the application owner) not common to all distributed tracing backends and instrumentation? This is such a common situation that the specification requires that APIs must accept "[t]he parent Context or an indication that the new Span should be a root Span" at span creation to allow for this.

I also do not believe that there has ever been a situation where users of tracing backends other than AWS X-Ray have been wholly unable to create complete traces for interactions involving AWS Lambda function invocations. To the extent that their preferred parent context is carried by the invocation event and they have access to a TracerProvider and their desired context propagator, they are able to extract that context from the event and create a new span using it as the parent context. I'll allow that this is probably not the ideal user experience, but it is inaccurate to state otherwise.

@carlosalberto
Copy link
Contributor

(Removing my changes request as per @trask's feedback)

@carlosalberto carlosalberto self-requested a review August 21, 2023 20:06
@jsuereth
Copy link
Contributor

IIUC - The key user issue we are debating (and the one we're arguing to prioritize) is as follows:

Scenario 1

  • Service A uses OTEL and propagates W3c headers.
  • Service A makes a request through an AWS service where x-ray spans will be created, but also where the original OTEL span headers are propagated.
  • Service B (living in an AWS environment) wants to have tracing that leverages both the spans from AWS x-ray and Service A.

How should we attach traces to give users the best o11y here?

A few things I'm assuming:

  • AWS x-ray will create a new trace_id, either because of security boundaries or just how the product works. (Note GCP does the same for Cloud Run / Cloud Functions). This is the issue that we're fixing
  • Users want the ability to see the WHOLE trace, meaning Service A, AWS x-ray internal trace and Service B.

In this case, the ONLY way to not have a broken trace is to "link" against the AWS x-ray trace vs. parent against it. I believe this SHOULD be the default, including apply this to GCP services that have the same problem. This allows users to define their own "security boundaries" of what propagation they allow without being bound to what we, as cloud providers, require for trace_id generation.

Scenario 2

There are users who will leverage OOTB trace sampling and generation of their cloud provider. GCP, e.g. provides tracing and sampling of Cloud Run as a free option. Users may want to track their traces at this "entry point" level. This works when a trace is not going through one of these security / trace_id ownership boundaries between services for that user. I agree there should be a way for users to get what they need here.

The big question

Which scenario should be the default / more likely for users. It was my understanding that Scenario 1 would be the new default behavior, with Scenario 2 requiring user configuration, but achievable.

A few comments from the linked issue that lead to me raising concerns:

Instead of having this behavior where the x-ray propagation ignores the context from the parent if the environment variable is set, it should always use the context from the event as the parent and link to the environment variable context if present. I'm not sure what implications this would have on x-ray though.

Per discussion in the FAAS SIG, we decided that the aws x-ray environment variable should be moved to a span link to avoid interfering with the configured propagators.

This last one coming from this pr, which was approved and merged.

@Aneurysm9
Copy link
Member Author

I agree with @trask's objection to this change. While I think there are better solutions I'm not opposed to adding a setting to prefer the x-ray env context over remote context as long as the default is to use the remote.

The context provided by AWS Lambda in the execution environment is a remote context. The question is which of the remote contexts to use when multiple are present and is complicated by the fact that the specification recommends specific sources of context carriers be used. I have tried to minimize the change to the current specification here to avoid getting into issues of separation of responsibilities between instrumentation author, instrumentation user, instrumentation implementations, context propagators, and user applications. I'm not opposed to having that discussion again as long as we stop pushing changes that will break users for whom the instrumentation is presently functioning.

By default, traces reported via OTLP should be complete, regardless of if x-ray is enabled or not. The only case where I would support the inverse default is if AWS made it trivial to export those internal x-ray spans through an otlp/collector pipeline. I've made such a request but have not received any response.

I'm not sure where OTLP entered the picture here and this is not something that any instrumentation can guarantee precisely because distributed tracing involves interacting with systems outside the control of the instrumented system. Regardless which serialization format and network protocol is used to communicate trace data an incomplete trace will be experienced if one participant in the trace sends data to one backend and another participant sends data to a different backend.

I will let the AWS Lambda team handle your product feature request, but I will note here for posterity that it was made five days ago in a repository relevant to container images useful to AWS Lambda users who use the container packaging format for their functions. This is likely not the correct forum for such a product feature request.

@trask
Copy link
Member

trask commented Aug 21, 2023

A few things I'm assuming:

  • AWS x-ray will create a new trace_id, either because of security boundaries or just how the product works. (Note GCP does the same for Cloud Run / Cloud Functions). This is the issue that we're fixing

just checking my maths, doesn't the same problem exist even if the same traceId is used, since you still have to pick one of the spanIds to parent?

@jsuereth
Copy link
Contributor

@trask That's true. However if the same trace_id is used, I think it's less likely to cause broken traces, but would cause missing spans. Depends on the backend implemetnation at that point.

However, we're starting to sidestep the main point. Instead of repeating, I'll link my more succinct comment: #263 (comment)

@tylerbenson
Copy link
Member

I will let the AWS Lambda team handle your product feature request, but I will note here for posterity that it was made five days ago in a repository relevant to container images useful to AWS Lambda users who use the container packaging format for their functions. This is likely not the correct forum for such a product feature request.

@Aneurysm9 I understand that is likely not the most relevant location for such but it was the place that @rapphil recommended I provide such feedback and it seemed like a reasonable place to start such a conversation in public. If you have a better forum for that conversation, let me know and I'll be happy to take that suggestion elsewhere.
As for the timing, I added it after the recent SIG meeting as an alternative solution that seems like the only possible avenue for a win-win at this point as you can't seem to understand why other vendors are not happy with the default you're demanding.

@Aneurysm9
Copy link
Member Author

as you can't seem to understand why other vendors are not happy with the default you're demanding.

Please do not ascribe "demand"s to me that I have not made and please do not make assertions about my mental state that you cannot know.

I have indicated that the suggestion on open-telemetry/opentelemetry-specification#263 does not work for me, for a variety of reasons that have since been elaborated, and I have suggested an alternative.

@pyohannes
Copy link
Contributor

  • Service A uses OTEL and propagates W3c headers.
  • Service A makes a request through an AWS service where x-ray spans will be created, but also where the original OTEL span headers are propagated.
  • Service B (living in an AWS environment) wants to have tracing that leverages both the spans from AWS x-ray and Service A.

For the case where the AWS service in question is a messaging system (SNS, SQS), this maps to scenarios extensively discussed in the messaging SIG, results of this discussion are captured in OTEP 220.

TLDR; We always recommend to create a link between Service A and Service B, which denotes a producer/consumer relationship. Depending on variations of the scenarios (batch-processing, or broker instrumentation) there can or cannot be links between A and B, or between A, B, and the broker. However, the link between A and B is an invariant for all scenarios.

I'm not familiar with all details of this discussion, so I might be a bit off, however, it seems to me that it is closely related to issue #652. In my opinion, the question of parent vs link is not so much related to propagation as it is related to the question of how traces should look like for the user.

@Oberon00
Copy link
Member

. Our issue from the beginning has always been the forked context that passes through AWS services, creating spans that aren't reported in the same manner (to the same backend) as the rest of the spans within the trace. Such behavior results in broken traces which is unacceptable.

@tylerbenson While we have had similar troubles with this, it is not really an AWS problem, but a problem that in general will be even more widespread when cloud providers adapt W3C TraceContext headers. Then you won't even be able to switch to W3C propagation as a workaround. You can observe this when trying to monitor google cloud functions. Google, fully W3C conformant, participates in the trace an will thus override your span ID. Multi-tenancy is simply a missing feature from OTel and also seems to be somehow missed in the W3C spec (for a potential solution see open-telemetry/opentelemetry-specification#366 (comment))

@tylerbenson
Copy link
Member

@Oberon00 I agree and for that reason I suggest we work with cloud providers for easier ways to get access to those internal spans. Last week I submitted an issue on AWS's side: aws/aws-lambda-base-images#111 to see if I could get any reaction from them that doesn't involve just polling their API.

@jsuereth
Copy link
Contributor

FYI - TC discussion and opinion on this PR: #277 (comment)

@Aneurysm9 Aneurysm9 closed this Aug 23, 2023
@Aneurysm9 Aneurysm9 deleted the fix/lambdaSpanContext branch August 23, 2023 21:05
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

9 participants