FaaS: Change requirements regarding handling of AWS Lambda-provided `SpanContext` #272

Aneurysm9 · 2023-08-18T19:12:06Z

This change clarifies the requirements regarding the handling of the SpanContext vended by the AWS Lambda service to function invocations.

…`SpanContext Signed-off-by: Anthony J Mirabella <[email protected]>

Oberon00 · 2023-08-21T11:51:53Z

Please adapt the PR title, since this is more than "clarification"

jsuereth

Originally, I assumed the Link vs. Parent debate was one where the community wishes to move towards linking as the default, and we're providing configuration to aide in the transition for existing x-ray users.

This does not appear to be that direction, nor is this just a clarification of wording. Would like to hear from others in the FAAS group before allowing this to be merged.

jsuereth · 2023-08-21T14:15:21Z

docs/faas/aws-lambda.md

+[start options](https://github.com/open-telemetry/opentelemetry-specification/tree/v1.22.0/specification/trace/api.md#specifying-links) with an associated `String`-typed
+attribute with a key of `source` and a value of `x-ray-env`, in order to indicate the source of the linked span.
+
+Instrumentation MUST provide a mechanism for the user to select the behavior that they require it to utilize.  In the absence of an expressed user preference


This sentence flips the previous default setting. Is that intended / agreed upon by the FAAS SiG?

This is the current state of most Lambda instrumentation. It is preserving the functioning behavior for users who had functioning instrumentation while clarifying that it is permissible to have an alternate behavior. I feel that this is a clarification as the spec as written permits this behavior and that was not apparent to all in the discussion during the last FaaS SIG meeting.

@Aneurysm9 Java and the semconv docs are not inline with this, and we would be indeed flipping the current default in this repo.

As said before: having this as default as direct parent will work great with X-Ray but will break the other vendors.

As said before: having this as default as direct parent will work great with X-Ray but will break the other vendors.

I do not believe that this is a correct interpretation of the situation. Having the default behavior to be to use the value as a parent SpanContext will preserve the existing behavior for for current users in all languages but Java. It will not break any users. Users who require a different behavior will be able to select that behavior.

Java and the semconv docs are not inline with this, and we would be indeed flipping the current default in this repo.

Java implemented a spec change that was incomplete and acknowledged to require changes. This is the semconv repo and this change is fixing the "semconv docs". In any event, this is "experimental" specification and subject to change. Is there a special process for such changes I'm unaware of that I should be following?

jsuereth · 2023-08-21T14:15:56Z

docs/faas/aws-lambda.md

+attribute with a key of `source` and a value of `x-ray-env`, in order to indicate the source of the linked span.
+
+Instrumentation MUST provide a mechanism for the user to select the behavior that they require it to utilize.  In the absence of an expressed user preference
+instrumentation SHOULD utilize the extracted `Context` as the parent context for the `Invocation Span`.  Instrumentation MAY require that the user explicitly


Do we need to allow explicit configuration as a requirement? I'd prefer if we could all agree on a default behavior here.

Yes, explicit configuration is a requirement. The Lambda-provided context does different things for different people and they need a mechanism for communicating their intent to the instrumentation.

Aneurysm9 · 2023-08-21T14:53:27Z

Please adapt the PR title, since this is more than "clarification"

I believe, as noted above, that this is a clarification of the intent behind the initial attempt to resolve open-telemetry/opentelemetry-specification#3060.

If you disagree, please provide alternate titles that would comport with your understanding of this change.

aabmass · 2023-08-21T15:18:43Z

docs/faas/aws-lambda.md


-**Note**: When instrumenting a Java AWS Lambda, instrumentation SHOULD first try to parse an OpenTelemetry `Context` out of the system property `com.amazonaws.xray.traceHeader` using the [AWS X-Ray Propagator](https://github.com/open-telemetry/opentelemetry-specification/blob/main/specification/context/api-propagators.md) before checking and attempting to parse the environment variable above.
+* as the parent `Context`
+* as a `SpanContext` associated with a [Span Link][] added to the `Invocation Span`'s


If this SpanContext is used as a span link instead of the parent, it won't work with ParentBased samplers like the OTel default ParentBased(root=AlwaysOn).

Does AWS provide head sampling through this SpanContext?

Head sampling can be provided through this SpanContext. It is my understanding that that is irrelevant to the users who wish to have it associated as a link because they are not using X-Ray as their tracing backend and do not desire to have a trace including the Lambda-vended segments. I believe the expectation is that those users will obtain a SpanContext from another source, such as HTTP headers included in the invocation event, and would want to use that SpanContext for any sampling decisions.

Oberon00 · 2023-08-21T15:20:50Z

@Aneurysm9 Simplest adaption of PR title would be "FaaS: ~~Clarify~~ Change requirements regarding handling of AWS Lambda-provided SpanContext"

… into fix/lambdaSpanContext

Signed-off-by: Anthony J Mirabella <[email protected]>

carlosalberto

Temporarily blocking as these changes are actually breaking the current semconv documents, changing once again the primary role of the context in the AWS specific environment variable.

trask · 2023-08-21T17:56:30Z

Temporarily blocking as these changes are actually breaking the current semconv documents, changing once again the primary role of the context in the AWS specific environment variable.

@carlosalberto I don't personally agree with this change, but I also don't agree with blocking it for this reason, and I think we should let discussion continue to play out here and in open-telemetry/opentelemetry-specification#263

@Aneurysm9 my personal objection to this change is that it prioritizes the default experience for one monitoring backend (X-Ray) at the expense of the default experience for other monitoring backends, which doesn't feel in line with OpenTelemetry's vision. (btw I personally support an opt-in behavior to support X-Ray in this case)

tylerbenson · 2023-08-21T18:11:33Z

I agree with @trask's objection to this change. While I think there are better solutions I'm not opposed to adding a setting to prefer the x-ray env context over remote context as long as the default is to use the remote. By default, traces reported via OTLP should be complete, regardless of if x-ray is enabled or not.
The only case where I would support the inverse default is if AWS made it trivial to export those internal x-ray spans through an otlp/collector pipeline. I've made such a request but have not received any response.

Aneurysm9 · 2023-08-21T18:27:50Z

my personal objection to this change is that it prioritizes the default experience for one monitoring backend (X-Ray) at the expense of the default experience for other monitoring backends, which doesn't feel in line with OpenTelemetry's vision. (btw I personally support an opt-in behavior to support X-Ray in this case)

From the vision document for clarity:

With OpenTelemetry, we strive to provide a level playing field for all observability providers, avoid lock-in to any vendor, and interoperate with other OSS projects in the telemetry and observability ecosystem.

I believe that there is an expectation inherent in this objection that using the AWS Lambda vended spans as a parent context is only relevant to users of AWS X-Ray. This is not the case. It is possible to utilize the AWS X-Ray trace context propagation format without using AWS X-Ray as a trace backend. It is possible to do so in a way that would cause AWS Lambda to participate in that trace, vending segments to AWS X-Ray that can be obtained by the user and migrated to the trace backend of their choice.

There is nothing in this change that would result in any user being locked to any vendor. There is nothing in this change that would preclude interoperability with any other system.

The reasons I feel it is appropriate for the default behavior to be to utilize the AWS Lambda vended context as a parent are that 1) this is the current behavior for all AWS Lambda instrumentation with the exception of the Java instrumentation and 2) this is instrumentation of an AWS service and it is reasonable to utilize the service-provided context propagation mechanism as the default.

tylerbenson · 2023-08-21T18:35:09Z

@Aneurysm9 the issue here isn't the propagation mechanism. X-Ray propagation is fine and I even submitted spec changes to formalize that further. Our issue from the beginning has always been the forked context that passes through AWS services, creating spans that aren't reported in the same manner (to the same backend) as the rest of the spans within the trace. Such behavior results in broken traces which is unacceptable.

trask · 2023-08-21T19:23:53Z

I believe that there is an expectation inherent in this objection that using the AWS Lambda vended spans as a parent context is only relevant to users of AWS X-Ray. This is not the case. It is possible to utilize the AWS X-Ray trace context propagation format without using AWS X-Ray as a trace backend. It is possible to do so in a way that would cause AWS Lambda to participate in that trace, vending segments to AWS X-Ray that can be obtained by the user and migrated to the trace backend of their choice.

@Aneurysm9 I'd like to understand this better. Could you link to docs for how e.g. a Jaeger user would set this up? thx

Aneurysm9 · 2023-08-21T20:02:25Z

@Aneurysm9 the issue here isn't the propagation mechanism. X-Ray propagation is fine and I even submitted spec changes to formalize that further. Our issue from the beginning has always been the forked context that passes through AWS services, creating spans that aren't reported in the same manner (to the same backend) as the rest of the spans within the trace. Such behavior results in broken traces which is unacceptable.

Now I'm confused. Is the issue the propagation mechanism, or is it not the propagation mechanism? Are broken traces acceptable or not?

Is this situation (a distributed trace passes through a system that is not in control of the application owner, resulting in spans reported to a trace backend that is not in the control of the application owner) not common to all distributed tracing backends and instrumentation? This is such a common situation that the specification requires that APIs must accept "[t]he parent Context or an indication that the new Span should be a root Span" at span creation to allow for this.

I also do not believe that there has ever been a situation where users of tracing backends other than AWS X-Ray have been wholly unable to create complete traces for interactions involving AWS Lambda function invocations. To the extent that their preferred parent context is carried by the invocation event and they have access to a TracerProvider and their desired context propagator, they are able to extract that context from the event and create a new span using it as the parent context. I'll allow that this is probably not the ideal user experience, but it is inaccurate to state otherwise.

carlosalberto · 2023-08-21T20:06:24Z

(Removing my changes request as per @trask's feedback)

jsuereth · 2023-08-21T20:15:00Z

IIUC - The key user issue we are debating (and the one we're arguing to prioritize) is as follows:

Scenario 1

Service A uses OTEL and propagates W3c headers.
Service A makes a request through an AWS service where x-ray spans will be created, but also where the original OTEL span headers are propagated.
Service B (living in an AWS environment) wants to have tracing that leverages both the spans from AWS x-ray and Service A.

How should we attach traces to give users the best o11y here?

A few things I'm assuming:

AWS x-ray will create a new trace_id, either because of security boundaries or just how the product works. (Note GCP does the same for Cloud Run / Cloud Functions). This is the issue that we're fixing
Users want the ability to see the WHOLE trace, meaning Service A, AWS x-ray internal trace and Service B.

In this case, the ONLY way to not have a broken trace is to "link" against the AWS x-ray trace vs. parent against it. I believe this SHOULD be the default, including apply this to GCP services that have the same problem. This allows users to define their own "security boundaries" of what propagation they allow without being bound to what we, as cloud providers, require for trace_id generation.

Scenario 2

There are users who will leverage OOTB trace sampling and generation of their cloud provider. GCP, e.g. provides tracing and sampling of Cloud Run as a free option. Users may want to track their traces at this "entry point" level. This works when a trace is not going through one of these security / trace_id ownership boundaries between services for that user. I agree there should be a way for users to get what they need here.

The big question

Which scenario should be the default / more likely for users. It was my understanding that Scenario 1 would be the new default behavior, with Scenario 2 requiring user configuration, but achievable.

A few comments from the linked issue that lead to me raising concerns:

Instead of having this behavior where the x-ray propagation ignores the context from the parent if the environment variable is set, it should always use the context from the event as the parent and link to the environment variable context if present. I'm not sure what implications this would have on x-ray though.

Per discussion in the FAAS SIG, we decided that the aws x-ray environment variable should be moved to a span link to avoid interfering with the configured propagators.

This last one coming from this pr, which was approved and merged.

Aneurysm9 · 2023-08-21T20:18:37Z

I agree with @trask's objection to this change. While I think there are better solutions I'm not opposed to adding a setting to prefer the x-ray env context over remote context as long as the default is to use the remote.

The context provided by AWS Lambda in the execution environment is a remote context. The question is which of the remote contexts to use when multiple are present and is complicated by the fact that the specification recommends specific sources of context carriers be used. I have tried to minimize the change to the current specification here to avoid getting into issues of separation of responsibilities between instrumentation author, instrumentation user, instrumentation implementations, context propagators, and user applications. I'm not opposed to having that discussion again as long as we stop pushing changes that will break users for whom the instrumentation is presently functioning.

By default, traces reported via OTLP should be complete, regardless of if x-ray is enabled or not. The only case where I would support the inverse default is if AWS made it trivial to export those internal x-ray spans through an otlp/collector pipeline. I've made such a request but have not received any response.

I'm not sure where OTLP entered the picture here and this is not something that any instrumentation can guarantee precisely because distributed tracing involves interacting with systems outside the control of the instrumented system. Regardless which serialization format and network protocol is used to communicate trace data an incomplete trace will be experienced if one participant in the trace sends data to one backend and another participant sends data to a different backend.

I will let the AWS Lambda team handle your product feature request, but I will note here for posterity that it was made five days ago in a repository relevant to container images useful to AWS Lambda users who use the container packaging format for their functions. This is likely not the correct forum for such a product feature request.

trask · 2023-08-21T20:48:21Z

A few things I'm assuming:

AWS x-ray will create a new trace_id, either because of security boundaries or just how the product works. (Note GCP does the same for Cloud Run / Cloud Functions). This is the issue that we're fixing

just checking my maths, doesn't the same problem exist even if the same traceId is used, since you still have to pick one of the spanIds to parent?

jsuereth · 2023-08-21T20:58:53Z

@trask That's true. However if the same trace_id is used, I think it's less likely to cause broken traces, but would cause missing spans. Depends on the backend implemetnation at that point.

However, we're starting to sidestep the main point. Instead of repeating, I'll link my more succinct comment: #263 (comment)

tylerbenson · 2023-08-21T21:00:06Z

I will let the AWS Lambda team handle your product feature request, but I will note here for posterity that it was made five days ago in a repository relevant to container images useful to AWS Lambda users who use the container packaging format for their functions. This is likely not the correct forum for such a product feature request.

@Aneurysm9 I understand that is likely not the most relevant location for such but it was the place that @rapphil recommended I provide such feedback and it seemed like a reasonable place to start such a conversation in public. If you have a better forum for that conversation, let me know and I'll be happy to take that suggestion elsewhere.
As for the timing, I added it after the recent SIG meeting as an alternative solution that seems like the only possible avenue for a win-win at this point as you can't seem to understand why other vendors are not happy with the default you're demanding.

Aneurysm9 · 2023-08-21T21:27:51Z

as you can't seem to understand why other vendors are not happy with the default you're demanding.

Please do not ascribe "demand"s to me that I have not made and please do not make assertions about my mental state that you cannot know.

I have indicated that the suggestion on open-telemetry/opentelemetry-specification#263 does not work for me, for a variety of reasons that have since been elaborated, and I have suggested an alternative.

pyohannes · 2023-08-22T12:25:50Z

Service A uses OTEL and propagates W3c headers.

Service A makes a request through an AWS service where x-ray spans will be created, but also where the original OTEL span headers are propagated.

Service B (living in an AWS environment) wants to have tracing that leverages both the spans from AWS x-ray and Service A.

For the case where the AWS service in question is a messaging system (SNS, SQS), this maps to scenarios extensively discussed in the messaging SIG, results of this discussion are captured in OTEP 220.

TLDR; We always recommend to create a link between Service A and Service B, which denotes a producer/consumer relationship. Depending on variations of the scenarios (batch-processing, or broker instrumentation) there can or cannot be links between A and B, or between A, B, and the broker. However, the link between A and B is an invariant for all scenarios.

I'm not familiar with all details of this discussion, so I might be a bit off, however, it seems to me that it is closely related to issue #652. In my opinion, the question of parent vs link is not so much related to propagation as it is related to the question of how traces should look like for the user.

Oberon00 · 2023-08-22T15:33:03Z

. Our issue from the beginning has always been the forked context that passes through AWS services, creating spans that aren't reported in the same manner (to the same backend) as the rest of the spans within the trace. Such behavior results in broken traces which is unacceptable.

@tylerbenson While we have had similar troubles with this, it is not really an AWS problem, but a problem that in general will be even more widespread when cloud providers adapt W3C TraceContext headers. Then you won't even be able to switch to W3C propagation as a workaround. You can observe this when trying to monitor google cloud functions. Google, fully W3C conformant, participates in the trace an will thus override your span ID. Multi-tenancy is simply a missing feature from OTel and also seems to be somehow missed in the W3C spec (for a potential solution see open-telemetry/opentelemetry-specification#366 (comment))

tylerbenson · 2023-08-22T18:07:58Z

@Oberon00 I agree and for that reason I suggest we work with cloud providers for easier ways to get access to those internal spans. Last week I submitted an issue on AWS's side: aws/aws-lambda-base-images#111 to see if I could get any reaction from them that doesn't involve just polling their API.

jsuereth · 2023-08-23T19:32:10Z

FYI - TC discussion and opinion on this PR: #277 (comment)

Aneurysm9 requested review from a team August 18, 2023 19:12

github-actions bot assigned reyang Aug 18, 2023

FaaS: Clarify requirements regarding handling of AWS Lambda-provided …

07435bc

…`SpanContext Signed-off-by: Anthony J Mirabella <[email protected]>

Aneurysm9 force-pushed the fix/lambdaSpanContext branch from 15384ec to 07435bc Compare August 18, 2023 19:12

Aneurysm9 mentioned this pull request Aug 18, 2023

Faas: Add an option to use the X-Ray env var as main parent. #263

Closed

jsuereth reviewed Aug 21, 2023

View reviewed changes

aabmass reviewed Aug 21, 2023

View reviewed changes

Aneurysm9 added 2 commits August 21, 2023 11:26

Merge branch 'main' of github.com:open-telemetry/semantic-conventions…

4304636

… into fix/lambdaSpanContext

update changelog to reflect requested change to change description

fc59e27

Signed-off-by: Anthony J Mirabella <[email protected]>

Aneurysm9 changed the title ~~FaaS: Clarify requirements regarding handling of AWS Lambda-provided SpanContext~~ FaaS: Change requirements regarding handling of AWS Lambda-provided SpanContext Aug 21, 2023

carlosalberto requested changes Aug 21, 2023

View reviewed changes

carlosalberto self-requested a review August 21, 2023 20:06

Aneurysm9 mentioned this pull request Aug 21, 2023

FaaS: Change requirements regarding handling of AWS Lambda-provided SpanContext (take 2) #277

Closed

Aneurysm9 closed this Aug 23, 2023

Aneurysm9 deleted the fix/lambdaSpanContext branch August 23, 2023 21:05

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

FaaS: Change requirements regarding handling of AWS Lambda-provided `SpanContext` #272

FaaS: Change requirements regarding handling of AWS Lambda-provided `SpanContext` #272

Aneurysm9 commented Aug 18, 2023

Oberon00 commented Aug 21, 2023

jsuereth left a comment

jsuereth Aug 21, 2023

Aneurysm9 Aug 21, 2023

carlosalberto Aug 21, 2023

Aneurysm9 Aug 21, 2023

jsuereth Aug 21, 2023

Aneurysm9 Aug 21, 2023

Aneurysm9 commented Aug 21, 2023

aabmass Aug 21, 2023

Aneurysm9 Aug 21, 2023

Oberon00 commented Aug 21, 2023

carlosalberto left a comment

trask commented Aug 21, 2023

tylerbenson commented Aug 21, 2023 •

edited

Loading

Aneurysm9 commented Aug 21, 2023

tylerbenson commented Aug 21, 2023

trask commented Aug 21, 2023

Aneurysm9 commented Aug 21, 2023

carlosalberto commented Aug 21, 2023

jsuereth commented Aug 21, 2023

Aneurysm9 commented Aug 21, 2023

trask commented Aug 21, 2023

jsuereth commented Aug 21, 2023

tylerbenson commented Aug 21, 2023

Aneurysm9 commented Aug 21, 2023

pyohannes commented Aug 22, 2023

Oberon00 commented Aug 22, 2023

tylerbenson commented Aug 22, 2023

jsuereth commented Aug 23, 2023

FaaS: Change requirements regarding handling of AWS Lambda-provided SpanContext #272

FaaS: Change requirements regarding handling of AWS Lambda-provided SpanContext #272

Conversation

Aneurysm9 commented Aug 18, 2023

Oberon00 commented Aug 21, 2023

jsuereth left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Aneurysm9 commented Aug 21, 2023

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Oberon00 commented Aug 21, 2023

carlosalberto left a comment

Choose a reason for hiding this comment

trask commented Aug 21, 2023

tylerbenson commented Aug 21, 2023 • edited Loading

Aneurysm9 commented Aug 21, 2023

tylerbenson commented Aug 21, 2023

trask commented Aug 21, 2023

Aneurysm9 commented Aug 21, 2023

carlosalberto commented Aug 21, 2023

jsuereth commented Aug 21, 2023

Scenario 1

Scenario 2

The big question

Aneurysm9 commented Aug 21, 2023

trask commented Aug 21, 2023

jsuereth commented Aug 21, 2023

tylerbenson commented Aug 21, 2023

Aneurysm9 commented Aug 21, 2023

pyohannes commented Aug 22, 2023

Oberon00 commented Aug 22, 2023

tylerbenson commented Aug 22, 2023

jsuereth commented Aug 23, 2023

FaaS: Change requirements regarding handling of AWS Lambda-provided `SpanContext` #272

FaaS: Change requirements regarding handling of AWS Lambda-provided `SpanContext` #272

tylerbenson commented Aug 21, 2023 •

edited

Loading