Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Propagation of Tracing Context to AWS SDK SqsAsyncClient #262

Open
sondemar opened this issue Jul 4, 2024 · 5 comments
Open

Propagation of Tracing Context to AWS SDK SqsAsyncClient #262

sondemar opened this issue Jul 4, 2024 · 5 comments
Labels

Comments

@sondemar
Copy link

sondemar commented Jul 4, 2024

Hi, I am trying to propagate the tracing context with the Micrometer Observation API while using the AWS SDK SqsAsyncClient, which operates on the Netty Event Loop Model.

Even though I have registered ObservationThreadLocalAccessor and ObservationAwareSpanThreadLocalAccessor with ContextRegistry, and instrumented executors (configuration details of SqsAsyncClient I described in this discussion):

executor.setTaskDecorator(new ContextPropagatingTaskDecorator());

I am still unable to access the currently opened scope because the executors are invoked from the Netty EventLoop thread.

I am seeking any possible workaround until the issue is resolved.

The only solution I can think of involves using an implementation of ContextAccessor and a custom ObservationContextHolder:

executor.setTaskDecorator(runnable -> factory.captureAll(ObservationContextHolder.storedValues()).wrap(runnable));

where ObservationContextHolder properly stores values for the keys ObservationThreadLocalAccessor.KEY and ObservationAwareSpanThreadLocalAccessor.KEY.

However, this solution is not thread-safe.

Could you please advise on a proper workaround or solution?

@jonatan-ivanov
Copy link
Member

For the sake of completeness, there is also awspring/spring-cloud-aws#646 and netty/netty#8546. What you can do as a user is adding a 👍🏼 on the issue description and a comment that you need this. (I saw your comments on the spring-cloud-aws issue. 👍🏼)

Reactor Netty is instrumented, as far as I know it has support on the event loop/network level, maybe that's something you can enable and reuse?

Also, as far as I can understand, if you want to instrument SQS, you might be on the wrong level for that (network/http level). If you instrument the event loop and add tracing information to the HTTP request that you send to AWS, I don't think that information will be propagated to the client. I think what you should do instead is adding tracing information to the SQS message header (that will be sent to AWS over HTTP) and those are sent to the client when the SQS message is delivered. So instead of instrumenting the low-level HTTP client you should (wrap? and) instrument the SQS client itself.
Does this make sense?

Don't get me wrong, instrumenting Netty could be also useful but I think that's step two and you might not need to do it eventually.

@jonatan-ivanov jonatan-ivanov added waiting for feedback question Further information is requested labels Jul 4, 2024
@sondemar
Copy link
Author

sondemar commented Jul 5, 2024

For the sake of completeness, there is also awspring/spring-cloud-aws#646 and netty/netty#8546. What you can do as a user is adding a 👍🏼 on the issue description and a comment that you need this. (I saw your comments on the spring-cloud-aws issue. 👍🏼)

I am already engaged in awspring/spring-cloud-aws#646 with this PR.

Reactor Netty is instrumented, as far as I know, it has support on the event loop/network level, maybe that's something you can enable and reuse.

In the discussion, I pointed out that the AWS SDK project manages the Reactor API under the hood and only exposes the SqsAsyncClient based on asynchronous CompletableFuture.

If you instrument the event loop and add tracing information to the HTTP request that you send to AWS, I don't think that information will be propagated to the client. I think what you should do instead is add tracing information to the SQS message header (that will be sent to AWS over HTTP) and those are sent to the client when the SQS message is delivered. So instead of instrumenting the low-level HTTP client you should (wrap? and) instrument the SQS client itself.
Does this make sense?

I am already propagating tracing information to SQS with message headers (by using a custom version of SenderContext), but the key is that I would also like to maintain observability of the entire process of Spring Cloud AWS API invocation (including tracing, logging, and metrics).

Copy link

If you would like us to look at this issue, please provide the requested information. If the information is not provided within the next 7 days this issue will be closed.

@chemicL
Copy link
Collaborator

chemicL commented Jul 15, 2024

Hey, @sondemar ! I've been thinking a bit about your case and here are a few thoughts:

It sounds like you assume the underlying AWS implementation always uses the Executor wrapper you provide for all the CompletableFuture/CompletionStage processing. I don’t know whether that’s common.

Also, I can’t understand the ObservationContextHolder workaround - how would this info be properly populated and the ContextPropagatingTaskDecorator would not do its job? Can you explain?

As far as I understand, the usage is that at some point, within an Observation’s scope, you use the client. Then the client delegates to Netty and the ThreadLocal values might be gone. They might also not be gone if the processing reuses the same thread (!). Or even worse, if the current thread is utilized by Netty to do some other processing in the meantime. Then, when Netty delivers some signal and you execute the Completable* stack with the intended Executor wrapper you expect some TLs to be in the context. That can only work if there’s a carrier for that, like we have in Reactor - Context from which the TLs get restored.

In Reactor-Netty there is a way to restore TLs for every single task that Netty executes, but it's not recommended due to performance reasons. It could work for debugging purposes, though. The reason that can happen is because Reactor-Netty attaches the Subscriber's Context to the Channel when working in a continuation-style manner so it can be restored later for the user's chain.

@sondemar
Copy link
Author

Hi @chemicL ,

Thank you for your comprehensive answer!

Regarding your thoughts and questions:

It sounds like you assume the underlying AWS implementation always uses the Executor wrapper you provide for all the CompletableFuture/CompletionStage processing. I don’t know whether that’s common.

I believe it's not common, but possible, as mentioned in the documentation. This executor seems to be executed by threads used by EventLoopGroup.

Also, I can’t understand the ObservationContextHolder workaround - how would this info be properly populated and the ContextPropagatingTaskDecorator would not do its job? Can you explain?

I wanted to incorporate the concept of the Reactor context view, as mentioned in this article. In my case, ObservationContextHolder would maintain a pseudo-context managed by a dedicated ContextAccessor, inspired by ReactorContextAccessor.

As far as I understand, the usage is that at some point, within an Observation’s scope, you use the client. Then the client delegates to Netty and the ThreadLocal values might be gone

This is exactly the issue—I lose the Observation scope stored in the ThreadLocal at this point.

In Reactor-Netty there is a way to restore TLs for every single task that Netty executes, but it's not recommended due to performance reasons. It could work for debugging purposes, though. The reason that can happen is because Reactor-Netty attaches the Subscriber's Context to the Channel when working in a continuation-style manner so it can be restored later for the user's chain.

I believe this refers to ReactorContextAccessor, which I mentioned above.

Unfortunately, I do not see a straightforward solution. The last idea that comes to mind is to create a ContextSnapshot and attach it to ExecutionAttributes. This can then be read with a dedicated ExecutionInterceptor and used to populate the ThreadLocal.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

No branches or pull requests

4 participants