Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Kotlin] Incorrectly linked spans, incorrect durations when Coroutine Dispatchers are used #6502

Open
andrewloux opened this issue Aug 23, 2022 · 9 comments
Labels
bug Something isn't working repro provided

Comments

@andrewloux
Copy link

Describe the bug

When using the Java Agent & annotation driven auto-instrumentation with Kotlin - and the Coroutine Dispatcher is used (to dispatch some Coroutines outside of the main thread); we see broken traces consistently. Here's how the traces appear to be broken:

  1. Functions within a Dispatched block appear as their own traces - their spans untethered to their direct parent or calling method (but sometimes still tethered to the parent's parent).
  2. As a consequence of (1), you see two spans for the same method; when there was only one execution of it.
  3. Methods that use Dispatch blocks have durations that are inaccurately reported - they appear to finish much quicker than the method actually does.

Steps to reproduce

A minimal reproducible example below.

  companion object {
    private val log = logger {}

    fun runit() {
      runBlocking {
        log.info { "Starting the example" }
        breakingSpans()
      }
    }

    @JvmStatic
    fun main(args: Array<String>) {
      runit()
    }

    @WithSpan
    suspend fun breakingSpans() {
      kotlin.runCatching {
        blockingWork()
        usingIOPool()
      }
    }

    @WithSpan
    suspend fun blockingWork() {
      delay(2000L)
    }

    @WithSpan
    suspend fun usingIOPool() {
      return withContext(Dispatchers.IO) {
        fetchUpstreamResource()
        log.info { "Finished getting upstream resource" }
      }
    }

    @WithSpan
    suspend fun fetchUpstreamResource(){
      delay(2000L)
    }
  }

What did you expect to see?

  • I expect to see a single breakingSpans trace per run.
  • I expect to see fetchUpstreamResource, usingIOPool and blockingWork in the same trace.
  • I expect the durations of these methods to represent the time it took for them to complete - which is at least 2 seconds.

What did you see instead?

  • I see multiple (3) traces from breakingSpans
    • One with breakingSpans -> blockingWork (surprisingly and not the focus of this post since this is not Dispatched work but blockingWork's duration is also inaccurate)
    • One with breakingSpans -> usingIOPool (but duration of usingIOPool is inaccurate)
    • One with just fetchUpstreamResource (but duration of fetchUpstreamResource is inaccurate)

Some pictures below to see how this looks; but also see the the Logs all the way at the end.

image

image

image

What version are you using?

opentelemetry-javaagent - version: 1.15.0

Environment
Compiler: Azul (OpenJDK) 16.0.2
OS: MacOS Monterey

Additional context

The logs are quite interesting as well, and show the 3 traces being emitted:

        io.plutocard.ServerApp [req.id=null]  Starting blockingWork 
[otel.javaagent 2022-08-23 11:19:16:580 -0400] [main] INFO io.opentelemetry.exporter.logging.LoggingSpanExporter - 'Companion.blockingWork' : a67c5bfad57eb68d79dd239ea51dc0ea f0c9ea759b7231d3 INTERNAL [tracer: io.opentelemetry.opentelemetry-annotations-1.0:1.15.0-alpha] AttributesMap{data={thread.name=main, thread.id=1}, capacity=128, totalAddedValues=2}
[otel.javaagent 2022-08-23 11:19:16:580 -0400] [main] INFO io.opentelemetry.exporter.logging.LoggingSpanExporter - 'Companion.breakingSpans' : a67c5bfad57eb68d79dd239ea51dc0ea 09e8b492d7e466b7 INTERNAL [tracer: io.opentelemetry.opentelemetry-annotations-1.0:1.15.0-alpha] AttributesMap{data={thread.name=main, thread.id=1}, capacity=128, totalAddedValues=2}
2022-08-23 11:19:18.586 [main] INFO 
        io.plutocard.ServerApp [req.id=null]  Finished blockingWork 
2022-08-23 11:19:18.588 [main] INFO 
        io.plutocard.ServerApp [req.id=null] Started usingIOPool
[otel.javaagent 2022-08-23 11:19:18:587 -0400] [main] INFO io.opentelemetry.exporter.logging.LoggingSpanExporter - 'Companion.blockingWork' : f9518af01d2d50cfa01f90844baca7f7 2ec970a1fc21956d INTERNAL [tracer: io.opentelemetry.opentelemetry-annotations-1.0:1.15.0-alpha] AttributesMap{data={thread.name=main, thread.id=1}, capacity=128, totalAddedValues=2}
[otel.javaagent 2022-08-23 11:19:18:603 -0400] [main] INFO io.opentelemetry.exporter.logging.LoggingSpanExporter - 'Companion.usingIOPool' : d404221fd5b24595bda24fde15f1b780 e3843e87db13b61b INTERNAL [tracer: io.opentelemetry.opentelemetry-annotations-1.0:1.15.0-alpha] AttributesMap{data={thread.name=main, thread.id=1}, capacity=128, totalAddedValues=2}
[otel.javaagent 2022-08-23 11:19:18:603 -0400] [main] INFO io.opentelemetry.exporter.logging.LoggingSpanExporter - 'Companion.breakingSpans' : d404221fd5b24595bda24fde15f1b780 f2e62b9951c48a0f INTERNAL [tracer: io.opentelemetry.opentelemetry-annotations-1.0:1.15.0-alpha] AttributesMap{data={thread.name=main, thread.id=1}, capacity=128, totalAddedValues=2}
[otel.javaagent 2022-08-23 11:19:18:605 -0400] [DefaultDispatcher-worker-1] INFO io.opentelemetry.exporter.logging.LoggingSpanExporter - 'Companion.fetchUpstreamResource' : ef65b66b1081ac297646d5e0cd129ad4 cb76a767dbc0ef94 INTERNAL [tracer: io.opentelemetry.opentelemetry-annotations-1.0:1.15.0-alpha] AttributesMap{data={thread.name=DefaultDispatcher-worker-1, thread.id=20}, capacity=128, totalAddedValues=2}
2022-08-23 11:19:18.604 [DefaultDispatcher-worker-1] INFO 
        io.plutocard.ServerApp [req.id=null] Started fetchUpstreamResource
2022-08-23 11:19:20.611 [DefaultDispatcher-worker-1] INFO 
        io.plutocard.ServerApp [req.id=null] Finished fetchUpstreamResource
[otel.javaagent 2022-08-23 11:19:20:611 -0400] [DefaultDispatcher-worker-1] INFO io.opentelemetry.exporter.logging.LoggingSpanExporter - 'Companion.fetchUpstreamResource' : 0461ccfc74e6ae8b7d7cf1522c4d5f2c 2e4d70998f21930a INTERNAL [tracer: io.opentelemetry.opentelemetry-annotations-1.0:1.15.0-alpha] AttributesMap{data={thread.name=DefaultDispatcher-worker-1, thread.id=20}, capacity=128, totalAddedValues=2}
[otel.javaagent 2022-08-23 11:19:20:611 -0400] [main] INFO io.opentelemetry.exporter.logging.LoggingSpanExporter - 'Companion.breakingSpans' : 96cd29c74e0d1e99db3ab94876aa8797 faedee63fcaa99da INTERNAL [tracer: io.opentelemetry.opentelemetry-annotations-1.0:1.15.0-alpha] AttributesMap{data={thread.name=main, thread.id=1}, capacity=128, totalAddedValues=2}
@andrewloux andrewloux added the bug Something isn't working label Aug 23, 2022
@andrewloux
Copy link
Author

cc @trask since I saw you looking a similar Kotlin / broken traces issue when I was spelunking through the issues, trying to find something similar.

@trask
Copy link
Member

trask commented Sep 7, 2022

hi @andrewlouis93! since folks here are not primarily Kotlin devs, could you create a standalone repository with your repro, along with instructions how to build/run it from the command line? I think that would have the most luck of getting someone to try to dig in to see what's going on.

@andrewloux
Copy link
Author

@trask Done, here you go: https://github.com/andrewlouis93/otel-auto-sandbox

Keep me posted if y'all run into issues running it - I don't anticipate any problems.

@andrewloux
Copy link
Author

@trask Just wanted to share something I found that may be helpful;

The Apache Skywalking autoinstrumentation was also similarly broken with suspending code - see here (particularly the Run without Plugin traces) https://skywalking.apache.org/docs/skywalking-java/v8.10.0/en/setup/service-agent/java-agent/agent-optional-plugins/kotlin-coroutine-plugin/

The fix was shipped as an optional plugin: apache/skywalking#4165 (comment); perhaps we can draw inspiration from their implementation principle

@laurit
Copy link
Contributor

laurit commented Sep 15, 2022

@andrewlouis93 the way I see it there are 2 problems. Firstly using withContext breaks context propagation, I'll try to fix this. The rest of the problems happen because @WithSpan isn't currently suitable for kotlin coroutines. You are seeing multiple traces from breakingSpans because breakingSpans really is called 3 times (I guess this multiple call thing happens for suspend methods that call multiple other suspend methods?). The timings are off because these methods return before the delay completes. The bytecode modifications done for @WithSpan are not aware of the magic that kotlin is doing underneath. You should be able to overcome this by directly using the instrumenter or tracer api to create spans. If you write your tracing code in kotlin the compiler will do the correct thing. See https://techlab.bol.com/en/blog/measuring-execution-time-in-kotlin-coroutines for ideas how you could build your tracing api.

@andrewloux
Copy link
Author

Firstly using withContext breaks context propagation, I'll try to fix this

This would be brilliant, if this was intact - we would at least be able to build a reliable API. The link is interesting as well - thank you for sharing. I came across some other similar tracing specific abstractions as well. - manual tracing was still unreliable where withContext was being used.

I created this ticket because it did seem like it could be solved in the OTEL Java Agent so autoinstrumentation could actually work for coroutines based how Apache Skywalking patched their own Java Agent (I linked over here)

But yes - at the very minimum if we could build a reliable tracing API, we'd have a path forwards.

@andrewloux
Copy link
Author

@laurit Hello, wondering if you were able to give it a stab at fixing withContext ? If you have some thoughts about implementation, and need a hand - would be happy to lend one

@mpeyper
Copy link

mpeyper commented Nov 16, 2022

I'm also interested to hear if there has been any traction on this?

@laurit
Copy link
Contributor

laurit commented Nov 30, 2022

@andrewlouis93 sorry for the late reply. If I remember correctly after further testing I realized that there seemed to be a problem with withContext only when @WithSpan was used. So I gave up on this as @WithSpan in coroutines does not work correctly.
If you have use cases that don't use @WithSpan that don't work correctly test them with #7341 and see if it fixes these.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working repro provided
Projects
None yet
Development

No branches or pull requests

4 participants