
Cats 3.x support #1127

Open · cmcmteixeira opened this issue Feb 25, 2022 · 9 comments

@cmcmteixeira commented Feb 25, 2022

This issue describes the ongoing work here. The PR is obviously not yet mergeable (too many comments, printlns and debug code)

As previously discussed on Discord, I attempted to instrument the internals of cats-effect 3.x.

This led me on a bit of a journey through the internals of cats-effect, which I have yet to fully understand.

My findings should therefore be taken with a pinch of 🧂

### Instrumenting the runLoop

This was my initial approach to the problem.

Initially it seemed to work 👍 but I later found that it suffers from the same issues as the solutions below.

It also has the downside of inserting a Kamon.storeContext / Kamon.currentContext call on every enter/exit of the loop, which I fear is quite taxing performance-wise.
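
For illustration, the shape of this approach is roughly the following. This is a sketch only: the HasContext mixin, the method names, and the Kanela/ByteBuddy wiring are assumptions, not the actual PR code.

```scala
import kamon.Kamon
import kamon.context.Context
import net.bytebuddy.asm.Advice

// Sketch: a mixin the instrumentation would add to IOFiber, giving each
// fiber a mutable slot for the Kamon context it should run under.
trait HasContext {
  @volatile private var _context: Context = Context.Empty
  def context: Context = _context
  def setContext(ctx: Context): Unit = _context = ctx
}

// Advice applied around IOFiber's runLoop (method name is illustrative).
object RunLoopAdvice {

  @Advice.OnMethodEnter
  def enter(@Advice.This fiber: HasContext): Unit =
    // on entry, make the fiber's stashed context current on this thread
    Kamon.storeContext(fiber.context)

  @Advice.OnMethodExit
  def exit(@Advice.This fiber: HasContext): Unit =
    // on exit, capture whatever the loop left on the thread back into
    // the fiber, so it survives a hop to another thread
    fiber.setContext(Kamon.currentContext())
}
```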

### Instrumenting run() and "forks"

This methodology proved to work a bit better. It also avoids the previous performance downside altogether, since most of the calls happen around the run method rather than the runLoop.

Unfortunately, it doesn't work 100% of the time.
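
A sketch of the shape of this approach, reusing the HasContext mixin from the sketch above. The scheduling method and argument position are illustrative guesses, not the actual PR code.

```scala
import kamon.Kamon
import net.bytebuddy.asm.Advice

// Advice for run(): restore the fiber's context on entry, save it back on exit.
object RunAdvice {

  @Advice.OnMethodEnter
  def enter(@Advice.This fiber: HasContext): Unit =
    Kamon.storeContext(fiber.context)

  @Advice.OnMethodExit
  def exit(@Advice.This fiber: HasContext): Unit =
    fiber.setContext(Kamon.currentContext())
}

// Advice for the "fork" points, i.e. wherever a fiber is handed off to an
// executor: capture the scheduling thread's context into the fiber.
object ScheduleFiberAdvice {

  @Advice.OnMethodEnter
  def enter(@Advice.Argument(0) fiber: AnyRef): Unit =
    fiber match {
      case hc: HasContext => hc.setContext(Kamon.currentContext())
      case _              => ()
    }
}
```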

### Current Problem

This brings us to the current state of affairs and the problem at hand.

Running the tests in the kamon.instrumentation.futures.cats.CatsIoInstrumentationSpec will show something very similar to this on the console:

```
run(enter)     | cFiber: 1696169892 | sFiber: NA         | FiberCtx: {}              | ThreadCtx {}              | Thread pool-4-thread-1
runLoop(Enter) | cFiber: 1696169892 | sFiber: NA         | FiberCtx: {}              | ThreadCtx {}              | Thread pool-4-thread-1
Context will be set
ScheduleNew    | cFiber: 1696169892 | sFiber: 1696169892 | FiberCtx: {}              | ThreadCtx {key=value}     | Thread pool-4-thread-1
runLoop(Exit)  | cFiber: 1696169892 | sFiber: NA         | FiberCtx: {key=value}     | ThreadCtx {key=value}     | Thread pool-4-thread-1
run(enter)     | cFiber: 1696169892 | sFiber: NA         | FiberCtx: {key=value}     | ThreadCtx {}              | Thread scala-execution-context-global-33
run(exit)      | cFiber: 1696169892 | sFiber: NA         | FiberCtx: {key=value}     | ThreadCtx {}              | Thread pool-4-thread-1
runLoop(Enter) | cFiber: 1696169892 | sFiber: NA         | FiberCtx: {key=value}     | ThreadCtx {key=value}     | Thread scala-execution-context-global-33
runLoop(Exit)  | cFiber: 1696169892 | sFiber: NA         | FiberCtx: {}              | ThreadCtx {key=value}     | Thread scala-execution-context-global-33
ScheduleNew    | cFiber: 1696169892 | sFiber: 1696169892 | FiberCtx: {}              | ThreadCtx {}              | Thread pool-6-thread-1
run(exit)      | cFiber: 1696169892 | sFiber: NA         | FiberCtx: {}              | ThreadCtx {key=value}     | Thread scala-execution-context-global-33
```

Though cryptic at first, the above provides the following information:

  • Column 1: the action/method running
  • Column 2: the current fiber id (cFiber)
  • Column 3: the fiber id that will be scheduled (sFiber)
  • Column 4: the context stored in the fiber (FiberCtx)
  • Column 5: the context stored in the current thread (ThreadCtx)
  • Column 6: the current thread

There is also a line, Context will be set, which is printed just before setting the current ctx with the tag key=value.

After that line we can see that the thread now holds the context (great), but the fiber does not. This changes, however, because whenever a fiber is (re)scheduled we capture the current context and set it on the fiber. The code does exactly that:

println(s"ScheduleNew  ....")
val currentCtx = Kamon.currentContext()
setIfNotEmpty(fiber.asInstanceOf[HasContext])(currentCtx)
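
(setIfNotEmpty is presumably shaped roughly like the sketch below; the exact PR code may differ. It anticipates the "Possible Solution" section, since the guard is exactly the empty-check discussed there.)

```scala
import kamon.context.Context

// Sketch of the helper used above: only overwrite the fiber's context when
// the captured context actually carries something.
def setIfNotEmpty(fiber: HasContext)(ctx: Context): Unit =
  if (!ctx.isEmpty()) fiber.setContext(ctx)
```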

Immediately after, we can see that when leaving the runLoop the fiber now holds the correct context (this is because the scheduled fiber and the running fiber are the same, so setting the context on the fiber to be scheduled also sets it on the current fiber).

So far so good! The problem starts on the penultimate line.

We can see a ScheduleNew being run on Thread pool-6-thread-1.

This is the scheduler thread, not one from "our program", so it makes sense for its ctx to be empty. However correct that is, it unfortunately causes an issue with our instrumentation, since ScheduleNew takes the (empty) ctx from the thread and stores it in the fiber, effectively erasing the fiber's context. From that moment on, we lose the ctx!

### Possible Solution
One possible solution, and the one currently in the PR, is to set the ctx only if the context to be set is not empty.

This seems to address the issues listed above, since the scheduler thread is unlikely to ever have a context; it does, however, raise two main problems:

  1. It can potentially create unforeseen issues.
  2. We can no longer clear the current ctx with an empty one. In the test provided we can see that even after setting it to empty, the subsequent steps still "have" the original ctx.
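
To make problem 2 concrete: with the guard in place, a cleanup step like the following (hypothetical) one becomes a no-op across thread hops.

```scala
// The empty context is stored on the current thread, but the empty-check
// means it is never copied into the fiber, so after the next hop the old
// context reappears.
IO.delay(Kamon.storeContext(Context.Empty))
```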

Additional note: The solution above instruments some private members of the cats-effect implementation which may be subject to change.

@ivantopo (Contributor) commented Mar 3, 2022

Hey @cmcmteixeira, thanks a lot for the detailed feedback here! I read it once and definitely need to read it a couple more times to really digest what is going on.

As I was going through, this part of the tests got my attention:

```scala
for {
  _             <- IO.delay(Kamon.storeContext(Context.Empty))
  _             <- IO.delay(Kamon.storeContext(context))
  _             <- Spawn[IO].evalOn(IO.sleep(0.seconds), anotherExecutionContext)
  afterCleaning <- IO.delay(Kamon.currentContext())
} yield afterCleaning
```

I'm almost sure that we should never IO.delay storing a context. There is no guarantee that anything will ever clear that context, or that execution will ever come back to the exact same thread before anything else runs, so that it can be cleared. Even though there might be things to improve in the instrumentation, I think this test is not appropriate. I'll let you know when I get an idea of what to do here.
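
For what it's worth, one way to at least guarantee cleanup (a sketch, not something the PR does) is to tie the stored context to a bracket, so the Scope returned by Kamon.storeContext is always closed:

```scala
import cats.effect.IO
import kamon.Kamon
import kamon.context.Context

// Sketch: scope a stored context to a single IO, so the Scope is closed even
// if `io` fails. Note that after a thread hop the release may run on a
// different thread than the acquire, which is part of the problem at hand.
def withStoredContext[A](ctx: Context)(io: IO[A]): IO[A] =
  IO.delay(Kamon.storeContext(ctx))
    .bracket(_ => io)(scope => IO.delay(scope.close()))
```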

@cmcmteixeira (Author) commented Mar 3, 2022

we should never IO.delay storing a context

We actually have to, and it's already done in the http4s instrumentation.

There is no guarantee that anything will ever clear that context,

That's probably OK, since before running/continuing, the Runnable (aka the IOFiber) sets the current context based on the value it stores.

On top of that, we can also clean it whenever we "exit" the run method in SaveCurrentContextOnExit, if need be.


All that being said, the test was wrong (I must have messed it up when cleaning things up for pushing). It should make a bit more sense now: I've just pushed a new commit, and I've also commented out the printlns since they were making the test results hard to read.

@hughsimpson (Contributor)
I've tried the PR out. Generally it works pretty well; the only things that strike me are that:

  • sometimes the context escapes (in particular, reacting to a redis pubsub event runs in a random trace if I don't manually instrument it; I guess this would apply to anything that doesn't have an instrumentation module). Probably this is related to the fact that the context isn't cleaned?
  • everything is very flat... I would've expected flatMapping to create a new span as a child of the previous one, rather than as a sibling. Not sure which behaviour is more correct, though.

Overall definitely much better than no ce3 support though!

@hughsimpson (Contributor) commented Mar 17, 2022

Out of curiosity @cmcmteixeira, have you raised anything in the Typelevel Discord? I imagine you could attract some help from there...

Edit: just found your thread from 22nd Feb. Alas, I guess we're on our own 😂

@cmcmteixeira (Author)
  • Regarding the "context escaping": probably, yes! Since we don't clean the context on exiting the run loop, if the thread is reused the stale context may leak into whatever runs next on it.
  • Not sure I understand what you mean by things being flat? I even have a test where I check the parent/child relationship.

As for the cats-effect side of things, I haven't really asked. My previous question was related, but not quite the same. I simply haven't had the time to deal with that yet, unfortunately.

@hughsimpson (Contributor)
On the flatness I'm probably off-base, tbh. I'm using fs2 streams in some places, which is also where I'm seeing this, so I'll have a look at that in the next month or so; it probably needs its own handling.

@m50d (Contributor) commented Aug 10, 2022

Hi, I took a look at this and I think I can see a cleaner way to do it, which passes this test while clearing the context on exit. Essentially (a rough sketch follows the list):

  • Continue stashing the current context in the fiber on creation.
  • Restore the fiber's context as the current context on entering run() or restore().
  • Stash the span for that context into a ThreadLocal, and close/clear it on exiting run() or suspend().
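
Roughly what I have in mind is the sketch below. It reuses the HasContext mixin sketched earlier in the thread, and stashes the Kamon Storage.Scope (rather than the span itself) in the ThreadLocal; all names are illustrative.

```scala
import kamon.Kamon
import kamon.context.Storage

object FiberRunAdvice {
  // the scope opened when a fiber resumes, stashed per-thread
  private val currentScope = new ThreadLocal[Storage.Scope]

  // entering run()/restore(): make the fiber's stashed context current
  def onResume(fiber: HasContext): Unit =
    currentScope.set(Kamon.storeContext(fiber.context))

  // exiting run()/suspend(): close the scope, restoring whatever context
  // the thread had before this fiber ran
  def onSuspend(): Unit = {
    val scope = currentScope.get()
    if (scope != null) {
      scope.close()
      currentScope.remove()
    }
  }
}
```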

@cmcmteixeira what do you think? I'm happy to work this up into a production-quality PR if we like, or if you'd rather do the full implementation I'm happy to defer to you.

@m50d (Contributor) commented Aug 10, 2022

I guess technically we should only restore the context on a successful return from restore (i.e. if this thread won the race to take over this fiber's run loop), but you get the idea.

@cmcmteixeira (Author)
Hey @m50d
It was a while ago that I did this, so I'm a bit blank on the details. I fear it might not be as easy as handling those function calls, but I think it's worth a try for sure!

Also, it might be useful to have a look at how they handle var localState: IOLocalState in the IOFiber. This is, if I'm not mistaken, roughly the equivalent of a ThreadLocal.
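
For illustration, IOLocal is public API in CE3, so something like the sketch below could in principle carry a Kamon Context without touching fiber internals (the Context.of/TagSet usage here is just an example, not a proposal from the PR):

```scala
import cats.effect.{IO, IOLocal}
import kamon.context.Context
import kamon.tag.TagSet

// Sketch: carry the Kamon context in an IOLocal instead of patching IOFiber's
// internals. IOLocal state lives in the fiber, so it follows the fiber across
// thread hops and is isolated for child fibers automatically.
val program: IO[Context] =
  for {
    local <- IOLocal(Context.Empty)
    _     <- local.set(Context.of(TagSet.of("key", "value")))
    ctx   <- local.get
  } yield ctx
```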

As for the implementation, you go ahead @m50d :) I unfortunately don't have the bandwidth at the moment to handle this.
