
Async Hook instrumentation significantly increases response time for Promise-heavy calls #1095

Open
blimmer opened this issue Sep 23, 2020 · 17 comments
Labels: async_hooks-perf, bug, community, core, performance

Comments

@blimmer
Contributor

blimmer commented Sep 23, 2020

Describe the bug

We noticed a significant increase in response times for GraphQL queries that produce large payloads after we enabled dd-trace for our project.

Here's what we've seen:

| Date | Node Version | dd-trace Version | scope Setting | Response Size | Response Time without dd-trace | Response Time with dd-trace | Percentage Increase |
| --- | --- | --- | --- | --- | --- | --- | --- |
| 8/12/2020 | 13.14.0 | 0.20.3 | async_hooks | ~1.8MB | 2.8s | 30s | 971.4% |
| 8/12/2020 | 14.8.0 | 0.24.0 | async_local_storage | ~1.8MB | 2.7s | 7.4s | 174.1% |
| 8/12/2020 | 14.8.0 | 0.24.0 | async_local_storage | ~150kb | 140ms | 220ms | 57.1% |
| 9/23/2020 | 14.8.0 | 0.26.0 | async_local_storage | ~3.2MB | 3.7s | 8.7s | 135.1% |
| 9/23/2020 | 14.8.0 | 0.26.0 | noop | ~3.2MB | 3.7s | 3.8s | 2.7% |
| 9/23/2020 | 14.8.0 | async_resource branch | async_resource | ~3.2MB | 3.7s | 9.9s | 167.6% |
| 9/23/2020 | 14.8.0 | N/A (disabled); `const { createHook } = require('async_hooks'); createHook({ init() {} }).enable();` | N/A (disabled) | ~3.2MB | 3.7s | 7.4s | 100% |

As you can see, async_hooks appears to be the primary reason for the slowdown, not anything specific to dd-trace.

The logic that triggers this especially pronounced behavior has one notable characteristic: the query itself produces many promises (I'm unsure of the exact number, but probably in the thousands), which we believe is exacerbating the problem.
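
As a purely hypothetical illustration (not our actual resolver code; `resolveLargePayload` and `loadRow` are made-up stand-ins), this is the kind of fan-out that multiplies the per-promise async_hooks cost:

```typescript
// Hypothetical sketch, not our actual resolver. A single query fans out into
// thousands of promises, and with async_hooks enabled every one of these
// promises triggers the v8 promise hooks (a JS -> native -> JS round trip).
async function resolveLargePayload(ids: number[]): Promise<Array<{ id: number; data: string }>> {
  return Promise.all(ids.map(async (id) => ({ id, data: await loadRow(id) })));
}

// Stand-in for a real data-loader or database call.
async function loadRow(id: number): Promise<string> {
  return `row-${id}`;
}
```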

I was working synchronously with @rochdev and @stephenh on this issue via the Datadog Slack. He mentioned that he'll add some additional detail to this issue.

Environment

  • Operating system: Docker - node:14.8.0 image running on macOS host
  • Node version: 14.8.0
  • Tracer version: 0.26.0
  • Agent version: 7.22.0
@blimmer blimmer added the bug Something isn't working label Sep 23, 2020
@rochdev
Member

rochdev commented Sep 24, 2020

For context, here is a bit of history behind this issue since it's been problematic for a while now.

Historically, context propagation libraries for Node relied on monkey patching asynchronous code to track the asynchronous context. This approach was used successfully by APM vendors for many years, even though it didn't support native code and could lose the context in some cases.

When async/await landed, monkey patching was no longer an option, so everyone had to switch to async_hooks, the official construct provided by Node for tracking the asynchronous context. Even though the module is still experimental to this day, it's currently the only way to do this, which is why it has always been the default implementation used by our scope manager in versions of Node that support it.

The main issue with async_hooks is that it's very heavy when used with promises. This is because the promise hooks exposed by v8 require a jump from JavaScript to native and back for every promise. Work has been done in the past few months to make async_hooks significantly lighter when a destroy hook is not used (for example when using AsyncLocalStorage), but there is still a fair bit of overhead when many promises are created in quick succession. This means that even though we have a new scope manager that no longer relies on the destroy hook, we still incur this overhead.
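
As a rough sketch of the two cost profiles (illustrative only, not dd-trace's actual code):

```typescript
import { createHook, AsyncLocalStorage } from "async_hooks";

// Illustrative only, not dd-trace's code. Registering a destroy hook forces
// Node to track every promise until it is garbage collected; this is the
// expensive path that the recent optimizations do not help with.
createHook({
  init(asyncId, type) { /* runs for every promise created */ },
  destroy(asyncId) { /* requires per-promise GC tracking */ },
}).enable();

// AsyncLocalStorage needs only init-time propagation, so it avoids the
// destroy-hook cost, but each promise creation still pays the
// JS -> native -> JS round trip through the v8 promise hooks.
const storage = new AsyncLocalStorage<{ requestId: number }>();
storage.run({ requestId: 1 }, async () => {
  await Promise.resolve();
  console.log(storage.getStore()); // { requestId: 1 } -- context survives awaits
});
```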

Since the overhead comes from Node and not from the tracer, and we have to use this feature to support automatic context propagation, it's unfortunately an overhead we can't realistically avoid at the tracer level. It's also worth noting that in most cases the overhead is low enough not to be problematic.

To fix this completely, we'll have to fix async_hooks in Node itself, as well as the corresponding promise hooks in v8. This is no small task and will take a while, but we'll update this issue as progress is made.

@rochdev
Member

rochdev commented Nov 24, 2020

Progress for the v8 change can be tracked here.

@rochdev
Member

rochdev commented Dec 8, 2020

Progress on the Node side can be tracked in nodejs/node#36394

@bengl bengl added the async_hooks-perf Performance issues related to async_hooks. label May 18, 2021
@bengl
Collaborator

bengl commented May 19, 2021

Node.js v16.2.0 includes significant performance improvements to async_hooks, so if you can test with that version, hopefully a lot of the overhead is alleviated.

@artur-ma

artur-ma commented May 19, 2021

@bengl I don't think there will be any improvement; even worse, I assume there will be some degradation. According to nodejs/node#36394, there is a small performance regression when a destroy hook is used (and you are using one):

Before:

```
$ ./node-master benchmark/async_hooks/promises.js
async_hooks/promises.js asyncHooks="enabled" n=1000000: 348,189.5714361237
async_hooks/promises.js asyncHooks="enabledWithDestroy" n=1000000: 212,124.80165853762
async_hooks/promises.js asyncHooks="enabledWithInitOnly" n=1000000: 391,763.36348583025
async_hooks/promises.js asyncHooks="disabled" n=1000000: 1,670,456.5232560865
```

After:

```
$ ./node benchmark/async_hooks/promises.js
async_hooks/promises.js asyncHooks="enabled" n=1000000: 1,153,166.7820469614
async_hooks/promises.js asyncHooks="enabledWithDestroy" n=1000000: 207,516.71584447258
async_hooks/promises.js asyncHooks="enabledWithInitOnly" n=1000000: 1,043,624.1457311497
async_hooks/promises.js asyncHooks="disabled" n=1000000: 1,603,204.3180576847
```

Here you set the destroy hook:
https://github.com/DataDog/dd-trace-js/blob/master/packages/dd-trace/src/scope/async_hooks.js#L37

@rochdev
Member

rochdev commented May 19, 2021

> Here you set the destroy hook

The async_hooks scope manager is only used in old versions of Node. Recent versions, such as the latest 12.x, 14.x, or 16.x, instead use the async_resource scope manager, which only uses the init hook.
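
For reference, the init-only approach can be sketched roughly like this (a simplified sketch, not dd-trace's actual implementation):

```typescript
import { createHook, executionAsyncResource } from "async_hooks";

// Simplified sketch, not dd-trace's actual implementation. Context is copied
// onto each new async resource at init time and read back from the currently
// executing resource, so no destroy hook (and therefore no per-promise GC
// tracking) is needed.
const kContext = Symbol("context");

createHook({
  init(asyncId, type, triggerAsyncId, resource) {
    const parent = executionAsyncResource() as any;
    if (parent && parent[kContext] !== undefined) {
      (resource as any)[kContext] = parent[kContext];
    }
  },
}).enable();

export function getContext(): unknown {
  return (executionAsyncResource() as any)?.[kContext];
}
```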

@stephenh
Contributor

Fwiw, today we tried bumping to node 16.2.0 to get this perf fix, and we're seeing errors in our app, which uses this admittedly somewhat esoteric pattern with async_hooks:

```typescript
import { AsyncLocalStorage } from "async_hooks";
// ...
export const currentFlushSecret = new AsyncLocalStorage<{ flushSecret: number }>();
// ...

// inside an EntityManager method:
currentFlushSecret.run({ flushSecret: this.flushSecret }, async () => {
  // ...call various methods that do...
  const { flushSecret } = currentFlushSecret.getStore() || {};
  // ...and ensure flushSecret is the expected value
  if (flushSecret !== this.flushSecret) {
    throw new Error("invalid usage detected");
  }
});
```

The idea is that only code specifically run within the run callback is allowed to touch/mutate the this / EntityManager instance, and all other mutations will be rejected. Basically we're using it as a hacky thread-local-ish mutex/lock.

Previously this "invalid usage detected" error wasn't hit in our application's code, but it is being hit now in our test suite. That suggests either a) our application code had a bug that node 16.1.0 wasn't catching, because this AsyncLocalStorage was propagating a flushSecret when it wasn't supposed to, or b) node 16.2.0 has a regression where AsyncLocalStorage is not propagating flushSecret (to one of the methods invoked directly in, or transitively from, the run callback) when it is supposed to.

The ^ code is part of an open source project:

https://github.com/stephenh/joist-ts/blob/main/packages/orm/src/EntityManager.ts#L202

Our currently failing tests are in an internal project, though. If we run the public joist-ts tests on node 16.2.0, they pass just fine (and we do have tests that specifically exercise the flush-secret behavior), so we don't have a repro yet.

We're going to work more tomorrow on getting a repro into the public joist project, and on verifying whether it is truly a regression in 16.2.0, a bug/misuse of AsyncLocalStorage in the joist project, or a bug/misuse in our application itself that is now being correctly detected.

So, I dunno, this is not a super-useful update yet, but I mostly wanted to post it as an FYI in case anyone on the datadog/nodejs side would look at the AsyncLocalStorage usage and immediately intuit "oh yes, we're doing something wrong" or "oh shoot, something in the node 16.2.0 changeover overlooked this".

@rochdev
Member

rochdev commented May 20, 2021

cc @Qard who would have the most insight to determine at a glance if there could be a regression causing this issue.

@stephenh
Contributor

stephenh commented May 20, 2021

Okay @rochdev and @Qard , we believe we have a reproduction in this PR:

https://github.com/stephenh/joist-ts/pull/122/commits

Notice that it passes on 16.1.0:

https://app.circleci.com/pipelines/github/stephenh/joist-ts/547/workflows/247c2b9d-8d01-463f-bfce-f1163a36f221/jobs/581

But fails on 16.2.0:

https://app.circleci.com/pipelines/github/stephenh/joist-ts/547/workflows/247c2b9d-8d01-463f-bfce-f1163a36f221/jobs/579

A few notes:

  • This was actually easy to reproduce in this open source project, once we stopped passing --detectOpenHandles to jest
  • Similarly, running the tests with --inspect-brk (i.e. debugging the issue in webstorm/etc) makes the issue go away and the tests pass
  • You can run an individual test (e.g. this one) and see the issue without running the whole test suite
  • The Author.test.ts suite assumes you've run make db in the packages/integration-test folder to get the postgres db set up and running. A yarn + yarn build -w in the top-level directory is also useful.
  • I've added some debugging code that shows the issue in this file

Specifically, we log the intent to do an AsyncLocalStorage.run:

```typescript
console.log(`Starting new AsyncLocalStorage.run with flushSecret=${this.flushSecret}`);
```

On the very first line within that run callback, we call getStore() to look up our local storage value and log it:

```typescript
console.log(`Inside AsyncLocalStorage.run, flushSecret=${b}`);
```

And this works: it returns 1. We then await a few helper methods and then, within the same run callback that did the first "Inside" log, do another console.log:

```typescript
console.log(`Inside AsyncLocalStorage.run, flushSecret=${a}`);
```

And this one is broken: we get back undefined instead of 1.

So, at some point between line 715 and line 731, we have lost track of the AsyncLocalStorage async resource.

Happy to chat more, but does this give enough direction to reproduce on your side? Apologies that this project doesn't have the best "Getting Started" documentation, so let me know if there are any gotchas in getting it running.

(Also, technically this is likely an issue in node itself, since this joist open source project doesn't use dd-trace; we just happened to notice it while moving our internal project, which uses both dd-trace and joist, onto node 16.2.0 to leverage the dd-trace perf wins in this ticket. Would you like us to open a new issue against nodejs directly? Happy to let you folks do that. Disclaimer: this assumes we're not doing something dumb on our side.)

Thanks!

@rochdev
Member

rochdev commented May 21, 2021

@stephenh I'm able to reproduce with the project you've shared, but without being able to debug, and with a codebase I don't know, it's difficult to find the issue. Do you think you'd be able to extract only a small part of the code that still reproduces the issue? Or maybe find the exact line where the context is lost? I'm not super familiar with Jest either, which definitely doesn't help 😅

@stephenh
Contributor

@rochdev that's a fair ask! I think we were so pleased with ourselves for having isolated the repro that we didn't push further to make it as simple as possible. :-)

It is terribly ironic you mention jest, because that might be the issue...

I've created a minimal ~10 lines of code repro here:

https://github.com/stephenh/async-local-storage-repro

I'll defer to the readme for steps, but basically if you run code that 1) accesses AsyncLocalStorage.getStore, 2) awaits a promise, and 3) accesses AsyncLocalStorage.getStore again, then on node 16.2.0 under jest, the getStore value is lost and comes back as undefined.
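
In code, the failing pattern is roughly this (a paraphrased sketch using jest's global `it`; see the repro repo for the exact test):

```typescript
import { AsyncLocalStorage } from "async_hooks";

const als = new AsyncLocalStorage<number>();

// Paraphrased sketch of the repro; see the linked repo for the exact test.
it("keeps the store across an await", async () => {
  await als.run(1, async () => {
    console.log(als.getStore()); // 1 on every setup
    await Promise.resolve();
    console.log(als.getStore()); // undefined under jest on node 16.2.0
  });
});
```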

The matrix is:

  • Node 16.2.0 + just node: works
  • Node 16.2.0 + via jest: broken
  • Node 16.1.0 + just node: works
  • Node 16.1.0 + via jest: works

@rochdev
Member

rochdev commented May 25, 2021

Possible cause of the new issue since 16.2.0: nodejs/node#38781

@blimmer
Contributor Author

blimmer commented Jun 2, 2021

It looks like a fix should land in Node 16.3.0.

@Qard
Contributor

Qard commented Aug 17, 2022

Closing as this seems to be resolved. If you disagree or there's anything I missed, let me know and I can reopen.

@Qard Qard closed this as completed Aug 17, 2022
@Qard Qard reopened this Aug 18, 2022
@blimmer
Contributor Author

blimmer commented Mar 29, 2023

Sadly, I think this is still a problem with async_hooks in general. I tested a particularly slow/complex query on a client's backend in a few scenarios (see also #2385), running Node 18.13.0 in docker on a macOS host.

| JIT enabled? | dd-trace enabled? | Average response time over 50 requests |
| --- | --- | --- |
| X | | 4064.92 ms |
| X | X | 4863.24 ms |
| | | 5898.32 ms |
| | X | 8325.02 ms |

When JIT is enabled, it significantly reduces the number of spans in our traces. As you can see from the results, enabling dd-trace makes our requests significantly slower.

I was talking with @rochdev, and he had me add a no-op async_hook to try to separate dd-trace overhead from general async_hooks overhead. The hook looked like this:

```typescript
import { createHook } from "node:async_hooks";

createHook({ init() {}, before() {}, after() {}, destroy() {} }).enable();
```

Just having an async_hook enabled, without dd-trace set up, increased the average response time from 4064.92 ms to 7994.8 ms. So even though dd-trace is adding some additional overhead, the vast majority of the problem comes from async_hooks.

It seems like maybe the async_hooks approach is a dud, since it performs so poorly?

@Qard
Contributor

Qard commented Mar 30, 2023

Yep, for exactly this reason there's been a bunch of discussion in Node.js core about a more performant replacement for AsyncLocalStorage that does not use async_hooks internally. There's unfortunately no alternative currently, so your options are to accept some overhead or have no observability. It's a lot less overhead than it used to be, but it can still be noticeable depending on how heavily an application uses promises. I'm hopeful that we'll have something much better in future Node.js versions though. :)

@rochdev
Member

rochdev commented Mar 30, 2023

Discussion about the above can be found in nodejs/node#46265
