Async Hook instrumentation significantly increases response time for Promise-heavy calls #1095

blimmer · 2020-09-23T18:07:35Z

Describe the bug

We noticed a significant increase in response times for GraphQL queries that produce large payloads after we enabled dd-trace for our project.

Here's what we've seen:

Date	Node Version	dd-trace Version	`scope` Setting	Response Size	Response Time without `dd-trace`	Response Time with `dd-trace`	Percentage Increase
8/12/2020	13.14.0	0.20.3	`async_hooks`	~1.8MB	2.8s	30s	971.4%
8/12/2020	14.8.0	0.24.0	`async_local_storage`	~1.8MB	2.7s	7.4s	174.1%
8/12/2020	14.8.0	0.24.0	`async_local_storage`	~150kb	140ms	220ms	57.1%
9/23/2020	14.8.0	0.26.0	`async_local_storage`	~3.2MB	3.7s	8.7s	135.1%
9/23/2020	14.8.0	0.26.0	`noop`	~3.2MB	3.7s	3.8s	2.7%
9/23/2020	14.8.0	`async_resource` branch	`async_resource`	~3.2MB	3.7s	9.9s	167.6%
9/23/2020	14.8.0	N/A (disabled) `const { createHook } = require('async_hooks'); createHook({ init() {} }).enable();`	N/A (disabled)	~3.2MB	3.7s	7.4s	100%

As you can see, it looks like async_hooks are the primary reason we see the slowdown, and not necessarily anything specific to dd-trace.

The logic that triggers this especially pronounced behavior uses the following:

Joist ORM, which uses dataloader under the hood
TypeScript transpiled code, targetting ES2019, meaning we're using async/await vs. Promise

The query itself produces many promises (I'm unsure of the exact number, but probably in the thousands), which we believe is exacerbating the problem.

I was working synchronously with @rochdev and @stephenh on this issue via the Datadog Slack. He mentioned that he'll add some additional detail to this issue.

Environment

Operation system: Docker - node:14.8.0 image running on macOS host
Node version: 14.8.0
Tracer version: 0.26.0
Agent version: 7.22.0

The text was updated successfully, but these errors were encountered:

rochdev · 2020-09-24T16:32:22Z

For context, here is a bit of history behind this issue since it's been problematic for a while now.

Historically, context propagation APIs for Node have been using monkey patching on asynchronous code to track the asynchronous context. This approach has been used successfully by APM vendors for many years, even though it didn't support native code and could lose the context in some cases.

When async/await landed, monkey patching was no longer an option so everyone had to switch to async_hooks which is the official construct provided by Node to track the asynchronous context. Even though the module is still experimental to this day, it's the only way currently to do this. This is why it's always been the default implementation used by our scope manager in versions of Node that support it.

The main issue with async_hooks is that it's very heavy when used with promises. This is because the promise hooks exposed by v8 require a jump from JavaScript to native and back for every promise. There has been work done in the past few months to make async_hooks significantly lighter when a destroy hook is not used, for example when using AsyncLocalStorage, but there is still a fair bit of overhead when a lot of promises are created in quick successions. This means that even though we have a new scope manager that no longer relies on the destroy hook, we still incur this overhead.

Since the overhead comes from Node and not from the tracer, and we have to use this feature to support automatic context propagation, this is unfortunately an overhead we can't realistically avoid at the tracer level. It's also worth noting that in most cases, the overhead is low enough that it's not problematic.

In order to fix this completely, we'll have to fix async_hooks in Node itself, and also the corresponding promise hooks in v8. This is not small task and will take a while, but we'll update this issue as progress is done.

rochdev · 2020-11-24T18:03:44Z

Progress for the v8 change can be tracked here.

rochdev · 2020-12-08T20:54:58Z

Progress on the Node side can be tracked in nodejs/node#36394

bengl · 2021-05-19T17:48:56Z

In Node.js v16.2.0, significant performance improvements in async_hooks are present, so if you can test with this version of Node.js, hopefully that alleviates a lot of the overhead.

artur-ma · 2021-05-19T22:37:39Z

@bengl I dont think there will be any improvement, even worse, I assume there will be some degradation
according to this PR
nodejs/node#36394
when using destroy hook (you are using it), there is some small degradation in performance

Before:

$ ./node-master benchmark/async_hooks/promises.js
async_hooks/promises.js asyncHooks="enabled" n=1000000: 348,189.5714361237
async_hooks/promises.js asyncHooks="enabledWithDestroy" n=1000000: 212,124.80165853762
async_hooks/promises.js asyncHooks="enabledWithInitOnly" n=1000000: 391,763.36348583025
async_hooks/promises.js asyncHooks="disabled" n=1000000: 1,670,456.5232560865

After:

$ ./node benchmark/async_hooks/promises.js  
async_hooks/promises.js asyncHooks="enabled" n=1000000: 1,153,166.7820469614
async_hooks/promises.js asyncHooks="enabledWithDestroy" n=1000000: 207,516.71584447258
async_hooks/promises.js asyncHooks="enabledWithInitOnly" n=1000000: 1,043,624.1457311497
async_hooks/promises.js asyncHooks="disabled" n=1000000: 1,603,204.3180576847

Here u set the destroy hook
https://github.com/DataDog/dd-trace-js/blob/master/packages/dd-trace/src/scope/async_hooks.js#L37

rochdev · 2021-05-19T23:25:04Z

Here u set the destroy hook

The async_hooks scope manager is only used in old versions of Node. Any recent versions such as latest 12.x, 14.x or 16.x use instead the async_resource scope manager which only uses init.

stephenh · 2021-05-19T23:39:49Z

Fwiw today we tried bumping to node 16.2.0 to get this perf fix, and are seeing errors in our app that does this ~admittedly somewhat esoteric pattern with async_hooks:

import { AsyncLocalStorage } from "async_hooks";
...
export const currentFlushSecret = new AsyncLocalStorage<{ flushSecret: number }>();
....
currentFlushSecret.run({ flushSecret: this.flushSecret }, async () => {
  // ...call various methods that do...
  const { flushSecret } = currentFlushSecret.getStore() || {};
  // ...and ensure flushSecret is the expected value
  if (flushSecret !== this.flushSecret) {
    throw new Error("invalid usage detected");
  }
});

Where the idea is that only code that is specifically ran within the run callback is allowed to touch/mutate the this / EntityManager instance, and all others mutations will be rejected. Basically we're using it as a hacky thread-local-ish mutex/lock.

Previously this "invalid usage detected" wasn't hit in our application's code, but it is being hit now in our test suite, either insinuating that a) our application code had a bug that node 16.1.0 wasn't catching because this AsyncLocalStorage was propagating a flushSecret when it wasn't supposed to, or b) node 16.2.0 has a regression where AsyncLocalStorage is not propagating flushSecret (to one of the methods invoked directly in, or transitively from, the run callback) when it is supposed to.

The ^ code is part of an open source project:

https://github.com/stephenh/joist-ts/blob/main/packages/orm/src/EntityManager.ts#L202

Although our currently failing tests are in an internal project. If we run the joist-ts public tests on node 16.2.0, they work just fine (and we do have tests that specifically exercise the flush secret behavior), so we don't have a repro yet.

We're going to work more tomorrow on getting a repro done, in the public joist project, and verifying whether it is truly a regression in 16.2.0 or else just a bug/misuse of AsyncLocalStorage (in this joist project) or a big/misuse in our application itself (that is now getting correctly detected).

So, I dunno, this is not a super-useful update yet, but mostly wanted to post as an FYI in case anyone on the datadog/nodejs side of things would look at the AsyncLocalStorage usage and immediately intuit "oh yes we're doing something wrong" or "oh shoot something in the node 16.2.0 change over looked this".

rochdev · 2021-05-20T01:10:17Z

cc @Qard who would have the most insight to determine at a glance if there could be a regression causing this issue.

stephenh · 2021-05-20T20:23:11Z

Okay @rochdev and @Qard , we believe we have a reproduction in this PR:

https://github.com/stephenh/joist-ts/pull/122/commits

Notice that it passes on 16.1.0:

https://app.circleci.com/pipelines/github/stephenh/joist-ts/547/workflows/247c2b9d-8d01-463f-bfce-f1163a36f221/jobs/581

But fails on 16.2.0:

https://app.circleci.com/pipelines/github/stephenh/joist-ts/547/workflows/247c2b9d-8d01-463f-bfce-f1163a36f221/jobs/579

A few notes:

This was actually easy to reproduce in this open source project, once we stopped passing --detectOpenHandles to jest
Similarly, running the tests with --inspect-brk (i.e. debugging the issue in webstorm/etc) makes the issue go away and the tests pass
You can run an individual test i.e. this one and see the issue without running the whole test suite
The Author.test.ts assumes you run the make db in the packages/integration-test folder to get the postgres db setup and running. Also a yarn + yarn build -w in the top-level directly is useful.
I've added some debugging code that shows the issue in this file

Specifically we log the intent to do an AsyncLocalStorage.run:

console.log(`Starting new AsyncLocalStorage.run with flushSecret=${this.flushSecret}`);

On the very first line within that run callback, we call getStore() to look up our local storage value:

console.log(`Inside AsyncLocalStorage.run, flushSecret=${b}`);

And this works, it returns 1. We then do a few await ...helper methods... and then, within the same run callback that did the "Inside" log, do another console.log:

console.log(`Inside AsyncLocalStorage.run, flushSecret=${a}`);

And this one is broken, we get back undefined instead of 1.

So, at some point between line 715 and line 731, we have lost track of the AsyncLocalStorage async resource.

Happy to chat more, but does this give enough direction to reproduce on your side? Apologies that this project doesn't have the best "Getting Started"/etc documentation, so let me know if there are any gotchas in getting in running.

(Also, technically this is likely an issue in node itself, as this joist open source project doesn't use ddtrace, we just happened to notice the issue while getting our internal project that uses both ddtrace and joist onto node 16.2.0, to leverage the ddtrace perf wins in this ticket. Would you like us open a new issue against nodejs directly? Happy to let you guys do that. Disclaimer assuming we're not doing something dumb on our side.)

Thanks!

rochdev · 2021-05-21T17:51:44Z

@stephenh I'm able to reproduce with the project you've shared, but without being able to debug and with a codebase I don't know it's difficult to try to find the issue. Do you think you'd be able to extract only a small part of the code that would still reproduce the issue? Or maybe find the exact line where the context is lost? I'm not super familiar with Jest either which definitely doesn't help 😅

stephenh · 2021-05-21T18:33:56Z

@rochdev that's a fair ask! I think we were so pleased with ourselves to having isolated the repro that we didn't push farther to make it as simple as possible. :-)

It is terribly ironic you mention jest, because that might be the issue...

I've created a minimal ~10 lines of code repro here:

https://github.com/stephenh/async-local-storage-repro

I'll defer to the readme for steps, but basically if you run code that does: 1) access AsyncLocalStorage.getStore 2) await a promise, 3) access AsyncLocalStorage.getStore again, on node 16.2.0 and jest, the getStore value is lost and comes back as undefined.

The matrix is:

Node 16.2.0 + just node: works
Node 16.2.0 + via jest: broken
Node 16.1.0 + just node: works
Node 16.1.0 + via jest: works

rochdev · 2021-05-25T20:07:55Z

Possible cause of the new issue since 16.2.0: nodejs/node#38781

blimmer · 2021-06-02T20:05:33Z

It looks like a fix should land in Node 16.3.0.

Qard · 2022-08-17T11:45:36Z

Closing as this seems to be resolved. If you disagree or there's anything I missed, let me know and I can reopen.

blimmer · 2023-03-29T23:40:54Z

Sadly, I think this is still a problem with async_hooks in general. I tested a particularly slow/complex query on a client's backend in a few various scenarios (see also #2385). Node 18.13.0 in docker on MacOS.

JIT enabled?	dd-trace enabled?	Average response time over 50 requests
√	X	4064.92 ms
X	X	4863.24 ms
√	√	5898.32 ms
X	√	8325.02 ms

When JIT is enabled, it significantly reduces the number of spans in our traces. As you can see from the results, enabling dd-trace makes our requests significantly slower.

I was talking with @rochdev and he had me add a no-op async_hook to try to separate dd-trace from general async_hook overhead. The hook looked like this:

import { createHook } from "node:async_hooks";

createHook({ init() {}, before() {}, after() {}, destroy() {} }).enable();

Just having an async_hook without dd-trace set up increased the average response time from 4064.92 ms to 7994.8 ms. So, even though dd-trace is adding some additional overhead, the vast majority of the problems comes from async_hooks.

It seems like maybe the async_hooks approach might be a dud, since they're so unperformant?

Qard · 2023-03-30T00:02:18Z

Yep, for exactly this reason there's been a bunch of discussion in Node.js core about a more performant replacement for AsyncLocalStorage which does not use async_hooks internally. There's unfortunately no alternative currently, so your option is accept some overhead or have no observability. It's a lot less overhead than it used to be, but can still be noticeable depending on how heavily an application uses promises. I'm hopeful that we'll have something much better in future Node.js versions though. :)

rochdev · 2023-03-30T02:07:59Z

Discussion about the above can be found in nodejs/node#46265

blimmer added the bug Something isn't working label Sep 23, 2020

rochdev added community core labels Sep 23, 2020

rochdev mentioned this issue Feb 24, 2021

add synchronous scope manager #1252

Merged

rochdev mentioned this issue May 13, 2021

Node process performance degradation #272

Closed

bengl added the async_hooks-perf Performance issues related to async_hooks. label May 18, 2021

bengl mentioned this issue May 19, 2021

Performance of async_hooks scope manager #695

Closed

blimmer mentioned this issue May 20, 2021

chore: show Node 16.2.0 issue in CI joist-orm/joist-orm#122

Closed

3 tasks

Qard mentioned this issue May 22, 2021

Node 16.2.0 with Jest break AsyncLocalStorage fastify/fastify-request-context#49

Closed

2 tasks

kibertoad mentioned this issue May 22, 2021

Jest breaks AsyncLocalStorage when running Node 16.2.0 jestjs/jest#11435

Closed

rochdev mentioned this issue Jul 1, 2021

Multiple spans per request + 50% performance drop #1431

Closed

bengl added the performance label Apr 14, 2022

Qard closed this as completed Aug 17, 2022

Qard reopened this Aug 18, 2022

temich mentioned this issue Feb 25, 2024

refactor(exposition): remove express toa-io/toa#525

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Async Hook instrumentation significantly increases response time for Promise-heavy calls #1095

Async Hook instrumentation significantly increases response time for Promise-heavy calls #1095

blimmer commented Sep 23, 2020 •

edited

Loading

rochdev commented Sep 24, 2020

rochdev commented Nov 24, 2020

rochdev commented Dec 8, 2020

bengl commented May 19, 2021

artur-ma commented May 19, 2021 •

edited

Loading

rochdev commented May 19, 2021

stephenh commented May 19, 2021

rochdev commented May 20, 2021

stephenh commented May 20, 2021 •

edited

Loading

rochdev commented May 21, 2021

stephenh commented May 21, 2021

rochdev commented May 25, 2021

blimmer commented Jun 2, 2021

Qard commented Aug 17, 2022

blimmer commented Mar 29, 2023

Qard commented Mar 30, 2023

rochdev commented Mar 30, 2023

Async Hook instrumentation significantly increases response time for Promise-heavy calls #1095

Async Hook instrumentation significantly increases response time for Promise-heavy calls #1095

Comments

blimmer commented Sep 23, 2020 • edited Loading

rochdev commented Sep 24, 2020

rochdev commented Nov 24, 2020

rochdev commented Dec 8, 2020

bengl commented May 19, 2021

artur-ma commented May 19, 2021 • edited Loading

rochdev commented May 19, 2021

stephenh commented May 19, 2021

rochdev commented May 20, 2021

stephenh commented May 20, 2021 • edited Loading

rochdev commented May 21, 2021

stephenh commented May 21, 2021

rochdev commented May 25, 2021

blimmer commented Jun 2, 2021

Qard commented Aug 17, 2022

blimmer commented Mar 29, 2023

Qard commented Mar 30, 2023

rochdev commented Mar 30, 2023

blimmer commented Sep 23, 2020 •

edited

Loading

artur-ma commented May 19, 2021 •

edited

Loading

stephenh commented May 20, 2021 •

edited

Loading