-
Notifications
You must be signed in to change notification settings - Fork 29.6k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
async_hooks: fix async/await context loss in AsyncLocalStorage #33189
Conversation
9470ecf
to
fe17123
Compare
fe17123
to
ec6b048
Compare
cc @nodejs/async_hooks @nodejs/diagnostics |
Yes, the problem exists in bluebird too. Every userland implementation of promises will have the same problem because it doesn't matter how the userland implementation itself works, it matters how the native promise facilities to upgrade a thenable to a native promise for the The problem is outlined here: const cls = new AsyncLocalStorage()
const data = { foo: 'bar' }
function thenable() {
return {
then(cb) {
assert.strictEqual(data, cls.getStore()) // fail
cb()
}
}
}
cls.run(data, async () => {
assert.strictEqual(data, cls.getStore()) // pass
await thenable()
assert.strictEqual(data, cls.getStore()) // pass
}) So basically, there is a deadzone within the thenable itself between when the Ideally this should be fixed at a lower level, and I'm working on that. This is just a possible fix which I know works at this level. |
What do you think about a configuration passed to the |
The problem shown in your snippet can be easily fixed in the thenable implementation (thus, in user-land) by integrating with I've slightly modified your snippet to make it a valid test and added the missing 'use strict';
require('../common');
const assert = require('assert');
const { AsyncLocalStorage, AsyncResource } = require('async_hooks');
const asyncLocalStorage = new AsyncLocalStorage();
const data = { foo: 'bar' };
function thenable() {
const res = new AsyncResource('Thenable'); // here is the main part
return {
then(cb) {
res.runInAsyncScope(() => { // this call is also important
assert.strictEqual(data, asyncLocalStorage.getStore());
cb();
});
}
};
}
async function main() {
await asyncLocalStorage.run(data, async () => {
assert.strictEqual(data, asyncLocalStorage.getStore());
await thenable();
assert.strictEqual(data, asyncLocalStorage.getStore());
});
}
main(); |
That works in the case where you have a factory function which you can put the construction of the |
For context, here's the That is merged into a bunch of other classes, none of which have sufficient awareness of the point at which the runner would be invoked to be able to know when to create an |
Another side note: a fix of some sort for this needs to exist in core for domains to be stable. Domains need to be able to track through internals to be able to adequately contain the execution and internal errors. |
}, | ||
|
||
after(asyncId) { | ||
depth--; | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I’m -1 to this fix. This is going to slow everything down because of the use use of before and after hooks.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm thinking a config option like trackAsyncAwait
which would switch between two hook sets so it only does the before/after if a user of AsyncLocalStorage
has explicitly requested it. What do you think?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not sure. I think we should investigate if there is another possible fix first, this API is extremely nice and losing so much perf would not be good for the ecosystem.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yep, I'm trying to figure out a lower-level fix. I just made this as a possible higher-level solution for now, until we can come up with something better. Agreed it's not great though, needing the extra before/after hooks.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Considering that this problem is not new of AL, I don't think this should be rushed in.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yep, that's fine. Just putting something up that I can iterate on. If the possible lower-level fix is what it needs to be to land, so be it. :)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That works in the case where you have a factory function which you can put the construction of the
AsyncResource
in, but the reality is that the vast majority of thenable issues derive from fluid or reusable objects.
If a custom thenable can return different results on multiple calls to then
, it's broken from the Promise/A+ spec perspective. We shouldn't try to fix such libraries in the core.
In case of knex, which seems to follow the spec and return the same object, an AsyncResource
could be created at the beginning of the fluent chain and when then
is called a runInAsyncScope
function could be called, just like it's done in #33189 (comment). As that comment shows, it's enough to integrate a custom thenable with async/await + AsyncLocalStorage
.
This means that almost all existing cases of
AsyncResource
used in the wild today do not work with async/await.
Could you provide examples of situations when a well-behaved custom thenable can't be integrated with async_hooks
by using AsyncResource
? If there are any and this PR fixes them, they should be covered with tests.
In general, I don't see how this PR fixes any real problems with custom thenables. IMO we already have the proper solutions, namely, AsyncResource
and native promises. Both of these options will integrate user-land module with async_hooks
-based CLS implementations, when used properly. This statement also applies to non-async/await promise syntax.
If that's required, I can also go into more details of potential context loss cases in user-land modules, that I'm aware of.
The problem though is that the before/after will link back to where the With knex, the You can do You can also reuse intermediate results: const baseQuery = knex('accounts as a1')
.leftJoin('accounts as a2', function() {
this.on('a1.email', '<>', 'a2.email');
})
.select(['a1.email', 'a2.email'])
const data = await baseQuery.where(knex.raw('a1.id = 1'))
const data = await baseQuery.where(knex.raw('a1.id = 2')) The So where would the The cognitive burden of expecting the ecosystem to fix our context loss problems for us instead of fixing it properly ourselves can get very high. We need to fix this properly or we will continue to have modules like this which break context, forcing us to do even worse hacks to attempt to restore it. I'm not arguing for this fix specifically but that we absolutely need some fix for this. |
Looks like If it's not the case and knex doesn't have the the concept of terminal operations (which sounds like a bad design to me), they could at least introduce a list of potentially terminal operations and only create A side note. The concept of reusable thenables seems to be a broken design to me. Custom thenables should behave like native Promises, i.e. once resolved they always return the same result. In case of knex it seems that the query object serves as a thenable and it can be mutated. What's even worse, each mutation can return a different result in calls of
Considering your arguments, I still don't think we should fix anything, as we do provide Embedder API to do the job. Instead we should educate library maintainers how to integrate correctly with |
My thoughts on this: If we don't fix this in core we may end up in having userland cls modules including such a fix which then "just work for users" but |
However, these is a much simpler way to correctly integrate with
The problem with this workaround is that we don't have any user reports that describe real world bugs. Thus, this workaround seems to be synthetic and claims to fix the problem for libraries that have weird thenable API, like knex. On the other hand, client libraries for databases and data stores that deal with TCP sockets trigger operation result callbacks on |
@Qard I do not understand which usage of knex is broken with our current implementation. I tried something like this in a request handler: // async/await
queryBuilder.where(something).andWhere(somethingElse)
console.log(cls.getStore());
const result = await queryBuilder;
console.log(cls.getStore());
// then
queryBuilder.where(something).andWhere(somethingElse)
console.log(cls.getStore());
queryBuilder.then((result) => {
console.log(cls.getStore());
}); In all cases, |
@targos The problem is not continuing context, it is internal context. If you need to get the store from somewhere within the execution of the As shown in #33189 (comment), the context is available before and after a thenable, but not within it. @puzpuzpuz We don't see reports here because the issue only surfaces in APM products, the actual users of this API. As someone that has worked in APM for half a decade building three different APM products, I am reporting this. I have seen this issue reported on my agents literally hundreds of times and have had to do horrible and fragile hacks to attempt to get around this. |
Could you check my considerations above? Your fix may not resolve the problem in many cases. Also, wouldn't it better to submit patches against such libraries instead of trying to place a (partial) workaround in the core? I don't think there are a lot of popular libraries that belong to the discussed group. |
As I already explained: some libraries, especially knex, have no clear place to put a fix. I have patched a few libraries with As for your comment about it being a "partial" fix--I don't see how that is the case. It certainly doesn't 100% eliminate all cases where |
If I would be working on APM and would be considering monkey patching knex to make it I would allocate an if (this.thenCalled) {
this.asyncResource = new AsyncResource('KnexQuery');
this.thenCalled = false;
} Of course, the callback in The idea here is that the Hopefully, you'll find this helpful.
It's not necessary about queuing. See #33189 (comment) |
That's the comment I'm referring to. The issue you mention is separate and easily patchable if the context is available somewhere else in the chain. Without a fix for the thenables issue however, there might not be a context accessible at the appropriate place to patch that. |
Do you see any problems with the monkey patch logic described in #33189 (comment) ? That logic could be submitted as a patch for knex (protected by a config option), if the authors are totally against migrating to native promises. |
Lots of potential timing issues. What if you do two queries at the same time with Also, using native promises would require a breaking change which wouldn't help all the people stuck on the older versions. APM vendors need to support whatever module set the customer has in their app. We can't just tell them they can't use our product unless they update all their modules. Many dependencies they will not have control of, and actually quite often APM gets used by companies that built something once, years ago, and do not intend to update the code until absolutely necessary. I know of at least one very major company still using Node.js 0.8 in production because it just works, and they will probably never update it until either the server catches fire or they go out of business. 🙄 |
I put it behind an option. I'll continue trying to figure out a lower-level fix. |
Promise/A+ spec assumes that thenables expose a spec-compliant
I'm afraid that such generalization is not applicable here. I wouldn't be having concerns if the following would not hold:
Once again, if the fix is hidden under a constructor option and there is a consensus in the work group, I'm willing to change my vote to neutral. |
} | ||
}); | ||
|
||
class AsyncLocalStorage { | ||
constructor() { | ||
constructor({ trackAsyncAwait = false } = {}) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not blocking comment: I'm not a big fan of user facing options which could be even named disableProblems
...
Anyhow, some doc update is needed. And maybe we should use a more clear name as async await is tracked in general.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
My (strong) -1 to this approach still stand. Adding an option is not enough because the problematic bits are the before
and after
hooks.
I would also note that the name of the option is misleading, as async/await works fine. The problems are thenables.
I would recommend an alternative approach, something like a wrap()
method that a user can wrap a problematic thenable with to fix this problem instead of providing a blanket solution that will inevitably slow everything down.
|
Is there some way to slow down only custom thenables without impacting native promises? |
@mcollina The before and after hooks only exist when someone has explicitly asked for them now though, so that shouldn't be so much of an issue. If there are no For the option name, I'm open to whatever name people think is clearest. :) As for the manual wrap idea, that's already basically what One of the main reasons the "thenables" concept exists is to support ecosystem promise implementations like bluebird. Without support for it in async_hooks to correctly assign ids to thenables, all these libraries are needing to patch themselves to not break the world. This is problematic, especially given that this means that anyone writing and publishing a Node.js module to do anything async needs to be aware of this or they might publish something that breaks the world. That adds an extra cognitive burden to the focus of every single developer that wants to write stuff with Node.js. In its current state, async_hooks is fundamentally broken because within any awaited thenable it will have zero as a trigger id and an empty resource stack, which should be impossible. Thenables are not kicked off within a context we control so it starts off with no async_hooks context whatsoever and then could potentially create an entire orphaned chain of async behaviour beneath it. Because of how ubiquitous thenables are in userland--almost every single database or cache module uses them--async_hooks very often gets broken in user apps. And it's not just a small context loss here or there. In most cases, they lose 90% of the async graph that APM products attempt to capture. I've worked with four different APM vendors and all have had by far the largest user complaint was context loss resulting in little to know data in the APM dashboard. I've seen literally hundreds, if not thousands, of users complaining about it on the various APM products I've worked on. And those are just the ones vocal enough to bother contacting support about it or opening an issue. I'd bet many more just give up and uninstall without reporting. I get that |
@Qard do you have the simplest example of using a thenable that breaks the context, and then the additional code you'd need to fix the problem with AsyncResource? That might clarify/highlight the burden we'd be asking package maintainers to take on. |
Is there anything I can do on the V8 side to help fix this? |
I'm working on a gist explaining exactly how to reproduce the problem, what the implications are, and abstractly what needs to be solved to "correctly" handle it. I'll post it here when I'm done. |
Here's the gist explaining the problem in more detail: https://gist.github.com/Qard/faad53ba2368db54c95828365751d7bc If you have any more questions, I'll try to expand the examples further. |
@Qard, that is a great write up! One thing I thing that would be worth adding is what the developer would have to do in the first example to resolve the problem using AsyncResource. You follow up with the wrong things people do, but I think showing what we'd expect people to do for every |
Good suggestion @mhdawson. I've expanded that a bit, rearranged the doc some and elaborated in some other areas too. I'm also going to add something later elaborating on the technical details of exactly where in the Node.js core code this issue is happening and the technical reasons why it's difficult to solve. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@Qard thanks for writing the problem description gist.
As the current implementation doesn't change default behavior and performance of ALS, I'm changing my review result to neutral. IMO if this fix gives ALS some users among APM tools, that would be great.
I also think that the option name could be changed to something more obvious (say, forceThenablesPropagation
or something like that?) and it should be properly documented.
@Qard it might also be worth it to add a link to the (excellent) gist in the PR description |
I'm a bit on the fence about if it should be documented at all. It'd be a step beyond just "experimental" more toward "you probably shouldn't use this unless you really know what you are doing and why" 🤔 Also, my hope is that even if this does manage to land that we can figure out a lower-level fix and therefore not even need it later. This PR is basically an experiment and demonstration that the issue can be fixed without needing to hack up V8, and also a possible temporary fix, if we decide we want it enough. |
If this PR gets landed, it should include documentation changes. The "you probably shouldn't use this unless you really know what you are doing and why" can be included in the doc section. Otherwise, this option stays invisible for users who might be interested in the workaround (APM devs), which kills the idea. As for a low-level fix, it would be great to have it instead of this workaround. |
For an end users building applications which use the one or the other framework with some plugin,... it's mostly impossible to find out if someone uses thenables. I agree with @Qard that adding and option and try to document it understandable is most likely not the best choice. If an option is added I would vote for having the variant supporting thenables enabled on default and allow users to disable it and warn in the docs about context loss. As a side note: If we go the way to allow users to trade functionality vs performance we should even allow users to disable the whole (expensive) Promise tracking as not all applications use promises. |
I would like to make my objection clear: if this was pointed out when AsyncLocalStorage landed, I would have -1 that. This should be fixed properly and not with a band-aid that has extreme performance cost. In other term, I think a Node.js without I would be ok in adding an experimental warning to |
Should this be closed in favor of #33560? |
Yep. Closing. |
This fixes the thenables issue discussed in #22360 at the AsyncLocalStorage level.
I think it may also be possible to fix it at a lower level without necessarily needing to make changes to the PromiseHook stuff in V8. That could maybe allow it to work in async_hooks too, but I think that'll take somewhat more work to figure out. I can always undo this later if I figure out the lower-level fix. 🤔
A more in-depth explanation of the problem can be found here: https://gist.github.com/Qard/faad53ba2368db54c95828365751d7bc
Checklist
make -j4 test
(UNIX), orvcbuild test
(Windows) passes