Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ESM loaders need a way to list exports in load() #47888

Closed
bengl opened this issue May 5, 2023 · 25 comments
Closed

ESM loaders need a way to list exports in load() #47888

bengl opened this issue May 5, 2023 · 25 comments
Labels
discuss Issues opened for discussions and feedbacks. esm Issues and PRs related to the ECMAScript Modules implementation. known limitation Issues that are identified as known limitations. loaders Issues and PRs related to ES module loaders

Comments

@bengl
Copy link
Member

bengl commented May 5, 2023

Hello Node.js! Today I'm wearing my Datadog APM hat!

import-in-the-middle (IITM) exists to roughly replicate the behaviour of require-in-the-middle, but for ESM. Exported bindings needed to be made editable externally to a given module. ESM doesn't allow for this like CommonJS does, so IITM stores exports for a given module in a wrapper's scope, and exports those identifiers. ESM lets modules modify their own exports dynamically, so this approach gives IITM a backdoor to modify exports in arbitrary modules. (AFAIK this is because what's actually exported is the variable binding, and not the value it's bound to.)

In IITM, the code inside the load() hook looks like this (adapted from code put together by @devsnek):

https://github.com/DataDog/import-in-the-middle/blob/b49545c48994d5268b04a96dc5c83f25a8be55b7/hook.js#L113-L131

On line 114, the module to be wrapped is directly imported in order to get the list of export names. This allows us to define a module-scoped variable for each export from the wrapped module, and re-export it in the wrapper. We also take this opportunity to register a setter that's used later on by users of IITM to modify the module.

In Node.js v20.x, this doesn't seem to work. IITM tests fail, showing that things intended to be wrapped aren't being wrapped at all. My guess is that because we're dynamically importing the intended module in the load() hook on the loader thread, previous assumptions about how the modules are loaded recursively don't apply. Even if we can safely load all of the tree in the loader thread (which might not be the case), we'd then be loading the entire dependency tree twice. Without a workaround, solutions for shimming (e.g. for APM purposes) are severely limited.

The only reason we're doing this import on line 114 is to get the list of exports that we need to make module-scoped variables for. If we can't do that, then we have a number of options:

Option 1: Use a parser in the hook

If we simply parse the source of the file, we can can find its export list statically, and not bother with the import tree at all in the loader thread. This would generally work, but including a parser bloats IITM and increases the overhead it introduces at startup.

Option 2: Provide access to Node.js' built-in parser

acorn is used by some existing Node.js features, and could be provided as a new built-in module in Node.js. If desired, it could be restricted to the loader environment. This would solve the package bloat problem for IITM, but wouldn't help with startup overhead.

Option 3: Provide a list of export names as a parameter to the load() hook

I'm not sure how this would be done. Probably with parsing? Perhaps with parsing that's already being done for other reasons? Either way, it would completely trivialize this problem.

Option 4: Discard the IITM concept

Overall, it's not great to be just subverting ESM's mutability rules in order to modify stuff post-load, but it's certainly the easiest way to map our existing CommonJS instrumentations for APM to ESM. We could re-work the concept entirely and do all our shimming in one-off loaders for each module we need to shim, but that would mean rewriting pretty much all of our instrumentation code. That's not even counting the weirder interactions between ESM apps and CommonJS dependencies that we'd have to account for here. This is an effort we'd like to avoid, since we'd like to see a future where TracingChannel enables us to avoid shimming entirely.

Option 5: Revert to (or allow for reverting to) single-threaded behaviour

If we could revert back to the single-threaded behaviour, even via a flag, this would likely be resolved for us.

@GeoffreyBooth
Copy link
Member

GeoffreyBooth commented May 5, 2023

To respond to each option:

  1. This is the only certain solution for preserving IITM’s current design.

  2. There would probably be reluctance to this, because acorn is only used in very limited circumstances (like for certain REPL things) and it’s not what gets used for actual parsing; that’s done by V8. As far as I know there’s no way to access V8’s parsing results, or else we would use those instead of acorn; however it might be possible to access V8’s linking results (what exported names it finds for each module) as I think we’re already doing this in the ESM Loader somewhere, such as in ModuleMap. If we’re accessing those from V8 already somehow, and this API isn’t already exposed, exposing it could be an option. Beware though that it might not be possible to get V8’s linking results without both parsing and evaluating the code, which would presumably eliminate this option as I assume you don’t want to risk modules’ evaluations having side effects. But you should look into it; it might be possible to ask V8 to parse and link and stop there, in which case you have a solution.

  3. This doesn’t make sense to me. How would Node know a module’s exported names before the module has been loaded? The module isn’t even read from disk until within the load hook.

  4. This really should be the goal, but I understand it may take a while to get here.

  5. Reverting moving the loaders off-thread is a nonstarter. What you could do is include logic on both sides, e.g. --import iitm --loader iitm, and communicate across the threads as needed; this should allow you to achieve the same outcome as the previous single-threaded architecture, I would think.

@Qard
Copy link
Member

Qard commented May 5, 2023

cc @nodejs/diagnostics @nodejs/loaders

@Qard Qard added esm Issues and PRs related to the ECMAScript Modules implementation. loaders Issues and PRs related to ES module loaders discuss Issues opened for discussions and feedbacks. known limitation Issues that are identified as known limitations. labels May 5, 2023
@JakobJingleheimer
Copy link
Member

  1. Reverting moving the loaders off-thread is a nonstarter. What you could do is include logic on both sides, e.g. --import iitm --loader iitm, and communicate across the threads as needed; this should allow you to achieve the same outcome as the previous single-threaded architecture, I would think.

This sounds the most promising to me. Node doesn't have the info you're looking for until later in the process, so trying to do everything in a load hook is a bit cart-before-the-horse: what exports exist are the product of the load chain, so you can't know this until the end of the chain. Even if node were to be able know before the end, introducing this would likely be a very expensive cost that everyone would have to pay regardless of whether they need it (and I think most would not need it). I think it would be better to just do it after the load chain finishes.

Is there a particular reason for restricting replacements to only exports that already existed? I think that is where your hang-up is; testdouble do something similar but, I believe, don't have that restriction—precluding this issue for them.

@benjamingr
Copy link
Member

Isn't 4 pretty simple to do? (Just publish to the diagnostic channel when an import is resolved if a hook/flag is passed)

@Qard
Copy link
Member

Qard commented May 6, 2023

5. Reverting moving the loaders off-thread is a nonstarter. What you could do is include logic on both sides, e.g. --import iitm --loader iitm, and communicate across the threads as needed; this should allow you to achieve the same outcome as the previous single-threaded architecture, I would think.

How would that work? We need to rewrite the module text to register a function to modify the value of the reference being exported so we need the names of those exports during load to know what code to generate. You can see how it currently works in the code linked by Bryan: https://github.com/DataDog/import-in-the-middle/blob/b49545c48994d5268b04a96dc5c83f25a8be55b7/hook.js#L113-L131

Isn't 4 pretty simple to do? (Just publish to the diagnostic channel when an import is resolved if a hook/flag is passed)

I have #44340 to add diagnostics_channel events to the module lifecycle, which I'll try to get to updating soon. That's not really a solution though unless we pass the module record in as mutable, which got a bunch of pushback when that was suggested before. It's also a bit counter to the intent of diagnostics_channel as it's meant to be a passive observer and not a way to mutate state. The ultimate plan, though it will take a long time, is to get modules themselves to start producing diagnostics_channel data themselves and then use module load events to know when we need to attach subscribers to channels from a particular module. That is a far off future so we need a solution for the immediate term or Node.js 20 users will not be able to use application tracers with ESM.

@Qard
Copy link
Member

Qard commented May 6, 2023

Side note: This breaks OpenTelemetry too as they also rely on import-in-the-middle for ESM support.

@GeoffreyBooth
Copy link
Member

we need a solution for the immediate term or Node.js 20 users will not be able to use application tracers with ESM.

I think the immediate solution is for IITM to add acorn or another library that can get the exported names through parsing. Maybe cache them to disk for performance. That should work right now, without any changes to Node, and then we can look into better solutions.

@targos
Copy link
Member

targos commented May 6, 2023

Using import from within the load hook really feels like something that worked by chance.

@Qard
Copy link
Member

Qard commented May 7, 2023

I think the immediate solution is for IITM to add acorn or another library that can get the exported names through parsing. Maybe cache them to disk for performance. That should work right now, without any changes to Node, and then we can look into better solutions.

Rewriting to use acorn would unfortunately probably be more work than is worthwhile for how few people are actually using ESM. It's also a rather large dependency to add just for that. I worry if that's the only fix we have it might be left broken for awhile... 😐

@bengl Do we have any bandwidth to do something about this given our other priorities this quarter?

@giltayar
Copy link
Contributor

giltayar commented May 7, 2023

May I suggest option 6? Have the Hook API call itself do the import (i.e. not the loader, but the API), and send information about the module namespace to the loader, which, when the module is loaded, use it to rewrite the module source as usual.

In this option import of the module occurs in the main thread, as it should, and not in the worker thread, which, as @targos suggested, should not really be something that is done.

Granted, communication between API and loader just got more difficult because they're in separate threads, but it is possible.

@JakobJingleheimer
Copy link
Member

May I suggest option 6? Have the Hook API call itself do the import (i.e. not the loader, but the API), and send information about the module namespace to the loader, which, when the module is loaded, use it to rewrite the module source as usual.

I think this would work but is what I suggested would be a non-trivial performance hit for something that almost no-one needs. But also, if I understand correctly, it is not dependable until the chain has finished because a subsequent loader could completely change the source, making the info retrieved from that import erroneous. Or are you suggesting we do this as the last step internally after the load chain has finished (eg in CustomizeoduleLoader::load() after final output validation)? (If so, how and to what would that info be surfaced? It would be too late for a custom loader, thus making it too late to change source)

@giltayar
Copy link
Contributor

giltayar commented May 7, 2023

@JakobJingleheimer my suggestion was a change in the IITM package, not the Node.js loader code.

@bengl
Copy link
Member Author

bengl commented May 9, 2023

@Qard It's not particularly hard to use acorn. A PoC shouldn't take terribly long. As for capacity this quarter, unblocking our users is a priority.

I think just to unblock our users here, we're going to have to implement option 1. That said, I don't think it's the ideal solution.

@giltayar Your option 6 seems like it might plausibly work. It seems like there may be some chicken-and-egg problems with it though, but I think those can be worked around. I'll play around with this, after implementing option 1 to at least unblock our users for now.

@GeoffreyBooth
Copy link
Member

@giltayar Your option 6 seems like it might plausibly work. It seems like there may be some chicken-and-egg problems with it though, but I think those can be worked around. I’ll play around with this, after implementing option 1 to at least unblock our users for now.

However you were able to do it before the off-thread change, you should still be able to do it after the off-thread change as long as you use a communications channel between the threads, which exists. It’s not very clear to me (or @targos, apparently) how the previous method worked exactly, so maybe you don’t want to patch over the off-thread stuff to recreate the previous approach, but it should be doable.

Something else to consider as you’re attempting option 1 is that you could use a parser other than acorn. There are lots of native ones now, like swc or esbuild; you could look at what Vite or its plugins use, for example, to see if there are faster libraries available now. Node itself uses https://github.com/nodejs/cjs-module-lexer for getting the named exports of CommonJS modules, and that’s written in Wasm.

I don’t really follow how the old process worked; does it dynamically import a module to read its exported names, and then that import is discarded? So then when the code does the “real” import, that parsing and evaluating is all done again or partially done again (like maybe the parsing is still in V8’s cache but the evaluation happens twice)? If there’s any inefficiency in the old method, it could very well be that a good parser might be faster overall than option 6.

@bengl
Copy link
Member Author

bengl commented Jun 2, 2023

I implemented option 1 to unblock our users in import-in-the-middle@1.4.0. I do think it's worth experimenting with passing the loaded exports from the main thread to the loader thread as suggest by @giltayar. We'll also have a look at alternative parsers when we've got some time for that.

That said, another alternative (that I doubt would be popular, and is probably very non-trivial) is to provide a mechanism within Node.js for bypassing the immutability of exports, eliminating the need for IITM altogether.

@GeoffreyBooth
Copy link
Member

That said, another alternative (that I doubt would be popular, and is probably very non-trivial) is to provide a mechanism within Node.js for bypassing the immutability of exports, eliminating the need for IITM altogether.

I vaguely recall something about the “linking” step, where the various exports of each module are attached to each other, as happening via calls to V8; and that V8 doesn’t allow these (or the modules referenced) to be changed afterward. That’s why hot module reload in ESM has been such a challenge, because we lack APIs in V8 to modify modules once they’re loaded in.

I think module parsing is a fine solution and should work well for many users, especially if you’re using native parsers like cjs-module-lexer. I would imagine that most users of apps with instrumentation libraries aren’t all that concerned about initial startup time, as I’d guess that people are mostly instrumenting long-running servers. I think the long-term solution is to eventually migrate away from mocking/proxying to using the diagnostics channel? I would think that parsing should be fine until then.

I’ll close because it seems like this is resolved, but happy to reopen if there’s anything else.

@GeoffreyBooth
Copy link
Member

@bengl I had a thought regarding this: have you considered doing the import-in-the-middle (and/or require-in-the-middle) patching at install time rather than at runtime? Something like npm install && npx datadog add-instrumentation where it searches through the dependencies and rewrites the JavaScript files the same way that import/require-in-the-middle currently patches/proxies them at runtime.

As I see it, besides the obvious benefit of making instrumentation of ESM just as good as instrumentation of CommonJS, this has a few other advantages:

  • You could use real import() to load the module, since this is a one-off script rather than the startup of the app.
  • There would be no performance cost at startup; I would think that this approach would improve startup time even for all-CommonJS apps using require-in-the-middle.
  • The code that’s run would match the code that’s on disk, so debugging is more straightforward.

Yes you’re modifying the code on disk, but I don’t see how it’s any better to modify code at runtime than ahead of time. It introduces some slight complexity if you want to turn it off, like to enable patching only in production say, but this feels manageable. Another consideration that comes to mind is the Node version; at runtime you know exactly what version you’re using, but at install/build time a different version might be in use and you’d want to ensure that the patching done is for the intended Node version. And of course you need to do the work to implement this, of using something like Acorn to be able to patch these JavaScript files safely, but it sounds like you might have already written that code for the ESM version.

@bengl
Copy link
Member Author

bengl commented Dec 5, 2023

@GeoffreyBooth for some users, that might make sense. Others may prefer the node_modules directory to be untampered for various reasons, including security scanning. Making it optional seems reasonable, defaulting to patching at runtime. The startup perf win ought to be huge, especially for serverless apps. We already support an esbuild hook that inserts instrumentation, and so this would be a similar approach. Of course, none of this will work for builtin modules, but there are far fewer of those.

For most instrumentations, Node.js version isn't particularly relevant, since the instrumented code doesn't change much, but if it does come up, we can add another switch for that.

🤔 We'll certainly consider this.

One other thing to note here is that as native ESM has become more popular for Node.js apps, APM vendors have had to catch up here, and today at least three other vendors (plus OpenTelemetry and distros of it), all use import-in-the-middle, so any approach here would have to be as vendor-agnostic as possible.

@GeoffreyBooth
Copy link
Member

GeoffreyBooth commented Dec 5, 2023

Of course, none of this will work for builtin modules, but there are far fewer of those.

You also can patch those without needing the hooks API and without monkey-patching the CommonJS loader. Something like node --require datadog app.js and then part of the code that runs on startup of datadog includes

const { patchedFs, patchedHttp } = require('datadog/patchedBuiltins');
const fs = require('node:fs');
const http = require('node:http'); // And so on for all the builtins you want to patch
const { syncBuiltinESMExports } = require('node:module');

for (const api of patchedFs) {
  fs[api] = patchedFs[api];
}
for (const api of patchedHttp) {
  http[api] = patchedHttp[api];
}

syncBuiltinESMExports();

See https://nodejs.org/api/module.html#modulesyncbuiltinesmexports, which is the API to sync whatever patching you do to builtins in CommonJS to also apply to ESM. It dates all the way back to Node 12, when we originally unflagged the ESM implementation.

And going forward, I would hope that gradually you shouldn’t need to patch any builtins as the ones you care about all get support for Diagnostics Channel. Obviously that won’t help for old Node versions, but hopefully the real version of the above script can exclude certain builtins based on the Node version.

@GeoffreyBooth
Copy link
Member

Others may prefer the node_modules directory to be untampered for various reasons

Besides patching builtins, what else do you patch? Is it essentially a predefined list of modules and exports, like “for express patch get, post, etc.; for fastify patch request, etc.”?

Because if so, could you perhaps just publish to NPM the patched versions of those libraries? Like @datadog-patched/express or something, and the user installs that instead of express. Then there’s no modification after install. The user can even install it aliased as the real thing, via npm i express@npm:@datadog-patched/express, so that any dependencies that reference plain express get the patched version.

I’m the maintainer of CoffeeScript, which has always allowed things like coffee app.coffee to transpile and evaluate at runtime, but it’s also always been strongly discouraged to do so for production apps. (It’s intended more for the shell script use case.) The official TypeScript library doesn’t even offer an equivalent; you need something like the third-party ts-node to do this for TypeScript. Coming from this background, it’s striking to me that instrumentation vendors prefer to do things at runtime, since tools like TypeScript and Babel and so on almost universally encourage an ahead-of-time build step rather than runtime transformations. Though I understand that some things can only be done at runtime.

Yet another option could be to add the instrumentation as part of some existing build process, so that you’re not modifying the node_modules folder but the patches are included in the code that gets saved and run. Is that what your esbuild plugin does?

@Qard
Copy link
Member

Qard commented Dec 5, 2023

Transitive dependencies are often a problem there. We patch many things like body-parser and find-my-way which are generally way down the dependency tree somewhere which makes it difficult to tell users to use that when they're using it indirectly through something else.

Also, the build step approach would be complicated. There are cases where patches are specific to Node.js version or OS and there are many cases like Lambda where the build doesn't necessarily happen in an environment that matches the target machine. There are also cases where we need to patch native binding code, such as with postgres, which doesn't exist in any useful form until runtime. We will always require a bunch of runtime things, we can't really escape that.

@bengl
Copy link
Member Author

bengl commented Dec 5, 2023

Besides patching builtins, what else do you patch? Is it essentially a predefined list of modules and exports, like “for express patch get, post, etc.; for fastify patch request, etc.”?

Yep. Currently we do this in our library code (i.e. on the main thread) rather than in the loader, and the loader simply enables that to happen, but that's an implementation detail.

Because if so, could you perhaps just publish to NPM the patched versions of those libraries?

That could work, and is even the norm for some other languages we support. For Node.js though, one of our design goals is to be minimally invasive to developer workflows, and this approach would be contrary to that. For example, many of our users aren't developers and are instead part of devops teams or observability teams who don't have access to the app in question apart from, say, a docker image. (Of course, the same problem can apply to modifying node_modules on disk, depending on how apps are deployed, but that's also why I mentioned it ought to be optional.) In general, operating as transparently as possible is part of our design goals.

Yet another option could be to add the instrumentation as part of some existing build process, so that you’re not modifying the node_modules folder but the patches are included in the code that gets saved and run. Is that what your esbuild plugin does?

Exactly. For folks who bundle with esbuild, we provide this as the preferred approach. (Adding support for other bundlers is on our roadmap.)

Transitive dependencies are often a problem there.

Nothing would stop us from publishing versions of fastify, etc., that depend on the patched versions of find-my-way, etc., but yeah that's even more work.

We will always require a bunch of runtime things, we can't really escape that.

💯

@GeoffreyBooth
Copy link
Member

Besides patching builtins, what else do you patch? Is it essentially a predefined list of modules and exports, like “for express patch get, post, etc.; for fastify patch request, etc.”?

Yep.

How long is this list? Perhaps the patched versions could live in your library, and within the load hook when it’s like “oh here’s body-parser/index.js, replace that with ./node_modules/datadog/patched/body-parser/index.js.“ Node could provide a helper for “get the package.json metadata for this URL” so you could know the version of body-parser that’s being loaded.

@Qard
Copy link
Member

Qard commented Dec 6, 2023

Nothing would stop us from publishing versions of fastify, etc., that depend on the patched versions of find-my-way, etc., but yeah that's even more work.

The problem with that is we currently get support for a lot of things for free because X thing we don't know/care about depends on Y thing we do have instrumentation for. This model would change that though. We would need to make custom versions with patched dependencies of tons of things we don't have any explicit support for which, to me, makes that idea seem likely untenable. The install-time patching approach is probably more realistic, but also not without its issues. I'm not sure there's a good solution other than just having runtime mutable modules, which is what I had intended to do with diagnostics_channel events of the module loading lifecycle as loaders are really designed around source text modification, not runtime object modification.

@jsumners-nr
Copy link

Regarding this idea of shipping patched source files within the APM packages's payload and returning them from the resolve step:

  1. This sounds like a minefield around complying with licenses.
  2. It's a lot of work to keep track of all various changes that can occur in each supported library in order to ship them in the APM's package. Wrapping up the instrumented thing's whatever avoids this because the APM library is always using whatever code ships from the instrumented thing.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
discuss Issues opened for discussions and feedbacks. esm Issues and PRs related to the ECMAScript Modules implementation. known limitation Issues that are identified as known limitations. loaders Issues and PRs related to ES module loaders
Projects
None yet
Development

No branches or pull requests

8 participants