-
Notifications
You must be signed in to change notification settings - Fork 29.6k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Support for file-system based persistent code cache in user-land module loaders #47472
Comments
cc @nodejs/startup @nodejs/loaders |
For completeness, there's also https://www.npmjs.com/package/v8-compile-cache-lib with another 9 million weekly downloads; this one is used by ts-node and others. @cspotcode Given how many short lived node processes there are out there, having this sort of thing enabled globally feels like it could be a good idea (depending on the downsides). |
Code cache corruption is an issue though. V8 only performs the lightest of sanity checks. Bad inputs will crash the process, or worse. It opens up new attack vectors. |
A singular DB instead of a file per cache entry would likely be better I
would imagine. Avoids things like .pyc mismatches and can be used to add
extra checks as a container format. Attack vector introduction is real but
should likely be discussed since it might be simple to find an adequate
mitigation even if it requires opt-in.
…On Sat, Apr 8, 2023, 4:50 AM Ben Noordhuis ***@***.***> wrote:
Code cache corruption is an issue though. V8 only performs the lightest of
sanity checks. Bad inputs will crash the process, or worse. It opens up new
attack vectors.
—
Reply to this email directly, view it on GitHub
<#47472 (comment)>, or
unsubscribe
<https://github.com/notifications/unsubscribe-auth/AABZJI5CVZNMONQDGCR32O3XAEYHFANCNFSM6AAAAAAWW757XE>
.
You are receiving this because you are on a team that was mentioned.Message
ID: ***@***.***>
|
I can't think of anything new that's out of the scope in our threat model - to feed bad input to the module loader, the attacker needs to have access to the cache directory and corrupt the cache on-disk. But if the attacker has that level of access to the file system, the integrity about the actual source files already can't be trusted - unless policy is enabled, but in that case we could take policy into account in the implementation, whereas it'd be harder for any user-land solutions to do this, no matter how popular they already are. And this seems to be an even better motivation to provide this in core because the existing popular user-land solutions with ~25M weekly downloads already monkey patch If the feature is opt-in, the mitigation against any new-found venerability that actually is in our security scope (even though I think that'd be unlikely given the reasons stated above) would also be simple - the user can just stop using it (e.g. unset the environment variable), and we can quickly make a security release by making it a noop until the vulnerability is addressed, and it shouldn't result in behavioral regression - the regression would only be on the module loading performance. |
One obvious angle is running node as setuid root, or otherwise running with elevated privileges (ex. capabilities on Linux.) |
@bnoordhuis Where is the threat coming from in those cases? How would this be different compared to e.g. |
We're being careful to ignore those environment variables when running as setuid root or a host of other things. That same caution should be applied when reading files from disk. |
Yes I agree though I don't see this as a new threat - I think whatever we need is probably already covered by P.S.: we don't use P.P.S.: On the other hand we have this popular package with ~25M weekly downloads in the ecosystem that does the something similar without taking elevated privileges into account...I would say implementing it in core could probably help improving the situation with the potential threat in the ecosystem P.P.P.S.: we should probably consider exposing |
The difference is that right now we only look at environment variables that are (from an external attacker's perspective) effectively immutable. Once you start loading files as a privileged process, you have to start worrying about things like symlink attacks. I'm not saying it's impossible to make it work but it increases the attack surface considerably and it's subtle enough that it's easier to get wrong than right. edit: and setuid root is just an example. I'm sure I can come up with half a dozen more attack vectors if I put my mind to it and security isn't even my primary line of work. A dedicated attacker should be able to come up with at least half a dozen more. |
There has been no activity on this feature request for 5 months and it is unlikely to be implemented. It will be closed 6 months after the last non-automated comment. For more information on how the project manages feature requests, please consult the feature request management document. |
@joyeecheung Can we continue on this issue? I think it is extremely beneficial |
It could maybe integrate with SQLite if we add that in #50169 |
I wonder what’s the path forward. Perhaps we need to make a decision about whether the security implications are big enough a concern to implement this (I’d say however since the v8-compile-cache package is effectively used a lot in the wild, the security implications are already there in the ecosystem and implementing something like that in core would ideally make it easier to manage). Tagging it TSC-agenda. |
@joyeecheung Could maybe a baby step be providing an API for userland to implement this without needing to monkey-patch the CommonJS loader? Either something along the lines of the module customization hooks (like register a file, or register a function to run as the hook) or a new API specifically for this. Like if we don’t yet know how we would implement the persist-to-disk part of this just yet, provide some kind of API where the user can provide functions for the “save to disk” and “load from disk” parts, and core handles the “load into VM” and “export from VM” parts. Even if we eventually figure out a solution for persistence within core, like #50169, having a way to customize it might still be useful so that users can implement things like a code cache shared across processes or even across servers. |
To me it seems like providing an user land API is actually a bigger chunk of design work, considering how complicated the loaders are, and also that a user land solution using the compilation hooks still need to work with other components that we do not expose (policy, permissions, etc.). The feature proposed in the OP is more of a black box, all that's exposed to users is just one magical environment variable, all the implementation details would be up to us so it involves less API design work. Note that providing a hook does not waive the security concerns by the way, it just transfer the concerns to user-land packages, which does not even have access to our internal permission model, policy manifest, and environment variable safe guards, and therefore it would be harder for them to provide a more robust solution. |
Could it be as simple as |
What would that json file look like though? If it's still a script path -> cache path mapping, it still doesn't remove the security concerns, just transfers all that to user land. This still requires an API to pass the cache out of Node.js core for the package to persist them (and there's no guarantee about the integrity of this cache). It doesn't sound like less work, and if our eventual goal is to provide this as a built-in feature so that it can be implemented robustly with e.g. the environment variable guards we have and in a more performant way, it's extra work that eventually lead to leaked internal steps that we have to maintain. |
Or maybe it should be a path to a folder. I don’t know, it’s just a reference to wherever the data should be saved, in whatever format we would want to save it in. What I’m suggesting here is that it is a built-in feature; the user tells Node where to save the data, but everything else is handled by Node and not customizable. It would be like we put I’m not sure what you’re referring to re environment variables. |
Currently blocked on https://chromium-review.googlesource.com/c/v8/v8/+/5401780 to fix import() support. It has landed, need to wait for a couple of days to confirm it's not regressing anyone before backporting. |
Opened #52535 |
Original commit message: [compiler] reset script details in functions deserialized from code cache During the serialization of the code cache, V8 would wipe out the host-defined options, so after a script id deserialized from the code cache, the host-defined options need to be reset on the script using what's provided by the embedder when doing the deserializing compilation, otherwise the HostImportModuleDynamically callbacks can't get the data it needs to implement dynamic import(). Change-Id: I33cc6a5e43b6469d3527242e083f7ae6d8ed0c6a Reviewed-on: https://chromium-review.googlesource.com/c/v8/v8/+/5401780 Reviewed-by: Leszek Swirski <[email protected]> Commit-Queue: Joyee Cheung <[email protected]> Cr-Commit-Position: refs/heads/main@{#93323} Refs: v8/v8@cd10ad7 PR-URL: #52535 Refs: #47472 Reviewed-By: Benjamin Gruenbaum <[email protected]> Reviewed-By: Yagiz Nizipli <[email protected]> Reviewed-By: Mohammed Keyvanzadeh <[email protected]>
This patch implements automatic on-disk code caching that can be enabled via an environment variable NODE_COMPILE_CACHE. When set, whenever Node.js compiles a CommonJS or a ECMAScript Module, it will use on-disk [V8 code cache][] persisted in the specified directory to speed up the compilation. This may slow down the first load of a module graph, but subsequent loads of the same module graph may get a significant speedup if the contents of the modules do not change. Locally, this speeds up loading of test/fixtures/snapshot/typescript.js from ~130ms to ~80ms. To clean up the generated code cache, simply remove the directory. It will be recreated the next time the same directory is used for `NODE_COMPILE_CACHE`. Compilation cache generated by one version of Node.js may not be used by a different version of Node.js. Cache generated by different versions of Node.js will be stored separately if the same directory is used to persist the cache, so they can co-exist. Caveat: currently when using this with V8 JavaScript code coverage, the coverage being collected by V8 may be less precise in functions that are deserialized from the code cache. It's recommended to turn this off when running tests to generate precise coverage. Implementation details: There is one cache file per module on disk. The directory layout is: - Compile cache directory (from NODE_COMPILE_CACHE) - 8b23c8fe: CRC32 hash of CachedDataVersionTag + NODE_VERESION - 2ea3424d: - 10860e5a: CRC32 hash of filename + module type - 431e9adc: ... - ... Inside the cache file, there is a header followed by the actual cache content: ``` [uint32_t] code size [uint32_t] code hash [uint32_t] cache size [uint32_t] cache hash ... compile cache content ... ``` When reading the cache file, we'll also check if the code size and code hash match the code that the module loader is loading and whether the cache size and cache hash match the file content read. If they don't match, or if V8 rejects the cache passed, we'll ignore the mismatch cache, and regenerate the cache after compilation succeeds and rewrite it to disk. PR-URL: #52535 Refs: #47472 Reviewed-By: Benjamin Gruenbaum <[email protected]> Reviewed-By: Yagiz Nizipli <[email protected]> Reviewed-By: Mohammed Keyvanzadeh <[email protected]>
Fixed by #52535 |
Original commit message: [compiler] reset script details in functions deserialized from code cache During the serialization of the code cache, V8 would wipe out the host-defined options, so after a script id deserialized from the code cache, the host-defined options need to be reset on the script using what's provided by the embedder when doing the deserializing compilation, otherwise the HostImportModuleDynamically callbacks can't get the data it needs to implement dynamic import(). Change-Id: I33cc6a5e43b6469d3527242e083f7ae6d8ed0c6a Reviewed-on: https://chromium-review.googlesource.com/c/v8/v8/+/5401780 Reviewed-by: Leszek Swirski <[email protected]> Commit-Queue: Joyee Cheung <[email protected]> Cr-Commit-Position: refs/heads/main@{#93323} Refs: v8/v8@cd10ad7 PR-URL: nodejs#52535 Refs: nodejs#47472 Reviewed-By: Benjamin Gruenbaum <[email protected]> Reviewed-By: Yagiz Nizipli <[email protected]> Reviewed-By: Mohammed Keyvanzadeh <[email protected]>
Original commit message: [compiler] reset script details in functions deserialized from code cache During the serialization of the code cache, V8 would wipe out the host-defined options, so after a script id deserialized from the code cache, the host-defined options need to be reset on the script using what's provided by the embedder when doing the deserializing compilation, otherwise the HostImportModuleDynamically callbacks can't get the data it needs to implement dynamic import(). Change-Id: I33cc6a5e43b6469d3527242e083f7ae6d8ed0c6a Reviewed-on: https://chromium-review.googlesource.com/c/v8/v8/+/5401780 Reviewed-by: Leszek Swirski <[email protected]> Commit-Queue: Joyee Cheung <[email protected]> Cr-Commit-Position: refs/heads/main@{#93323} Refs: v8/v8@cd10ad7 PR-URL: #52535 Refs: #47472 Reviewed-By: Benjamin Gruenbaum <[email protected]> Reviewed-By: Yagiz Nizipli <[email protected]> Reviewed-By: Mohammed Keyvanzadeh <[email protected]> PR-URL: #52293 Reviewed-By: Moshe Atlow <[email protected]> Reviewed-By: Rafael Gonzaga <[email protected]> Reviewed-By: Richard Lau <[email protected]>
Original commit message: [compiler] reset script details in functions deserialized from code cache During the serialization of the code cache, V8 would wipe out the host-defined options, so after a script id deserialized from the code cache, the host-defined options need to be reset on the script using what's provided by the embedder when doing the deserializing compilation, otherwise the HostImportModuleDynamically callbacks can't get the data it needs to implement dynamic import(). Change-Id: I33cc6a5e43b6469d3527242e083f7ae6d8ed0c6a Reviewed-on: https://chromium-review.googlesource.com/c/v8/v8/+/5401780 Reviewed-by: Leszek Swirski <[email protected]> Commit-Queue: Joyee Cheung <[email protected]> Cr-Commit-Position: refs/heads/main@{#93323} Refs: v8/v8@cd10ad7 PR-URL: nodejs#52535 Refs: nodejs#47472 Reviewed-By: Benjamin Gruenbaum <[email protected]> Reviewed-By: Yagiz Nizipli <[email protected]> Reviewed-By: Mohammed Keyvanzadeh <[email protected]>
Original commit message: [compiler] reset script details in functions deserialized from code cache During the serialization of the code cache, V8 would wipe out the host-defined options, so after a script id deserialized from the code cache, the host-defined options need to be reset on the script using what's provided by the embedder when doing the deserializing compilation, otherwise the HostImportModuleDynamically callbacks can't get the data it needs to implement dynamic import(). Change-Id: I33cc6a5e43b6469d3527242e083f7ae6d8ed0c6a Reviewed-on: https://chromium-review.googlesource.com/c/v8/v8/+/5401780 Reviewed-by: Leszek Swirski <[email protected]> Commit-Queue: Joyee Cheung <[email protected]> Cr-Commit-Position: refs/heads/main@{#93323} Refs: v8/v8@cd10ad7 PR-URL: #52535 Refs: #47472 Reviewed-By: Benjamin Gruenbaum <[email protected]> Reviewed-By: Yagiz Nizipli <[email protected]> Reviewed-By: Mohammed Keyvanzadeh <[email protected]> PR-URL: #52293 Reviewed-By: Moshe Atlow <[email protected]> Reviewed-By: Rafael Gonzaga <[email protected]> Reviewed-By: Richard Lau <[email protected]>
Original commit message: [compiler] reset script details in functions deserialized from code cache During the serialization of the code cache, V8 would wipe out the host-defined options, so after a script id deserialized from the code cache, the host-defined options need to be reset on the script using what's provided by the embedder when doing the deserializing compilation, otherwise the HostImportModuleDynamically callbacks can't get the data it needs to implement dynamic import(). Change-Id: I33cc6a5e43b6469d3527242e083f7ae6d8ed0c6a Reviewed-on: https://chromium-review.googlesource.com/c/v8/v8/+/5401780 Reviewed-by: Leszek Swirski <[email protected]> Commit-Queue: Joyee Cheung <[email protected]> Cr-Commit-Position: refs/heads/main@{#93323} Refs: v8/v8@cd10ad7 PR-URL: nodejs#52535 Refs: nodejs#47472 Reviewed-By: Benjamin Gruenbaum <[email protected]> Reviewed-By: Yagiz Nizipli <[email protected]> Reviewed-By: Mohammed Keyvanzadeh <[email protected]>
Original commit message: [compiler] reset script details in functions deserialized from code cache During the serialization of the code cache, V8 would wipe out the host-defined options, so after a script id deserialized from the code cache, the host-defined options need to be reset on the script using what's provided by the embedder when doing the deserializing compilation, otherwise the HostImportModuleDynamically callbacks can't get the data it needs to implement dynamic import(). Change-Id: I33cc6a5e43b6469d3527242e083f7ae6d8ed0c6a Reviewed-on: https://chromium-review.googlesource.com/c/v8/v8/+/5401780 Reviewed-by: Leszek Swirski <[email protected]> Commit-Queue: Joyee Cheung <[email protected]> Cr-Commit-Position: refs/heads/main@{#93323} Refs: v8/v8@cd10ad7 PR-URL: #52535 Refs: #47472 Reviewed-By: Benjamin Gruenbaum <[email protected]> Reviewed-By: Yagiz Nizipli <[email protected]> Reviewed-By: Mohammed Keyvanzadeh <[email protected]> PR-URL: #52465 Reviewed-By: Matteo Collina <[email protected]> Reviewed-By: Rafael Gonzaga <[email protected]> Reviewed-By: Michael Dawson <[email protected]>
Original commit message: [compiler] reset script details in functions deserialized from code cache During the serialization of the code cache, V8 would wipe out the host-defined options, so after a script id deserialized from the code cache, the host-defined options need to be reset on the script using what's provided by the embedder when doing the deserializing compilation, otherwise the HostImportModuleDynamically callbacks can't get the data it needs to implement dynamic import(). Change-Id: I33cc6a5e43b6469d3527242e083f7ae6d8ed0c6a Reviewed-on: https://chromium-review.googlesource.com/c/v8/v8/+/5401780 Reviewed-by: Leszek Swirski <[email protected]> Commit-Queue: Joyee Cheung <[email protected]> Cr-Commit-Position: refs/heads/main@{#93323} Refs: v8/v8@cd10ad7 PR-URL: #52535 Refs: #47472 Reviewed-By: Benjamin Gruenbaum <[email protected]> Reviewed-By: Yagiz Nizipli <[email protected]> Reviewed-By: Mohammed Keyvanzadeh <[email protected]> PR-URL: #52465 Reviewed-By: Matteo Collina <[email protected]> Reviewed-By: Rafael Gonzaga <[email protected]> Reviewed-By: Michael Dawson <[email protected]>
This patch implements automatic on-disk code caching that can be enabled via an environment variable NODE_COMPILE_CACHE. When set, whenever Node.js compiles a CommonJS or a ECMAScript Module, it will use on-disk [V8 code cache][] persisted in the specified directory to speed up the compilation. This may slow down the first load of a module graph, but subsequent loads of the same module graph may get a significant speedup if the contents of the modules do not change. Locally, this speeds up loading of test/fixtures/snapshot/typescript.js from ~130ms to ~80ms. To clean up the generated code cache, simply remove the directory. It will be recreated the next time the same directory is used for `NODE_COMPILE_CACHE`. Compilation cache generated by one version of Node.js may not be used by a different version of Node.js. Cache generated by different versions of Node.js will be stored separately if the same directory is used to persist the cache, so they can co-exist. Caveat: currently when using this with V8 JavaScript code coverage, the coverage being collected by V8 may be less precise in functions that are deserialized from the code cache. It's recommended to turn this off when running tests to generate precise coverage. Implementation details: There is one cache file per module on disk. The directory layout is: - Compile cache directory (from NODE_COMPILE_CACHE) - 8b23c8fe: CRC32 hash of CachedDataVersionTag + NODE_VERESION - 2ea3424d: - 10860e5a: CRC32 hash of filename + module type - 431e9adc: ... - ... Inside the cache file, there is a header followed by the actual cache content: ``` [uint32_t] code size [uint32_t] code hash [uint32_t] cache size [uint32_t] cache hash ... compile cache content ... ``` When reading the cache file, we'll also check if the code size and code hash match the code that the module loader is loading and whether the cache size and cache hash match the file content read. If they don't match, or if V8 rejects the cache passed, we'll ignore the mismatch cache, and regenerate the cache after compilation succeeds and rewrite it to disk. PR-URL: #52535 Refs: #47472 Reviewed-By: Benjamin Gruenbaum <[email protected]> Reviewed-By: Yagiz Nizipli <[email protected]> Reviewed-By: Mohammed Keyvanzadeh <[email protected]>
So that we can use it to handle code caching in a central place. Drive-by: use per-isolate persistent strings for the parameters and mark GetHostDefinedOptions() since it's only used in one compilation unit PR-URL: #52016 Refs: #47472 Reviewed-By: Antoine du Hamel <[email protected]> Reviewed-By: Geoffrey Booth <[email protected]>
So that we can use it to handle code caching in a central place. Drive-by: use per-isolate persistent strings for the parameters and mark GetHostDefinedOptions() since it's only used in one compilation unit PR-URL: #52016 Refs: #47472 Reviewed-By: Antoine du Hamel <[email protected]> Reviewed-By: Geoffrey Booth <[email protected]>
So that we can use it to handle code caching in a central place. Drive-by: use per-isolate persistent strings for the parameters and mark GetHostDefinedOptions() since it's only used in one compilation unit PR-URL: nodejs#52016 Refs: nodejs#47472 Reviewed-By: Antoine du Hamel <[email protected]> Reviewed-By: Geoffrey Booth <[email protected]>
This refactors the code that compiles SourceTextModule for the built-in ESM loader to use a common routine so that it's easier to customize cache handling for the ESM loader. In addition this introduces a common symbol for import.meta and import() so that we don't need to create additional closures as handlers, since we can get all the information we need from the V8 callback already. This should reduce the memory footprint of ESM as well. PR-URL: nodejs#52291 Refs: nodejs#47472 Reviewed-By: Geoffrey Booth <[email protected]> Reviewed-By: Stephen Belanger <[email protected]>
Original commit message: [compiler] reset script details in functions deserialized from code cache During the serialization of the code cache, V8 would wipe out the host-defined options, so after a script id deserialized from the code cache, the host-defined options need to be reset on the script using what's provided by the embedder when doing the deserializing compilation, otherwise the HostImportModuleDynamically callbacks can't get the data it needs to implement dynamic import(). Change-Id: I33cc6a5e43b6469d3527242e083f7ae6d8ed0c6a Reviewed-on: https://chromium-review.googlesource.com/c/v8/v8/+/5401780 Reviewed-by: Leszek Swirski <[email protected]> Commit-Queue: Joyee Cheung <[email protected]> Cr-Commit-Position: refs/heads/main@{#93323} Refs: v8/v8@cd10ad7 PR-URL: nodejs#52535 Refs: nodejs#47472 Reviewed-By: Benjamin Gruenbaum <[email protected]> Reviewed-By: Yagiz Nizipli <[email protected]> Reviewed-By: Mohammed Keyvanzadeh <[email protected]> PR-URL: nodejs#52465 Reviewed-By: Matteo Collina <[email protected]> Reviewed-By: Rafael Gonzaga <[email protected]> Reviewed-By: Michael Dawson <[email protected]>
This refactors the code that compiles SourceTextModule for the built-in ESM loader to use a common routine so that it's easier to customize cache handling for the ESM loader. In addition this introduces a common symbol for import.meta and import() so that we don't need to create additional closures as handlers, since we can get all the information we need from the V8 callback already. This should reduce the memory footprint of ESM as well. PR-URL: nodejs#52291 Refs: nodejs#47472 Reviewed-By: Geoffrey Booth <[email protected]> Reviewed-By: Stephen Belanger <[email protected]>
This refactors the code that compiles SourceTextModule for the built-in ESM loader to use a common routine so that it's easier to customize cache handling for the ESM loader. In addition this introduces a common symbol for import.meta and import() so that we don't need to create additional closures as handlers, since we can get all the information we need from the V8 callback already. This should reduce the memory footprint of ESM as well. PR-URL: #52291 Backport-PR-URL: #53500 Refs: #47472 Reviewed-By: Geoffrey Booth <[email protected]> Reviewed-By: Stephen Belanger <[email protected]>
This stemmed from a Twitter thread. Specifically I am wondering if there are any concerns over having something similar to what https://github.com/zertosh/v8-compile-cache does in core, the general idea is:
This is also similar to what Chrome does with the V8 code cache.
The motivation for implementing this in core is that, for a user-land module to do this for CJS, it has to monkey patch the CJS loader, and this increases the compatibility burden (v8-compile-cache has 17M weekly downloads, and it needs to monkey-patch
Module.prototype._compile
to work. From a glance of its issue tracker it seems some of the issues are not really fixable in the user land either, like piping into the internal source maps cache). For ESM currently the user land can only use--loader
to customize the compilation, which has a cost of its own (especially when we move it to a separate thread), creating a disparity from CJS, and also even with--loader
I doubt if user-land code can integrate into e.g. the source map cache without asking us to expose too much internals.The most risky part of this feature might be the growth of the cache, but it seems manageable if:
This doesn't seem too radical, for example we already persist something like the repl history by default, and we also have features like
NODE_V8_COVERAGE
that does a similar "writing a lot of data to a directory when enabled" thing. I don't think this would increase the code complexity much either (we might also want a read-only version of this for SEA in the future too). So opening this issue to see if there are any concerns about having this in core before implementing it.The text was updated successfully, but these errors were encountered: