
Clear the internal require statCache on an unsuccessful module load? #31803

Closed
benjamingr opened this issue Feb 14, 2020 · 24 comments
Labels
module Issues and PRs related to the module subsystem.

Comments

@benjamingr
Member

Currently, it is impossible to write code like:

const { execSync } = require('child_process');

let Bluebird;
try {
  Bluebird = require('bluebird');
} catch (e) {
  execSync('npm install bluebird');
  Bluebird = require('bluebird');
}

This is because the cjs loader caches the results of stat calls internally, so it would not look for bluebird the second time we require it.
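The failure mode can be modeled in a few lines. The sketch below is purely illustrative (a plain Map standing in for the loader's internal stat cache; none of these names are Node internals), but it shows how a cached miss survives a later install:

```javascript
// Toy model of negative stat caching (illustrative only, not Node's
// actual loader internals).
const statCache = new Map();
const disk = new Set(); // simulates what is actually on the filesystem

function stat(path) {
  if (statCache.has(path)) return statCache.get(path); // cached hit OR miss
  const result = disk.has(path) ? 0 : -1; // 0 = exists, -1 = missing
  statCache.set(path, result);
  return result;
}

function toyRequire(path) {
  if (stat(path) !== 0) {
    const err = new Error(`Cannot find module '${path}'`);
    err.code = 'MODULE_NOT_FOUND';
    throw err;
  }
  return `exports of ${path}`;
}

// First require fails and caches the miss...
try { toyRequire('/node_modules/bluebird'); } catch (e) { console.log(e.code); }
// ...'npm install bluebird' then puts the package on disk...
disk.add('/node_modules/bluebird');
// ...but the cached miss still wins, so the retry fails as well.
try { toyRequire('/node_modules/bluebird'); } catch (e) { console.log(e.code); }
```

Clearing the toy `statCache` (or just the stale entry) between the two attempts is exactly the effect the proposal above would have.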

It would be pretty easy to deal with this use case from a technical point of view by setting statCache to a new Map() on a MODULE_NOT_FOUND.

I can see several use cases like:

  • Writing repls in userland
  • Writing a "lazy loading" require hook that recovers from 'module not found' errors by running `npm install`

Would people be in favor of this change and enabling this?

cc @nodejs/modules

cc @BridgeAR because of #26928 that enables this specifically for the REPL.

@BridgeAR
Member

That approach sounds fine to me. It would hurt the loading performance of all subsequent calls, though. I'll check if it is possible to isolate the failed module's stats and only clear those.

@benjamingr added the module label (Issues and PRs related to the module subsystem) on Feb 14, 2020
@ljharb
Member

ljharb commented Feb 14, 2020

How would this be similar to ESM behavior for a similar pattern?

@mscdex
Contributor

mscdex commented Feb 14, 2020

Perhaps there should be an explicit API to do this instead of always doing it, so that you opt-in to any potential performance issues.

@weswigham

Rather than clearing the cache on failure (which means no longer caching failed lookups which makes repeated failed lookups very expensive), why not clear the cache on invocation of a process, fs, or child_process API (as those are liable to trigger an fs write), or just mark it for some kind of lazy re-validation on the next attempt? (In addition to providing a sort of manual clear opt-in.)

How would this be similar to ESM behavior for a similar pattern?

So for an esm equivalent:

import { execSync } from 'child_process';

async function main() {
  let Bluebird;
  try {
    Bluebird = await import('bluebird');
  } catch (e) {
    execSync('npm install bluebird');
    Bluebird = await import('bluebird');
  }
}
main();

since each dynamic import invocation is a unique sub-module-graph, the second invocation of import is completely unrelated to the first and uses a separate cache. This isn't really tenable for cjs, and it only works for esm because of the upfront graph traversal (which allows everything in the graph to share a cache). This does mean that if the community starts to rely heavily on dynamic import, performance might eventually be bad, since things aren't cached between imports, but... eh?

@devsnek
Member

devsnek commented Feb 15, 2020

since each dynamic import invocation is a unique sub-module-graph, the second invocation of import is completely unrelated to the first and uses a separate cache.

Just to be explicit, import() is required by spec to cache if it succeeds (on a per-module basis, and node actually has a stricter behavior because it caches across all modules). If it fails, it is not required to cache the failure, and node doesn't cache the failure.

@mscdex
Contributor

mscdex commented Feb 15, 2020

Rather than clearing the cache on failure (which means no longer caching failed lookups which makes repeated failed lookups very expensive), why not clear the cache on invocation of a process, fs, or child_process API

-1 That's way too magical

@mcollina
Member

I’m -1 on clearing the cache. A lot of modules use a “conditional loading pattern” and this change is going to slow their startup time significantly.

@bnoordhuis
Member

Apart from technical considerations, the example introduces a kind of non-determinism (the install can fail, or not) that shouldn't be encouraged. The time to install packages is at install time, not run time.

@targos
Member

targos commented Feb 15, 2020

Does require.resolve have the same restriction? If not, it can be used to check whether the module is there before trying to require it.

@benjamingr
Member Author

@targos

Does require.resolve have the same restriction?

It does, it goes through Module._findPath and uses the stat cache.

@bnoordhuis

The time for installing packages is at install time, not run time.

Package installation is just one use case - the only reason Node works this way at the moment is because we have caching for performance reasons and that is observable. Moreover we have already worked around it in the REPL.

@mcollina

Hmm, if it only invalidated the entries it created for the problematically required module rather than the whole cache on an error would it help?

@mscdex

Perhaps there should be an explicit API to do this instead of always doing it, so that you opt-in to any potential performance issues.

I tend to agree, though I believe the current behavior exists only because our performance optimization leaks a behavioral change.

@ljharb
Member

ljharb commented Feb 15, 2020

Your OP only has a use case for runtime install, or a repl; are there others you have in mind?

@DerekNonGeneric
Contributor

If you know that the specifier and package name match, as they do in the OP, you could use an alternative method (not resolve) of checking whether the package exists.

// index.js

const {accessSync} = require('fs');
const {execSync} = require('child_process');
const node_modules = './node_modules/';
const specifier = 'bluebird';

let Bluebird;
try {
  // does package exist?
  accessSync(node_modules + specifier);
  // yes, so require it
  Bluebird = require(specifier);
} catch (e) {
  // no, so install it...
  execSync('npm install ' + specifier);
  // ...and require it
  Bluebird = require(specifier);
}

console.log(Bluebird);

@mcollina
Member

Hmm, if it only invalidated the entries it created for the problematically required module rather than the whole cache on an error would it help?

It would not cause significant performance issues in that case.
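A rough sketch of that narrower invalidation, again using a toy cache and made-up candidate paths rather than the real loader internals: on failure, only the entries recorded while resolving the failing request are evicted.

```javascript
// Toy per-request invalidation: on MODULE_NOT_FOUND, evict only the
// stat entries this one lookup added, leaving everything else cached.
const statCache = new Map();

function resolveWith(cache, request, disk) {
  const touched = [];
  const candidates = [`/app/node_modules/${request}`, `/node_modules/${request}`];
  for (const path of candidates) {
    if (!cache.has(path)) {
      cache.set(path, disk.has(path) ? 0 : -1); // 0 = exists, -1 = missing
      touched.push(path);
    }
    if (cache.get(path) === 0) return path;
  }
  // Failure: drop only what this lookup cached; entries for other
  // modules (the common conditional-require pattern) stay warm.
  for (const path of touched) cache.delete(path);
  const err = new Error(`Cannot find module '${request}'`);
  err.code = 'MODULE_NOT_FOUND';
  throw err;
}

const disk = new Set();
try { resolveWith(statCache, 'bluebird', disk); } catch (e) { console.log(e.code); }
disk.add('/app/node_modules/bluebird'); // simulate 'npm install bluebird'
console.log(resolveWith(statCache, 'bluebird', disk)); // resolves this time
```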

@LMojica

LMojica commented Feb 20, 2020

This problem also affects the case where the entire node_modules directory is created from within the running Node process. In that case, there doesn't seem to be any workaround.

If you do something like what OP is doing:

let Bluebird;
try { 
  Bluebird = require('bluebird');
} catch (e) { 
  execSync('npm install bluebird');
  Bluebird = require('bluebird');
} 

but you do it when there is no node_modules directory in your cwd, running execSync('npm install bluebird'); will create the directory and install the module, but the statCache will already have cached that path as non-existent. The second call to require('bluebird'); will check the statCache, see that that path is already in the cache, and won't bother checking the node_modules directory.

@ljharb
Member

ljharb commented Feb 20, 2020

What’s the use case for “npm install hasn’t been run first”? Can you elaborate?

@devsnek
Member

devsnek commented Feb 20, 2020

I think in a really general sense, these caches shouldn't be updated for failure cases (obviously easier to say than implement, but I think that's what we should aim for)

@bnoordhuis
Member

That pessimizes a pattern that is fairly common in larger applications:

try {
  var debug = require('debug');  // just an example, of course
} catch (e) {
  var debug = () => {};
}

The bigger the program, the more occurrences of that pattern pop up, and the more it is punished if the results aren't in the stat cache.

@ljharb
Member

ljharb commented Feb 20, 2020

Is that pattern often done for the same package a great many times in the graph, and also not at module level/require time?

@bnoordhuis
Member

Yes and yes (I think - I'm not completely sure what you mean by "also not at module level/require time".)

@ljharb
Member

ljharb commented Feb 20, 2020

Like, is that done inside function bodies, as opposed to at the top level of modules (where it'd happen once and be cached after that, and thus the perf hit wouldn't be significant)

@bnoordhuis
Member

Ah, like that. I think it only matters at the top level because the stat cache is cleared when control returns to the event loop.

I'm currently looking at a midsize application that has two dozen instances of the above pattern (that I can find) on a total of 1,200 require() calls, so about 2% of the total.

An unscientific benchmark suggests that not caching the result makes such require() calls 2.8x slower. That's not as bad as I feared.

There's a caveat though: require() searches upwards, therefore deeper directory structures are penalized more heavily.

@ljharb
Member

ljharb commented Feb 20, 2020

In that case, someone needing to do a conditional require can already do it asynchronously, just not synchronously?

@benjamingr
Member Author

What if we only retry failed attempts without the statCache? That way we would only pay the penalty of skipping the cache on failed attempts.
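A toy sketch of that idea (illustrative names only, not the real loader): consult the stat cache first, and on a miss retry with the cache bypassed, so only requires that were going to fail pay the uncached cost.

```javascript
// Toy resolver: cached fast path first, uncached retry on failure.
function statUncached(path, disk) {
  return disk.has(path) ? 0 : -1; // 0 = exists, -1 = missing
}

function makeResolver() {
  const statCache = new Map();
  const statCached = (path, disk) => {
    if (!statCache.has(path)) statCache.set(path, statUncached(path, disk));
    return statCache.get(path);
  };
  return function resolve(request, disk) {
    const path = `/node_modules/${request}`;
    if (statCached(path, disk) === 0) return path; // normal cached hit
    // Retry without consulting (or repopulating) the stat cache, so a
    // package installed after the first failure is picked up.
    if (statUncached(path, disk) === 0) return path;
    const err = new Error(`Cannot find module '${request}'`);
    err.code = 'MODULE_NOT_FOUND';
    throw err;
  };
}

const resolve = makeResolver();
const disk = new Set();
try { resolve('bluebird', disk); } catch (e) { console.log(e.code); }
disk.add('/node_modules/bluebird'); // simulate 'npm install bluebird'
console.log(resolve('bluebird', disk)); // '/node_modules/bluebird'
```

Successful requires keep full caching; only the failure path goes back to disk.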

benjamingr pushed a commit to benjamingr/io.js that referenced this issue Feb 23, 2020
@benjamingr
Member Author

I am going to go ahead and close this since it has gone stale. If anyone feels strongly about supporting this, please feel free to reopen!
