fs: buffer dir entries in opendir() #29893

Closed
wants to merge 6 commits into from

Conversation

Member

@addaleax addaleax commented Oct 8, 2019

Read up to 32 directory entries in one batch when dir.readSync()
or dir.read() are called.

This increases performance significantly, although it introduces
quite a bit of edge case complexity.
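To make the batching idea concrete, here is a minimal sketch in plain JavaScript. It is not the code from this PR: `BufferedReader`, `fetchBatch`, and `nativeCalls` are hypothetical names, and `fetchBatch` merely stands in for the expensive native call. Only the batch size of 32 comes from the description above.

```javascript
// Hypothetical sketch: none of these names come from the Node.js source.
const BATCH_SIZE = 32; // batch size named in the PR description

class BufferedReader {
  // fetchBatch(max) is an assumed stand-in for the expensive native call;
  // it returns an array of up to `max` entries, or [] when exhausted.
  constructor(fetchBatch) {
    this.fetchBatch = fetchBatch;
    this.buffer = [];
    this.nativeCalls = 0;
  }

  // Returns one entry per call, or null at the end, like dir.readSync().
  readSync() {
    if (this.buffer.length === 0) {
      this.nativeCalls++;
      this.buffer = this.fetchBatch(BATCH_SIZE);
      if (this.buffer.length === 0) return null;
    }
    return this.buffer.shift();
  }
}

// Usage: 70 fake entries need 4 batch calls (32 + 32 + 6, then one empty).
const names = Array.from({ length: 70 }, (_, i) => `file${i}`);
let offset = 0;
const reader = new BufferedReader((max) => {
  const batch = names.slice(offset, offset + max);
  offset += batch.length;
  return batch;
});

const seen = [];
let entry;
while ((entry = reader.readSync()) !== null) seen.push(entry);
console.log(seen.length, reader.nativeCalls);
```

The caller still receives one entry per call; only the number of trips to the underlying source changes, which is where the speedup would come from.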

                                                             confidence improvement accuracy (*)    (**)    (***)
 fs/bench-opendir.js mode='async' dir='lib' n=100                  ***    155.93 %      ±30.05% ±40.34%  ±53.21%
 fs/bench-opendir.js mode='async' dir='test/parallel' n=100        ***    479.65 %      ±56.81% ±76.47% ±101.32%
 fs/bench-opendir.js mode='sync' dir='lib' n=100                           10.38 %      ±14.39% ±19.16%  ±24.96%
 fs/bench-opendir.js mode='sync' dir='test/parallel' n=100         ***     63.13 %      ±12.84% ±17.18%  ±22.58%
Checklist
  • make -j4 test (UNIX), or vcbuild test (Windows) passes
  • tests and/or benchmarks are included
  • commit message follows commit guidelines

@nodejs-github-bot nodejs-github-bot added the c++ and fs labels Oct 8, 2019
@addaleax addaleax added the performance label Oct 8, 2019
@mscdex
Contributor

mscdex commented Oct 9, 2019

So this means that instead of 1 Dirent the APIs will now return an array of Dirents? If so, doesn't the documentation need to be updated? Also, wouldn't it be a good idea to have the number be configurable (via options passed to dir.read()/dir.readSync())?

@addaleax
Member Author

addaleax commented Oct 9, 2019

So this means that instead of 1 Dirent the APIs will now return an array of Dirents? If so, doesn't the documentation need to be updated?

No, the API is unaffected and will still only return one item at a time.

Also, wouldn't it be a good idea to have the number be configurable (via options passed to dir.read()/dir.readSync())?

Maybe? If you intentionally want a large number of directory entries at once you probably wouldn’t use the streaming API…
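As a quick illustration of the point that the public API still yields one entry at a time, here is a small sketch using only documented `fs` calls. The scratch directory and file names are made up for the example, and `fs.rmSync` assumes a reasonably recent Node.js version.

```javascript
const fs = require('fs');
const os = require('os');
const path = require('path');

// Scratch directory with three files (names are illustrative).
const tmp = fs.mkdtempSync(path.join(os.tmpdir(), 'opendir-demo-'));
for (const name of ['a.txt', 'b.txt', 'c.txt']) {
  fs.writeFileSync(path.join(tmp, name), '');
}

const dir = fs.opendirSync(tmp);
const results = [];
let dirent;
// Each readSync() call yields a single fs.Dirent, then null once exhausted.
while ((dirent = dir.readSync()) !== null) {
  results.push(dirent);
}
dir.closeSync();

// Directory order is not guaranteed, so sort before printing.
const sortedNames = results.map((d) => d.name).sort();
console.log(sortedNames);

// Clean up the scratch directory.
fs.rmSync(tmp, { recursive: true, force: true });
```

Any batching happens behind `readSync()`; the loop above is written the same way with or without it.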

@addaleax addaleax added the author ready label Oct 9, 2019
Member

@BridgeAR BridgeAR left a comment

This is pretty great work!

@addaleax addaleax added the blocked label and removed the author ready label Oct 9, 2019
Fishrock123 added a commit to Fishrock123/node that referenced this pull request Oct 9, 2019
This is unlikely to be necessary in any case, and causes much
unwarranted complexity when implementing further
optimizations.

Refs: nodejs#29893 (comment)
Fishrock123 added a commit that referenced this pull request Oct 9, 2019
This is unlikely to be necessary in any case, and causes much
unwarranted complexity when implementing further
optimizations.

Refs: #29893 (comment)

PR-URL: #29908
Reviewed-By: Colin Ihrig
Reviewed-By: Anna Henningsen
Reviewed-By: Ruben Bridgewater
@Fishrock123
Contributor

Ok, I landed #29908, let's see what this looks like without all that extra encoding nonsense. 😅

@addaleax addaleax removed the blocked label Oct 9, 2019
@addaleax
Member Author

addaleax commented Oct 9, 2019

@cjihrig @devnexen @Fishrock123 I’ve rebased this and it’s quite a bit simpler now, feel free to take another look

@jasnell
Member

jasnell commented Oct 9, 2019

Just thinking about it... given that opendir() is an async iteration over the directory entries, it may be good to document what happens when new entries are created while the iterator is still iterating...

e.g.

const fs = require('fs');

let n = 0;

async function print(path) {
  const dir = await fs.promises.opendir(path);
  for await (const dirent of dir) {
    fs.mkdirSync(`./foo${n++}`);
    console.log(dirent.name);
  }
}
print('./').catch(console.error);

and

const fs = require('fs');

const dir = fs.opendirSync(__dirname);

console.log(dir.readSync());
fs.mkdirSync('foo');
console.log(dir.readSync());

(the answer, of course, is that the newly created entries are not included in the iteration, but that should be documented :-) ...)

BridgeAR pushed a commit that referenced this pull request Oct 10, 2019
Contributor

@Fishrock123 Fishrock123 left a comment

(Also, if you could amend the commit to note that it also includes misc cleanup, that would be ideal)

@Fishrock123
Contributor

@jasnell I believe that also applies to uv_fs_scandir aka fs.readdir(), since other threads could operate on the directory on disk between I/O calls.

@addaleax
Member Author

Just thinking about it... given that opendir() is an async iteration over the directory entries, it may be good to document what happens when new entries are created while the iterator is still iterating...

[...]

(the answer, of course, is that the newly created entries are not included in the iteration, but that should be documented :-) ...)

That is not the answer, sorry – POSIX says:

If a file is removed from or added to the directory after the most recent call to opendir() or rewinddir(), whether a subsequent call to readdir() returns an entry for that file is unspecified.

I think it’s fair for our docs to reflect that, though. I’ve pushed c66840d for that.
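Since POSIX leaves the result unspecified, one way to sidestep the question in user code is to finish listing before changing the directory. The sketch below assumes a hypothetical helper named `listThenCreate` and made-up file names; it only uses documented `fs` calls and is not taken from the PR or the docs change.

```javascript
const fs = require('fs');
const os = require('os');
const path = require('path');

// Hypothetical helper: finish listing before making any changes, so the
// result does not depend on the unspecified behaviour quoted above.
function listThenCreate(dirPath, newNames) {
  const dir = fs.opendirSync(dirPath);
  const before = [];
  let dirent;
  while ((dirent = dir.readSync()) !== null) before.push(dirent.name);
  dir.closeSync();

  // Mutations happen only after the iteration is complete.
  for (const name of newNames) fs.mkdirSync(path.join(dirPath, name));
  return before.sort();
}

// Usage with a scratch directory (names are illustrative).
const tmp = fs.mkdtempSync(path.join(os.tmpdir(), 'opendir-posix-'));
fs.writeFileSync(path.join(tmp, 'existing.txt'), '');
const listed = listThenCreate(tmp, ['foo0', 'foo1']);
const after = fs.readdirSync(tmp).sort();
console.log(listed, after);

// Clean up the scratch directory.
fs.rmSync(tmp, { recursive: true, force: true });
```

This only covers changes made by the same code path; other processes or threads can still modify the directory during iteration, and whether those entries show up remains unspecified.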

@jasnell I believe that also applies to uv_fs_scandir aka fs.readdir(), since other threads could operate on the directory on disk between I/O calls.

(I assume the same goes for that, too.)

(Also, if you could amend the commit to note that it also includes misc cleanup, that would be ideal)

I’m not really seeing anything unrelated to the buffering, besides maybe removing the GetAsyncWrap() fn?


read();
});
} {
Contributor

@addaleax This patch seems wrong? There should be an else here... I think the benchmark you ran included readSync() along with the callback one...

Member Author

Oops, yes … funny the linter wouldn’t complain about this? The correct benchmark results for changing to setImmediate are

                                                                confidence improvement accuracy (*)    (**)   (***)
 fs/bench-opendir.js mode='callback' dir='lib' n=100                           3.40 %      ±15.95% ±21.22% ±27.62%
 fs/bench-opendir.js mode='callback' dir='test/parallel' n=100        ***    -43.16 %      ±10.70% ±14.27% ±18.66%

Contributor

This result seems strange to me, how is that worse than before?

Regardless, after further thought I think that nextTick is the reasonable thing to do.

Member Author

@Fishrock123 I think it makes sense – as you noted, the readSync() part was run unconditionally before, so its (unchanged) performance was included in the -30 %. If we only benchmark what has been slowed down, seeing more negative impact seems about right to me.

@addaleax addaleax added the author ready label Oct 11, 2019
addaleax added a commit that referenced this pull request Oct 11, 2019
PR-URL: #29893
Reviewed-By: Colin Ihrig
Reviewed-By: David Carlier
Reviewed-By: James M Snell
Reviewed-By: Jeremiah Senkpiel
addaleax added a commit that referenced this pull request Oct 11, 2019
@addaleax
Member Author

Landed in 5c93aab and 7812a61

@addaleax addaleax closed this Oct 11, 2019
@addaleax addaleax deleted the opendir-buffering branch October 11, 2019 21:10
targos pushed a commit that referenced this pull request Nov 8, 2019
targos pushed a commit that referenced this pull request Nov 8, 2019
targos pushed a commit that referenced this pull request Nov 10, 2019
targos pushed a commit that referenced this pull request Nov 10, 2019