Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

streaming / iterative fs.readdir #583

Closed
jonathanong opened this issue Jan 23, 2015 · 93 comments
Closed

streaming / iterative fs.readdir #583

jonathanong opened this issue Jan 23, 2015 · 93 comments
Assignees
Labels
feature request Issues that request new features to be added to Node.js. fs Issues and PRs related to the fs subsystem / file system. libuv Issues and PRs related to the libuv dependency or the uv binding.

Comments

@jonathanong
Copy link
Contributor

since we're in ES6 territory now, i'm thinking the sync version should be an iterable

var dirs = fs.readdirIter(__dirname);
for (dir of dirs) {

}

and have the async version be an object stream:

var stream = fs.readdirStream(__dirname);
stream.on('data', dir => )

See: nodejs/node-v0.x-archive#388

@novacrazy
Copy link

Using generators/iterators in io.js where they make sense would be a good addition.

@brendanashworth brendanashworth added the fs Issues and PRs related to the fs subsystem / file system. label Jan 24, 2015
@cjb
Copy link

cjb commented Jan 24, 2015

Does this depend on the work in joyent/libuv#1574 to be merged first? I don't think it's in libuv yet.

@vkurchatkin
Copy link
Contributor

👍

in the future asynchronous version could return an observable:

for (dir on fs.readdir(__dirname)) {

}

@hemanth
Copy link
Contributor

hemanth commented Jan 24, 2015

👍

@Qard
Copy link
Member

Qard commented Jan 24, 2015

I'm all for the new named versions. We don't want to break compatibility with existing code by changing behaviour of existing functions, but new stuff could work. I'm not sure what the policy is yet on new interfaces that deviate from node.js though.

@iojs/tc Thoughts?

@rvagg
Copy link
Member

rvagg commented Jan 26, 2015

Where else does an interface like this make sense? Would this be a one-off of would there be a flood of requests to add the same style interface to other core APIs?

@jonathanong
Copy link
Contributor Author

@rvagg I looked for any other APIs that would look like this but I couldn't find any. I think this may be a one-off.

@rvagg
Copy link
Member

rvagg commented Jan 26, 2015

I think the way forward on this is for someone here to propose a change in a PR and then we'll escalate that to a TC discussion because it would be good to cover the question of whether adopting a new style of API is desirable. If it's too much work to come up with an initial implementation then we could just elevate this issue, it'd just be more of a hypothetical discussion then.

@Qard
Copy link
Member

Qard commented Jan 26, 2015

I think the streams version makes sense. I would test the iterator design in userland though. It can apply to any sort of stream, really. And it's easy enough to abstract a stream as an iterator in userland.

@Fishrock123
Copy link
Contributor

The eventual observables thing sounds pretty rad.

@jonathanong
Copy link
Contributor Author

@Qard how would you create an iterator from a stream? streams are async... iterators are not...

@cjb i don't think it's merged. i'm going to open an issue in libuv now

@Qard
Copy link
Member

Qard commented Jan 28, 2015

Just use co? (or something similar)

@domenic
Copy link
Contributor

domenic commented Jan 28, 2015

OK, a few things:

  • Iterables make no sense here; they are synchronous. (Generators are just iterables with .throw() and .return() methods; they don't make any sense either.)
  • There are vague designs by one TC39 committee member for adding RxJS-style observables to the language, along with for-on syntax. I doubt that will make the ES2016 train since no implementers have yet expressed interest, but I could be wrong. Regardless, it's very tentative. (There's another competing proposal, async iterables, that might be a better fit for this use case in particular.)
  • In the meantime, Node.js has an awkward version of observables/async-iterables/etc. already: object-mode streams. That's clearly the right choice for this situation, even if in the long run there becomes a language-level primitive for plural asynchronous values (similar to promises being the primitive for singular asynchronous values, or iterables being the primitive for plural synchronous values).

So I don't think this should be a new style of API. Just an object mode stream is fine.

@timoxley
Copy link
Contributor

Whoa, yeah this is a totally bogus suggestion as iterators and generators can't be used for iterating over async. This tripped me up at first as well.

But note that though it's not exactly what you had in mind fs.readFileSync(file, 'utf8') is already iterable by way of a String being iterable, and soon fs.readFileSync(file) will too be iterable as Buffers implement the iterable interface: #525

@domenic
Copy link
Contributor

domenic commented Jan 28, 2015

So re-reading the OP: readdirSync already returns an array which is iterable

@bnoordhuis
Copy link
Member

It does, but readdirSync() is (too) eager: it slurps the directory in one pass. If you apply it to a directory with 1M+ entries, it will likely bring down the process.

An iterator that walks the entries piecemeal would be great but it requires a number of changes to libuv. @misterdjules already did some of the prep work but I recall that there were still some loose ends. I'm not even sure if we reached consensus that his approach was really the best way forward.

@vkurchatkin
Copy link
Contributor

as iterators and generators can't be used for iterating over async

they can!

for (let promise of iterator) {
  let val = await promise;
}

@domenic
Copy link
Contributor

domenic commented Jan 28, 2015

@bnoordhuis I see, so you'd block on each call to .next(), instead of all at once.

@vkurchatkin yes, you can invent contracts on top of iterators. Especially ones that only work in ES2016. But they won't necessarily be respected. Someone could "forget" an await, or a .then, or just call .next() twice in a row. Now what? How do you know if it's end-of-iterable or I/O error or ...?

@domenic
Copy link
Contributor

domenic commented Jan 28, 2015

@vkurchatkin essentially you are proposing async iterable as next() : -> Iteration<Promise<T>>, whereas a more coherent design is next(): -> Promise<Iteration<T>>.

For example in your example the producer can't decide when to return or throw from the generator side, because the producer has to wait for the async process to complete before doing so, but the only way to signal completion or errors is synchronous return/throw.

@bnoordhuis
Copy link
Member

you'd block on each call to .next(), instead of all at once.

Yes, but I'm mostly concerned with memory usage; an iterator brings it down from O(n) to O(1). Currently, trying to scan a large directory exhausts the heap and terminates the process.

@vkurchatkin
Copy link
Contributor

@domenic agreed, ugly and far from ideal.

a more coherent design is next(): -> Promise<Iteration>

I can't see how this can be achieved using existing ES2015/2016 primitives. Thanks for the link, probably it has explanation.

@jonathanong
Copy link
Contributor Author

Yeah the iterative version in my head was a synchronous version that doesn't load the entire array in memory at once. But i care way more about the async version.

@Qard
Copy link
Member

Qard commented Jan 28, 2015

@domenic While generators on their own are sync, they can be used in an async way via co. Maybe you don't like the weird promise trampoline hack it uses, but it works and many people are already using it in that way. I suggested using co because I had already suggested doing the iterator thing in userland, which means those sort of tools are available. If you prefer some different async iterator module, you can write a streams abstraction on that.

I wouldn't mind this being in core, I just suggested a userland approach because I expected some members of the core team might disagree.

@jonathanong
Copy link
Contributor Author

btw let's not do the iterator design. i realized that if something like this happens:

var i = 0;
for (name of fs.readdirIter(__dirname)) {
  if (i++ > 5) throw new Error('boom');
}

assuming fs.readdirIter() creates a file descriptor, the above may cause a file descriptor leak as the iterator may remain in a paused state. this is assuming fs.readdirIter() opens a file descriptor until it's complete. having users manually close the file descriptor may be too much to ask (not worth people complaining about leaks when they don't know about this).

so... let's just do the stream version for now :D

@domenic
Copy link
Contributor

domenic commented Jan 29, 2015

@jonathanong as long as the iterator implements iterator.return() that closes the FD, there will be no leak.

@vkurchatkin
Copy link
Contributor

@domenic
Copy link
Contributor

domenic commented Jan 29, 2015

@vkurchatkin same reason throw isn't.

@vkurchatkin
Copy link
Contributor

not working:

var arr = [1,2,3,4,5];
var it = arr[Symbol.iterator]();

it.return = function() {
  cobsole.log('return');
}


for (var v of it) {
  if (v > 3) break;
  console.log(v)
}

@Fishrock123
Copy link
Contributor

@rijnhard That's gona add a whole bunch of overhead...

But to be clear, you want to: have an object stream of directory entries, transform those into file reads, in the same stream?

That doesn't really make sense to me. You end up muliplexing/splitting the stream anyways and that leads to a very similar can of worms as just "iterating" and "awaiting" until a stream is finished to pull the next entry.

Maybe you could elaborate more? I am presently avoiding making this a stream due to complexity and lack of solid use cases.

@rijnhard
Copy link

@Fishrock123 I think I misunderstood, I just need to iterate through directories and not their contents.

@paragi
Copy link

paragi commented Aug 15, 2019

While you are working with it, would it be a lot of trouble to add a function with glob or just '?' and '*' wildcard functionality?
I presume performance and memory usages would vastly improve, working on very large filesystems, if you do the scanning in libuv.
It would be much appriciatet 🥇

@Fishrock123
Copy link
Contributor

@paragi This API will do no filtering and will not return entries from subdirectories. That will be up to a module to do.

@Fishrock123
Copy link
Contributor

Ok folks, the pull request is up: #29349

const fs = require('fs');

async function print(path) {
  const dir = await fs.promises.opendir(path);
  for await (const dirent of dir) {
    console.log(dirent.name);
  }
}
print('./').catch(console.error);

Fishrock123 added a commit to Fishrock123/node that referenced this issue Oct 7, 2019
This adds long-requested methods for asynchronously interacting and
iterating through directory entries by using `uv_fs_opendir`,
`uv_fs_readdir`, and `uv_fs_closedir`.

`fs.opendir()` and friends return an `fs.Dir`, which contains methods
for doing reads and cleanup. `fs.Dir` also has the async iterator
symbol exposed.

The `read()` method and friends only return `fs.Dirent`s for this API.
Having a entry type or doing a `stat` call is deemed to be necessary in
the majority of cases, so just returning dirents seems like the logical
choice for a new api.

Reading when there are no more entries returns `null` instead of a
dirent. However the async iterator hides that (and does automatic
cleanup).

The code lives in separate files from the rest of fs, this is done
partially to prevent over-pollution of those (already very large)
files, but also in the case of js allows loading into `fsPromises`.

Due to async_hooks, this introduces a new handle type of `DIRHANDLE`.

This PR does not attempt to make complete optimization of
this feature. Notable future improvements include:
- Moving promise work into C++ land like FileHandle.
- Possibly adding `readv()` to do multi-entry directory reads.
- Aliasing `fs.readdir` to `fs.scandir` and doing a deprecation.

Refs: nodejs/node-v0.x-archive#388
Refs: nodejs#583
Refs: libuv/libuv#2057
Fishrock123 added a commit that referenced this issue Oct 8, 2019
This adds long-requested methods for asynchronously interacting and
iterating through directory entries by using `uv_fs_opendir`,
`uv_fs_readdir`, and `uv_fs_closedir`.

`fs.opendir()` and friends return an `fs.Dir`, which contains methods
for doing reads and cleanup. `fs.Dir` also has the async iterator
symbol exposed.

The `read()` method and friends only return `fs.Dirent`s for this API.
Having a entry type or doing a `stat` call is deemed to be necessary in
the majority of cases, so just returning dirents seems like the logical
choice for a new api.

Reading when there are no more entries returns `null` instead of a
dirent. However the async iterator hides that (and does automatic
cleanup).

The code lives in separate files from the rest of fs, this is done
partially to prevent over-pollution of those (already very large)
files, but also in the case of js allows loading into `fsPromises`.

Due to async_hooks, this introduces a new handle type of `DIRHANDLE`.

This PR does not attempt to make complete optimization of
this feature. Notable future improvements include:
- Moving promise work into C++ land like FileHandle.
- Possibly adding `readv()` to do multi-entry directory reads.
- Aliasing `fs.readdir` to `fs.scandir` and doing a deprecation.

Refs: nodejs/node-v0.x-archive#388
Refs: #583
Refs: libuv/libuv#2057

PR-URL: #29349
Reviewed-By: Anna Henningsen <[email protected]>
Reviewed-By: David Carlier <[email protected]>
@Fishrock123
Copy link
Contributor

Landed in cbd8d71 as fs{Promises}.opendir(), which returns an fs.Dir, which exposes an async iterator. 🎉

BridgeAR pushed a commit that referenced this issue Oct 9, 2019
This adds long-requested methods for asynchronously interacting and
iterating through directory entries by using `uv_fs_opendir`,
`uv_fs_readdir`, and `uv_fs_closedir`.

`fs.opendir()` and friends return an `fs.Dir`, which contains methods
for doing reads and cleanup. `fs.Dir` also has the async iterator
symbol exposed.

The `read()` method and friends only return `fs.Dirent`s for this API.
Having a entry type or doing a `stat` call is deemed to be necessary in
the majority of cases, so just returning dirents seems like the logical
choice for a new api.

Reading when there are no more entries returns `null` instead of a
dirent. However the async iterator hides that (and does automatic
cleanup).

The code lives in separate files from the rest of fs, this is done
partially to prevent over-pollution of those (already very large)
files, but also in the case of js allows loading into `fsPromises`.

Due to async_hooks, this introduces a new handle type of `DIRHANDLE`.

This PR does not attempt to make complete optimization of
this feature. Notable future improvements include:
- Moving promise work into C++ land like FileHandle.
- Possibly adding `readv()` to do multi-entry directory reads.
- Aliasing `fs.readdir` to `fs.scandir` and doing a deprecation.

Refs: nodejs/node-v0.x-archive#388
Refs: #583
Refs: libuv/libuv#2057

PR-URL: #29349
Reviewed-By: Anna Henningsen <[email protected]>
Reviewed-By: David Carlier <[email protected]>
addaleax pushed a commit to nodejs/quic that referenced this issue Oct 9, 2019
This adds long-requested methods for asynchronously interacting and
iterating through directory entries by using `uv_fs_opendir`,
`uv_fs_readdir`, and `uv_fs_closedir`.

`fs.opendir()` and friends return an `fs.Dir`, which contains methods
for doing reads and cleanup. `fs.Dir` also has the async iterator
symbol exposed.

The `read()` method and friends only return `fs.Dirent`s for this API.
Having a entry type or doing a `stat` call is deemed to be necessary in
the majority of cases, so just returning dirents seems like the logical
choice for a new api.

Reading when there are no more entries returns `null` instead of a
dirent. However the async iterator hides that (and does automatic
cleanup).

The code lives in separate files from the rest of fs, this is done
partially to prevent over-pollution of those (already very large)
files, but also in the case of js allows loading into `fsPromises`.

Due to async_hooks, this introduces a new handle type of `DIRHANDLE`.

This PR does not attempt to make complete optimization of
this feature. Notable future improvements include:
- Moving promise work into C++ land like FileHandle.
- Possibly adding `readv()` to do multi-entry directory reads.
- Aliasing `fs.readdir` to `fs.scandir` and doing a deprecation.

Refs: nodejs/node-v0.x-archive#388
Refs: nodejs/node#583
Refs: libuv/libuv#2057

PR-URL: nodejs/node#29349
Reviewed-By: Anna Henningsen <[email protected]>
Reviewed-By: David Carlier <[email protected]>
addaleax pushed a commit to nodejs/quic that referenced this issue Oct 9, 2019
This adds long-requested methods for asynchronously interacting and
iterating through directory entries by using `uv_fs_opendir`,
`uv_fs_readdir`, and `uv_fs_closedir`.

`fs.opendir()` and friends return an `fs.Dir`, which contains methods
for doing reads and cleanup. `fs.Dir` also has the async iterator
symbol exposed.

The `read()` method and friends only return `fs.Dirent`s for this API.
Having a entry type or doing a `stat` call is deemed to be necessary in
the majority of cases, so just returning dirents seems like the logical
choice for a new api.

Reading when there are no more entries returns `null` instead of a
dirent. However the async iterator hides that (and does automatic
cleanup).

The code lives in separate files from the rest of fs, this is done
partially to prevent over-pollution of those (already very large)
files, but also in the case of js allows loading into `fsPromises`.

Due to async_hooks, this introduces a new handle type of `DIRHANDLE`.

This PR does not attempt to make complete optimization of
this feature. Notable future improvements include:
- Moving promise work into C++ land like FileHandle.
- Possibly adding `readv()` to do multi-entry directory reads.
- Aliasing `fs.readdir` to `fs.scandir` and doing a deprecation.

Refs: nodejs/node-v0.x-archive#388
Refs: nodejs/node#583
Refs: libuv/libuv#2057

PR-URL: nodejs/node#29349
Reviewed-By: Anna Henningsen <[email protected]>
Reviewed-By: David Carlier <[email protected]>
addaleax pushed a commit to nodejs/quic that referenced this issue Oct 11, 2019
This adds long-requested methods for asynchronously interacting and
iterating through directory entries by using `uv_fs_opendir`,
`uv_fs_readdir`, and `uv_fs_closedir`.

`fs.opendir()` and friends return an `fs.Dir`, which contains methods
for doing reads and cleanup. `fs.Dir` also has the async iterator
symbol exposed.

The `read()` method and friends only return `fs.Dirent`s for this API.
Having a entry type or doing a `stat` call is deemed to be necessary in
the majority of cases, so just returning dirents seems like the logical
choice for a new api.

Reading when there are no more entries returns `null` instead of a
dirent. However the async iterator hides that (and does automatic
cleanup).

The code lives in separate files from the rest of fs, this is done
partially to prevent over-pollution of those (already very large)
files, but also in the case of js allows loading into `fsPromises`.

Due to async_hooks, this introduces a new handle type of `DIRHANDLE`.

This PR does not attempt to make complete optimization of
this feature. Notable future improvements include:
- Moving promise work into C++ land like FileHandle.
- Possibly adding `readv()` to do multi-entry directory reads.
- Aliasing `fs.readdir` to `fs.scandir` and doing a deprecation.

Refs: nodejs/node-v0.x-archive#388
Refs: nodejs/node#583
Refs: libuv/libuv#2057

PR-URL: nodejs/node#29349
Reviewed-By: Anna Henningsen <[email protected]>
Reviewed-By: David Carlier <[email protected]>
@Fishrock123
Copy link
Contributor

This has been released in 12.12.0

@ma11hew28
Copy link
Contributor

ma11hew28 commented Nov 22, 2019

What about also making directories sync iterable (as initially suggested)?

I think this could be done by using dir.readSync().

The commit message could be "fs: make directories sync iterable".

@frank-dspeed
Copy link
Contributor

@ma11hew28 sorry to tell you that Sync can't be iterable as its Sync :) a iterable is a async type

@Qard
Copy link
Member

Qard commented Nov 27, 2019

There are sync iterators too. It's entirely possible to make a sync version.

@frank-dspeed
Copy link
Contributor

@Qard why should i use a iterator for a Sync call that will return after all is in mermory already but ok your right it could exist it can be done. i for my self would suggest for...of as iterate method but ok

@ma11hew28
Copy link
Contributor

ma11hew28 commented Nov 27, 2019

Thank you, @frank-dspeed and @Qard, for responding. I'm sorry for not responding promptly. @frank-dspeed, I'm sorry, but I think you misunderstood me. What you suggested is what I meant, as it is the first part of what @jonathanong initially suggested. Ie, if we make directories sync iterable, then you could do something like this:

const fs = require('fs')

const dir = fs.opendirSync('.')
for (const dirent of dir) {
  console.log(dirent.name)
}

instead of something like this:

const fs = require('fs')

const dir = fs.opendirSync('.')
let dirent
while ((dirent = dir.readSync()) !== null) {
  console.log(dirent.name)
}
dir.closeSync()

As for the second sentence of my initial comment on this issue, by "this", I meant "making directories sync iterable". Ie, I think a directory's default sync iterator's next() method could call dir.readSync() to get the directory's next entry.

@Qard
Copy link
Member

Qard commented Nov 28, 2019

@frank-dspeed It doesn't have to all be loaded into memory with a sync iterator. If you have a directory with millions of entries in it, a sync iterator could read and return entries one at a time, or in batches, but synchronously.

juanarbol pushed a commit to juanarbol/quic that referenced this issue Dec 17, 2019
This adds long-requested methods for asynchronously interacting and
iterating through directory entries by using `uv_fs_opendir`,
`uv_fs_readdir`, and `uv_fs_closedir`.

`fs.opendir()` and friends return an `fs.Dir`, which contains methods
for doing reads and cleanup. `fs.Dir` also has the async iterator
symbol exposed.

The `read()` method and friends only return `fs.Dirent`s for this API.
Having a entry type or doing a `stat` call is deemed to be necessary in
the majority of cases, so just returning dirents seems like the logical
choice for a new api.

Reading when there are no more entries returns `null` instead of a
dirent. However the async iterator hides that (and does automatic
cleanup).

The code lives in separate files from the rest of fs, this is done
partially to prevent over-pollution of those (already very large)
files, but also in the case of js allows loading into `fsPromises`.

Due to async_hooks, this introduces a new handle type of `DIRHANDLE`.

This PR does not attempt to make complete optimization of
this feature. Notable future improvements include:
- Moving promise work into C++ land like FileHandle.
- Possibly adding `readv()` to do multi-entry directory reads.
- Aliasing `fs.readdir` to `fs.scandir` and doing a deprecation.

Refs: nodejs/node-v0.x-archive#388
Refs: nodejs/node#583
Refs: libuv/libuv#2057

PR-URL: nodejs/node#29349
Reviewed-By: Anna Henningsen <[email protected]>
Reviewed-By: David Carlier <[email protected]>
juanarbol pushed a commit to juanarbol/quic that referenced this issue Dec 17, 2019
This adds long-requested methods for asynchronously interacting and
iterating through directory entries by using `uv_fs_opendir`,
`uv_fs_readdir`, and `uv_fs_closedir`.

`fs.opendir()` and friends return an `fs.Dir`, which contains methods
for doing reads and cleanup. `fs.Dir` also has the async iterator
symbol exposed.

The `read()` method and friends only return `fs.Dirent`s for this API.
Having a entry type or doing a `stat` call is deemed to be necessary in
the majority of cases, so just returning dirents seems like the logical
choice for a new api.

Reading when there are no more entries returns `null` instead of a
dirent. However the async iterator hides that (and does automatic
cleanup).

The code lives in separate files from the rest of fs, this is done
partially to prevent over-pollution of those (already very large)
files, but also in the case of js allows loading into `fsPromises`.

Due to async_hooks, this introduces a new handle type of `DIRHANDLE`.

This PR does not attempt to make complete optimization of
this feature. Notable future improvements include:
- Moving promise work into C++ land like FileHandle.
- Possibly adding `readv()` to do multi-entry directory reads.
- Aliasing `fs.readdir` to `fs.scandir` and doing a deprecation.

Refs: nodejs/node-v0.x-archive#388
Refs: nodejs/node#583
Refs: libuv/libuv#2057

PR-URL: nodejs/node#29349
Reviewed-By: Anna Henningsen <[email protected]>
Reviewed-By: David Carlier <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
feature request Issues that request new features to be added to Node.js. fs Issues and PRs related to the fs subsystem / file system. libuv Issues and PRs related to the libuv dependency or the uv binding.
Projects
None yet
Development

No branches or pull requests