File-based Blob implementation #45188
Comments
Shouldn't we directly get a …
I would be very happy to have web-compatible APIs on the JS …
@KhafraDev can you take a look at how this proposal matches #45139? My 2 cents are that these are complementary. The problem with … We want …
I'm not sure that's true. We are able to read huge files in the browser (using …)
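The browser pattern referred to here — consuming a big file incrementally rather than buffering it all at once — can be sketched with the `Blob` Node.js already ships. This is a hypothetical illustration of streaming reads, not the file-backed implementation under discussion:

```javascript
// Read a Blob chunk-by-chunk through its ReadableStream instead of
// materializing everything with arrayBuffer(). Only one chunk is held
// in flight at a time.
const blob = new Blob(['a'.repeat(1024 * 1024)]); // 1 MiB of data

let bytesSeen = 0;
for await (const chunk of blob.stream()) {
  bytesSeen += chunk.byteLength; // each chunk is a Uint8Array
}
console.log(bytesSeen === blob.size); // true
```

The same loop works in browsers, which is how huge `File` objects can be processed without holding them fully in memory.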
A …
The only public constructor of …
The difference is significant from an implementer's perspective. Our Blob and File (see lines 90 to 126 in 6b004f1). That one requires all the bits specified in the constructor (same as our Blob implementation). This implementation should be entirely disk-backed, with no data read.
I don't really care about the internal implementation (we have to do what needs to be done).
Agreed.
File is a rather small wrapper around Blob, so any improvements done to Blob will also benefit File. cc @jimmywarting who asked for this in #37340
@flakey5 ... just some feedback on the approach here. For now, ignore the standard … Currently, … At the C++ layer, when the native … For a file-based Blob implementation, what we need is a modification at the C++ layer that allows the data source underlying … Putting it into pseudo-code, the two options would be:

When calling … Keep in mind, however, that … As for your specific questions:
Yes, this will be the cleanest approach at the native layer. I would almost suggest calling it …
Correct.
This is one way of handling it, yes. Another way would be to add a new `blob()` method:

```js
import { open } from 'node:fs/promises';

const fh = await open('/path/to/a/file'); // open() returns a Promise<FileHandle>
const blob = fh.blob(); // proposed API, not yet implemented
const ab = await blob.arrayBuffer();
```

Keep in mind, however, that the new …
I wouldn't recommend this, to be honest. Now, returning to the question of … Given the example above, there's no reason why the …

```js
import { open } from 'node:fs/promises';

const fh = await open('/path/to/a/file');
const blob = fh.blob(); // proposed API, not yet implemented
const ab = await blob.arrayBuffer();
console.log(blob instanceof Blob); // true
console.log(blob instanceof File); // true
```
I'm not sure we should do a StreamBase-backed Blob. I would create the stream on first access.
#37340 is mostly a duplicate of this.
No browser fully complies with the Blob/File specification; they make some reasonable tradeoffs for efficiency instead. So I tested and confirmed whether browsers follow the spec.
Then I tried to modify the file in every way possible after I had opened a readable stream.
I have an old demo here that I tried running in different browsers: https://jsfiddle.net/3zsecwLg/ Then there was other behavior when I just used … Again, I got a file reference from a …
I even tried to do something like trying to read … So for anyone else wondering: browsers do cut corners, and that's okay if it means they don't have to copy a 4 GiB file; you can't hold that much in memory. fetch-blob mostly matches browsers' behavior of how files backed by the disk work when creating a file from …
I think that's fine, and doesn't rule out having the … FWIW, the main reason I suggested a separate …
So from what I'm gathering,

```js
import { open } from 'node:fs/promises';

const fh = await open('/path/to/a/file');
const blob = fh.blob(); // proposed API, not yet implemented
```
This approach would not lead to the desired result. The whole goal is to avoid storing the data anywhere. I think you'll need two C++ classes.
@mcollina ... That's not entirely possible. Accessing the data from Blob is expected to be idempotent, with what the spec calls a "snapshot state" taken at the time the Blob/File is created. That can be achieved in one of two ways: either we copy the data into a temporary file and read from it repeatedly, or we buffer the read data in memory and reuse it each time the Blob data is accessed. Obviously, copying the data to a temporary file is going to have a significant perf impact if the Blob/File is read multiple times.
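The second strategy (buffer once, reuse) can be sketched in a few lines; `lazyFileBlob` is an invented helper name, not a real Node.js API. The file is read a single time on first access, and every later read reuses the cached bytes, which keeps reads idempotent:

```javascript
import { readFile } from 'node:fs/promises';

// Hypothetical sketch of the "buffer in memory" approach described above.
function lazyFileBlob(path) {
  let cached; // Promise<Buffer>, set on first access
  return {
    async arrayBuffer() {
      cached ??= readFile(path); // only the first call touches the disk
      const buf = await cached;
      return buf.buffer.slice(buf.byteOffset, buf.byteOffset + buf.byteLength);
    },
  };
}
```

Later modifications to the file are invisible to the blob, which is the snapshot behavior the spec asks for, at the cost of holding the bytes in memory after the first read.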
I would recommend we diverge from the spec. I doubt any browser or runtime is keeping big files completely in memory. |
@mcollina .. what is the specific set of requirements you have in mind for this? It would help to understand a bit more of what you are expecting.
As a data point, Chrome just doesn't allow reading the file after it was changed:
☝️ @targos I have come to that conclusion as well, as I mentioned in my previous comment. It will throw the same error even if you make a new Blob before changing the content (no matter how small the file is).

```js
var file = input.files[0]
console.log(await file.text()) // works fine
var copy = new Blob([file])
console.log(await copy.text()) // works fine
await change_the_content_of_the_file
await copy.text() // throws an error
```

The blob parts are just reference points to where the data is located on the disk and how to read the content (with an optional offset & size, for when you slice the blob).
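That offset & size bookkeeping is observable through `slice()`, which records a view over the same underlying bytes rather than copying them. A small illustration using Node's built-in `Blob`:

```javascript
// slice(start, end) does not copy data eagerly; the returned Blob is a
// [start, end) window onto the original bytes.
const blob = new Blob(['hello world']);
const view = blob.slice(6, 11); // offset 6, end 11
console.log(view.size); // 5
console.log(await view.text()); // 'world'
```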
Ditto, I think we should do what Chrome also does.
In fetch(), if we receive a multipart response that includes a file, we want to store it on disk and return a …
Well, Chromium's implementation is far more complicated than that, but the point is taken. We can, I suppose, refuse to read the file-backed Blob if the underlying file state has changed, which would achieve the idempotency requirement. We should, however, still try to find some way of making multiple reads of the same data less expensive. As an aside, I do have concerns about state management complexity: if the Node.js process is abruptly terminated, are temp files going to be kept around without being cleaned up?
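The "refuse to read when the file changed" idea could look roughly like this sketch. `snapshotFileBlob` is an invented name, and a real file-backed Blob would do this at the native layer; the snapshot records size and mtime at creation and rejects reads once either changes, loosely mirroring the Chrome behavior observed above:

```javascript
import { open, stat } from 'node:fs/promises';

// Hypothetical sketch: validate a size+mtime snapshot before every read.
async function snapshotFileBlob(path) {
  const snap = await stat(path);
  return {
    size: snap.size,
    async arrayBuffer() {
      const now = await stat(path);
      if (now.size !== snap.size || now.mtimeMs !== snap.mtimeMs) {
        throw new DOMException('The underlying file has changed', 'NotReadableError');
      }
      const fh = await open(path);
      try {
        const buf = Buffer.alloc(snap.size);
        await fh.read(buf, 0, snap.size, 0); // read(buffer, offset, length, position)
        return buf.buffer.slice(buf.byteOffset, buf.byteOffset + snap.size);
      } finally {
        await fh.close();
      }
    },
  };
}
```

This avoids copying or buffering anything, but gives up idempotent reads in favor of failing loudly, which is the tradeoff being discussed.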
a useful shortcut if anyone would just do …

My concern is how you preallocate memory while streaming large blobs/files (#42264, lines 322 to 348 in 9c13e05). A better streaming method from files would be necessary.
Ok, @flakey5 ... based on this, tonight let's walk through an alternative approach to implementing. It will be possible to use the same underlying C++ …
I plan to use something like https://github.com/mcollina/on-exit-leak-free to unlink the file as soon as the …
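A minimal sketch of that cleanup strategy using only built-ins (`FinalizationRegistry` plus an exit hook); `trackTempFile` is an invented helper, and on-exit-leak-free packages a more robust version of the same idea:

```javascript
import { unlinkSync } from 'node:fs';

// Best-effort temp-file cleanup: unlink when the wrapping Blob is
// garbage collected, and sweep anything left over at process exit.
const pending = new Set();

const registry = new FinalizationRegistry((path) => {
  pending.delete(path);
  try { unlinkSync(path); } catch { /* already gone */ }
});

process.on('exit', () => {
  for (const path of pending) {
    try { unlinkSync(path); } catch { /* already gone */ }
  }
});

function trackTempFile(blob, path) {
  pending.add(path);
  registry.register(blob, path);
  return blob;
}
```

Note this is best-effort only: a SIGKILL or hard crash skips the exit hook entirely, which is exactly the state-management concern about orphaned temp files.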
Any progress on this so far?
Yep. We have an open PR that I need to get finished up this week.
Just in case someone comes around looking for a workaround in the meantime: we made an implementation of this feature as a LazyBlob class compatible with the original …
@Aschen Node.js v19.8 just got released with support for … And FYI: I already developed a more spec-compatible Blob implementation way back: fetch-blob, which has this Blob/File-from-path option built in, both sync and async.
What is the problem this feature will solve?
As of now, the `Blob` class only supports data in memory, while the spec allows it to be both memory and file based. This has the possibility of using up a lot of memory when writing a lot of data. An example of this is the motivator for this change: undici's multipart/form-data support, where it stores the entire response file in memory (https://github.com/nodejs/undici/pull/1606/files#diff-bd6db9a8bceb4e9fce3f93ba5284c81f46a44a1187909e8609006e1efeab2570R427).

What is the feature you are proposing to solve the problem?

Adding support to `Blob` to work either off of memory or a file:

- Keep the `Blob` class the same and make a separate class called `FileBlob`.
- `FileBlob` stores a file handle (`std::fstream` or something) rather than a vector.
- The JS side uses the `Blob` class that already exists, just with its native handle pointing to a `FileBlob` instance rather than `Blob`.
- Add `fs.getFileBlob('some-file-path.txt')` to get a `Blob` instance that is file-based.

I'm currently working on implementing this and opened this issue to get some feedback on it.
cc @mcollina
What alternatives have you considered?
No response