
File-based Blob implementation #45188

Closed
flakey5 opened this issue Oct 26, 2022 · 30 comments
Labels
feature request Issues that request new features to be added to Node.js.

Comments

@flakey5
Member

flakey5 commented Oct 26, 2022

What is the problem this feature will solve?

As of now, the Blob class only supports data held in memory, while the spec allows a Blob to be backed by either memory or a file. This can consume a great deal of memory when handling large amounts of data. An example is the motivator for this change: undici's multipart/form-data support, which stores the entire response file in memory (https://github.com/nodejs/undici/pull/1606/files#diff-bd6db9a8bceb4e9fce3f93ba5284c81f46a44a1187909e8609006e1efeab2570R427).

What is the feature you are proposing to solve the problem?

Adding support for Blob to be backed by either memory or a file.

  • C++ side: keep the current Blob class the same and add a separate class called FileBlob.
    • It implements the exact same API; the only difference is that it reads from a file (through std::fstream or something) rather than from a vector.
  • JavaScript side: no breaking changes to the existing JavaScript API. It would use the exact same Blob class that already exists, just with its native handle pointing to a FileBlob instance rather than a Blob.
  • JavaScript side: create a method fs.getFileBlob('some-file-path.txt') to get a Blob instance that is file-based.
    • If no file path is provided, a file is created in the temp directory under some random name.

I'm currently working on implementing this and opened this issue to get some feedback on it.
cc @mcollina

What alternatives have you considered?

No response

@flakey5 flakey5 added the feature request Issues that request new features to be added to Node.js. label Oct 26, 2022
@targos
Member

targos commented Oct 26, 2022

JavaScript side: create a method fs.getFileBlob('some-file-path.txt') to get a Blob instance that is file-based.

Shouldn't we directly get a File?

@targos
Member

targos commented Oct 26, 2022

I would be very happy to have web-compatible APIs on the JS fs side (to get File and maybe also FileList objects)

@mcollina
Member

@KhafraDev can you take a look at how this proposal matches #45139?

My 2 cents are that these are complementary. The problem with File as defined in #45139 and by the spec is that it requires the entire file to be read beforehand. This is a no-go for most use cases Node.js is perfect for.

We want fs.getFileBlob() to return a File object, but disk-backed, not memory-backed.

@targos
Member

targos commented Oct 26, 2022

by the spec [...] it requires the entire file to be read beforehand.

I'm not sure that's true. We are able to read huge files in the browser (using file.stream()), so I would be surprised if it was read entirely beforehand.

@targos
Member

targos commented Oct 26, 2022

We want fs.getFileBlob() to return a File object, but disk-backed, not memory-backed.

A File is a Blob (it extends it). To me it makes more sense to get a disk-based File than a disk-based raw Blob. The only difference is a few small fields (like name and size).

@mcollina
Member

I'm not sure that's true. We are able to read huge files in the browser (using file.stream()), so I would be surprised if it was read entirely beforehand.

The only public constructor of File requires all the bits to be handed to it: https://developer.mozilla.org/en-US/docs/Web/API/File/File. The browsers use a non-spec-covered API to create disk-based Files.

A File is a Blob (it extends it). To me it makes more sense to get a disk-based File than a disk-based raw Blob. The only difference is a few small fields (like name and size).

The difference is significant from an implementer's perspective. Our Blob sits on top of memory:

node/src/node_blob.cc

Lines 90 to 126 in 6b004f1

void Blob::New(const FunctionCallbackInfo<Value>& args) {
  Environment* env = Environment::GetCurrent(args);
  CHECK(args[0]->IsArray());   // sources
  CHECK(args[1]->IsUint32());  // length
  std::vector<BlobEntry> entries;
  size_t length = args[1].As<Uint32>()->Value();
  size_t len = 0;
  Local<Array> ary = args[0].As<Array>();
  for (size_t n = 0; n < ary->Length(); n++) {
    Local<Value> entry;
    if (!ary->Get(env->context(), n).ToLocal(&entry))
      return;
    CHECK(entry->IsArrayBufferView() || Blob::HasInstance(env, entry));
    if (entry->IsArrayBufferView()) {
      Local<ArrayBufferView> view = entry.As<ArrayBufferView>();
      CHECK_EQ(view->ByteOffset(), 0);
      std::shared_ptr<BackingStore> store = view->Buffer()->GetBackingStore();
      size_t byte_length = view->ByteLength();
      view->Buffer()->Detach();  // The Blob will own the backing store now.
      entries.emplace_back(BlobEntry{std::move(store), byte_length, 0});
      len += byte_length;
    } else {
      Blob* blob;
      ASSIGN_OR_RETURN_UNWRAP(&blob, entry);
      auto source = blob->entries();
      entries.insert(entries.end(), source.begin(), source.end());
      len += blob->length();
    }
  }
  CHECK_EQ(length, len);
  BaseObjectPtr<Blob> blob = Create(env, entries, length);
  if (blob)
    args.GetReturnValue().Set(blob->object());
}
This change would require a completely different implementation of both Blob and File. The File constructor requires all the bits to be specified up front (same as our Blob implementation). This new implementation should be entirely disk-backed, with no data read up front.

@targos
Member

targos commented Oct 26, 2022

I don't really care about the internal implementation (we have to do what needs to be done).
I care about what the user gets when they use the API. And from the user's perspective, IMO it makes much more sense to get a File rather than a raw Blob.

@mcollina
Member

Agreed.

@KhafraDev
Member

File is a rather small wrapper around Blob, so any improvements done to Blob will also benefit File.

cc @jimmywarting who asked for this in #37340

@jasnell
Member

jasnell commented Oct 26, 2022

@flakey5 ... just some feedback on the approach here.

For now, ignore the standard File class. That's going to be secondary here.

Currently, Blob is implemented essentially at the c++ layer. The JavaScript Blob class is a thin wrapper to adapt the internal implementation to the standard API.

At the C++ layer, when the native Blob object is created, a collection of in-memory data sources are provided and normalized into a collection and held. When functions like await blob.arrayBuffer() are called, a request at the c++ layer is made to go off and assemble the collection of in-memory entries into the requested ArrayBuffer which is returned back up to the JavaScript layer.

For a file-based Blob implementation, what we need is a modification at the c++ layer that allows the data source underlying Blob to be either the in-memory collection of parts that it is now or a stream (at the c++ layer we have a thing called StreamBase, which underlies many of the native-layer streaming APIs, including the fs module's FileHandle).

Putting it into pseudo-code, the two options would be:

// In-memory based...
auto blob = Blob::Create(collectionOfInMemoryParts);

// Stream-based...
auto blob = Blob::Create(streamBase);

When calling arrayBuffer() on the Stream-based blob the first time, instead of just memcpy'ing the contents of the in-memory components into a single ArrayBuffer, it will read from the stream and collect the returned data, assembling that into an ArrayBuffer at the end.

Keep in mind, however, that Blob is expected to be immutable and reads are idempotent. It will be necessary to retain the read data in memory. Effectively, a stream-based Blob will transition into an in-memory blob following that first read.
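
A minimal JavaScript sketch of that transition (illustrative only, not Node's C++ internals; StreamBackedBlobSketch and its field names are made up): the stream is drained on the first read and the assembled bytes are cached, so every later read is idempotent.

class StreamBackedBlobSketch {
  #source;         // async iterable of Uint8Array chunks (e.g. a file stream)
  #cached = null;  // ArrayBuffer retained after the first read

  constructor(source) {
    this.#source = source;
  }

  async arrayBuffer() {
    if (this.#cached === null) {
      const chunks = [];
      for await (const chunk of this.#source) chunks.push(chunk);
      const buf = Buffer.concat(chunks);
      // Copy out of Node's buffer pool so we own a tightly-sized ArrayBuffer.
      this.#cached = buf.buffer.slice(buf.byteOffset, buf.byteOffset + buf.byteLength);
    }
    return this.#cached;  // subsequent reads reuse the cached bytes
  }
}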

As for your specific questions:

C++ side: keep the current Blob class the same and add a separate class called FileBlob.
It implements the exact same API; the only difference is that it reads from a file (through std::fstream or something) rather than from a vector.

Yes, this will be the cleanest approach at the native layer. I would almost suggest calling it StreamBlob, however. It should be capable of being backed by any StreamBase-based object. Because of the idempotent read requirement it would effectively stream into a vector that would contain the constituent parts that will be retained in memory. The key difference is that pulling that data into memory is lazy, only done when the blob is actually read.

JavaScript side: no breaking changes to the existing JavaScript API. It would use the exact same Blob class that already exists, just with its native handle pointing to a FileBlob instance rather than a Blob.

Correct.

JavaScript side: create a method fs.getFileBlob('some-file-path.txt') to get a Blob instance that is file-based.

This is one way of handling it, yes. Another way would be to add a new blob() method to the existing FileHandle (https://nodejs.org/dist/latest-v19.x/docs/api/fs.html#class-filehandle) object. The API would be something like:

import { open } from 'node:fs/promises';
const fh = await open('/path/to/a/file');
const blob = fh.blob();  // proposed new method
const ab = await blob.arrayBuffer();

Keep in mind, however, that the new Blob could be backed by any streaming data source, including network sources, but we don't have to provide an API for that right away. That can come later.

If no file path is provided, a file is created in the temp directory under some random name

I wouldn't recommend this, to be honest.

Now, returning to the question of File. It looks like we'll now have an implementation of the File class in Node.js with #45139, which is fantastic. Done properly, the underlying implementation of StreamBlob here should be able to sit equally well under the JavaScript File class as it does under the JavaScript Blob class. There's absolutely no reason why that can't just be seamless and transparent. The JavaScript constructors for both Blob and File assume that the data is all in-memory, but we are not limited to using those constructors and can provide other ways of acquiring instances.

Given the example above, there's no reason why the fh.blob() can't return an instance of the standard File class, while other uses of StreamBlob only return a regular Blob.

import { open } from 'node:fs/promises';
const fh = await open('/path/to/a/file');
const blob = fh.blob();
const ab = await blob.arrayBuffer();

console.log(blob instanceof Blob);  // true
console.log(blob instanceof File);  // true

@mcollina
Member

I'm not sure we should do a StreamBase-backed Blob. I would create the stream on first access.

@jimmywarting

jimmywarting commented Oct 26, 2022

#37340 is mostly a duplicate of this.

the spec is that it requires the entire file to be read beforehand

Keep in mind, however, that Blob is expected to be immutable and reads are idempotent.

No browser fully complies with the Blob/File specification; they all make reasonable tradeoffs for efficiency instead.
I remember the spec saying something along the lines of "create a copy/snapshot of the file" and "any modification to that file afterwards should not affect the File class instance".

So I tested whether browsers actually follow the spec:

  • I took a relatively large file
  • selected it via input[type=file]
  • created a read stream with blob.stream().getReader() (to open a file descriptor - fd) and read only one chunk

Then I tried to modify the file in every way possible after I had opened the readable stream.

  • I renamed the file to see if it could still be read (it could, because it had an open file descriptor)
  • I then tried truncating, appending, and modifying bytes without changing the last-modified date, to see if it kept reading after opening an FD or if it would fail
    • If I remember correctly, it never read more than the file size and stopped reading once it got past file.size
    • When I truncated the file, I got fewer bytes
    • Modifying a single byte without changing the last-modified date gave a different result every time I re-read it

I have an old demo here that I tried running in different browsers: https://jsfiddle.net/3zsecwLg/


The behavior was different when I just used file.arrayBuffer() and file.text().

Again, I got a file reference from input[type=file].files[0]:

  1. I changed the last-modified date, renamed the file, or wrote/truncated data in the file after I had gotten the file reference
  2. then, every time I tried to read the file using text(), arrayBuffer() or stream(), it failed because the file could no longer be found, or the size or last-modified date had changed.

I even tried something like modified = new File([fromDisc, "random str"]) and then made modifications to the file (fromDisc) afterwards; it never once tried to make a copy/snapshot of fromDisc.

Trying to read modified always threw an error afterwards.

So for anyone else wondering: browsers do cut corners, and that's okay if it means not having to copy a 4 GiB file; you can't hold that much in memory anyway.
Some Files in browsers are just "file handles" pointing to some location on the disk.
In practice the spec behaves more like "don't create a copy/snapshot, and if any changes to the file have occurred, throw an error when trying to read it".

fetch-blob mostly matches browsers' behavior for disk-backed files. When you create a file with file = fileFromSync('/path'), it will only stat the file to get the last-modified date and file size; any time you try to read it, it first checks that these haven't changed.
Nothing is actually read until you actually need the content of the file.
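
As a rough sketch of that check-before-read pattern (my own illustration, not fetch-blob's actual code; snapshotFile and assertUnchanged are made-up names): stat the file once at creation time, then verify size and mtime before every read.

import { stat } from 'node:fs/promises';

async function snapshotFile(path) {
  const { size, mtimeMs } = await stat(path);
  return {
    size,
    // Call this before every read; mirrors the browser behavior above.
    async assertUnchanged() {
      const now = await stat(path);
      if (now.size !== size || now.mtimeMs !== mtimeMs) {
        throw new DOMException('The file has changed since this Blob was created.',
                               'NotReadableError');
      }
    },
  };
}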

@jasnell
Member

jasnell commented Oct 26, 2022

@mcollina:

I'm not sure we should do a StreamBase-backed Blob. I would create the stream on first access.

I think that's fine, and it doesn't rule out having the StreamBase-backed blob; whether the stream is lazily created on first access is an implementation detail. It doesn't change the general idea.

FWIW, the main reason I suggested a separate StreamBlob is to ensure that the new implementation doesn't negatively impact anything with the existing implementation. Just cleaner to have a separate class. But, if the new functionality can be added to the existing c++ Blob, then awesome, let's do that.

@flakey5
Member Author

flakey5 commented Oct 27, 2022

So from what I'm gathering,

  • Keep the C++ implementation in a single class
    • Add node::fs::FileHandle as a private member; use it for reading data from the file
    • As data is read from the file, populate Blob::store_ with it
    • Probably some implementation details I'm forgetting, such as some internal API changes on the JS side of things
  • Replace fs.getFileBlob() with FileHandle.blob(), as in @jasnell's comment:
import { open } from 'node:fs/promises';
const fh = await open('/path/to/a/file');
const blob = fh.blob();

@mcollina
Member

As data is read from the file, populate Blob::store_ with it

This approach would not lead to the desired result. The whole goal is to avoid storing the data anywhere. I think you'll need two C++ classes.

@jasnell
Member

jasnell commented Oct 27, 2022

@mcollina ... That's not entirely possible. Accessing the data from a Blob is expected to be idempotent. What the spec calls a "snapshot state" is taken at the time the Blob/File is created. That can be achieved in one of two ways: either we copy the data into a temporary file and read from it repeatedly, or we buffer the read data in memory and reuse it each time the Blob data is accessed. Obviously, copying the data to a temporary file is going to have a significant perf impact if the Blob/File is read multiple times.

@mcollina
Member

I would recommend we diverge from the spec. I doubt any browser or runtime is keeping big files completely in memory.

@jasnell
Member

jasnell commented Oct 27, 2022

@mcollina .. what are the specific requirements you have in mind for this? It would help to understand a bit more of what you are expecting.

@targos
Member

targos commented Oct 27, 2022

As a data point, Chrome just doesn't allow reading the file after it has changed:

Uncaught (in promise) DOMException: The requested file could not be read, typically due to permission problems that have occurred after a reference to a file was acquired.

@jimmywarting

jimmywarting commented Oct 27, 2022

☝️ @targos I came to the same conclusion, as I mentioned in my previous comment.

It will throw the same error even if you make a new Blob before changing the content (no matter how small the file is):

var file = input.files[0]
console.log(await file.text()) // works fine
var copy = new Blob([file])
console.log(await copy.text()) // works fine
await change_the_content_of_the_file
await copy.text() // throws an error

The blob parts are just reference points to where the data is located on the disk and how to read the content (with an optional offset & size, for when you slice the blob).

I would recommend we diverge from the spec

Ditto, I think we should do what Chrome does.

@mcollina
Member

what are the specific requirements you have in mind for this?

In fetch(), if we receive a multipart response that includes a file, we want to store it on disk and return a File object pointing to that file. This will allow us to avoid massive allocations that would otherwise come from holding the file in memory. The goal is to never have the whole file in memory.
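
To make that concrete, a hedged sketch of the flow (spillPartToDisk is hypothetical; creating the disk-backed File from the path is exactly the feature being requested here): the incoming part is piped straight to a temp file instead of being buffered.

import { createWriteStream } from 'node:fs';
import { pipeline } from 'node:stream/promises';
import { tmpdir } from 'node:os';
import { join } from 'node:path';
import { randomUUID } from 'node:crypto';

// Write a multipart file part to the OS temp dir without buffering it.
async function spillPartToDisk(partStream) {
  const tmpPath = join(tmpdir(), `part-${randomUUID()}`);
  await pipeline(partStream, createWriteStream(tmpPath));
  return tmpPath;  // a disk-backed File would then be created from this path
}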

@jasnell
Member

jasnell commented Oct 27, 2022

Well, Chromium's implementation is far more complicated than that, but the point is taken. We can, I suppose, refuse to read the file-backed Blob if the underlying file state has changed, which would achieve the idempotency requirement. We should, however, still try to find some way of making multiple reads of the same data less expensive.

As an aside, I do have concerns about state management complexity. If the node.js process is abruptly terminated, are temp files going to be kept around without being cleaned up?

@jimmywarting

jimmywarting commented Oct 27, 2022

A useful shortcut, if anyone does new Response(blob).blob(), would be to just return blob.slice() instead and never read/write any data to/from disk/memory. But that is just wishful thinking, and also just bad practice by a developer.

old stuff

My concern is how you preallocate memory while streaming large blobs/files (#42264);
the stream() method is not so stream-friendly at the moment (see line 332):

node/lib/internal/blob.js

Lines 322 to 348 in 9c13e05

/**
 * @returns {ReadableStream}
 */
stream() {
  if (!isBlob(this))
    throw new ERR_INVALID_THIS('Blob');
  const self = this;
  return new lazyReadableStream({
    async start() {
      this[kState] = await self.arrayBuffer();
      this[kIndex] = 0;
    },
    pull(controller) {
      if (this[kState].byteLength - this[kIndex] <= kMaxChunkSize) {
        controller.enqueue(new Uint8Array(this[kState], this[kIndex]));
        controller.close();
        this[kState] = undefined;
      } else {
        controller.enqueue(new Uint8Array(this[kState], this[kIndex], kMaxChunkSize));
        this[kIndex] += kMaxChunkSize;
      }
    }
  });
}

A better streaming method for files would be necessary.
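
For contrast, a minimal sketch of a pull-based file stream (my own example; fileStream, its path argument, and the chunk size are illustrative) that reads fixed-size chunks on demand instead of buffering everything through arrayBuffer() first:

import { open } from 'node:fs/promises';

function fileStream(path, chunkSize = 65536) {
  let fh;
  let position = 0;
  return new ReadableStream({
    async start() {
      fh = await open(path, 'r');
    },
    async pull(controller) {
      const buf = new Uint8Array(chunkSize);
      const { bytesRead } = await fh.read(buf, 0, chunkSize, position);
      if (bytesRead === 0) {
        controller.close();
        await fh.close();
        return;
      }
      position += bytesRead;
      controller.enqueue(buf.subarray(0, bytesRead));  // only the bytes read
    },
    async cancel() {
      await fh?.close();
    }
  });
}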

@jasnell
Member

jasnell commented Oct 27, 2022

Ok, @flakey5 ... based on this, tonight let's walk through an alternative approach to implementing this. It will be possible to use the same underlying c++ Blob object, but the details of how the underlying BlobEntry bits work will change.

@mcollina
Member

As an aside, I do have concerns about state management complexity. If the node.js process is abruptly terminated, are temp files going to be kept around without being cleaned up?

I plan to use something like https://github.com/mcollina/on-exit-leak-free to unlink the file as soon as the File object is collected or when the process exits. The file will be stored in the OS temporary folder anyway, so hard crashes will not leak files on disk long term.
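
A bare-bones sketch of that cleanup idea (on-exit-leak-free handles this more robustly; trackFileBlob and tmpPath are illustrative names):

import { unlinkSync } from 'node:fs';

// Unlink the backing temp file once the Blob is garbage-collected...
const registry = new FinalizationRegistry((tmpPath) => {
  try { unlinkSync(tmpPath); } catch {}
});

function trackFileBlob(fileBlob, tmpPath) {
  registry.register(fileBlob, tmpPath);
  // ...and also on normal process exit, as a fallback.
  process.once('exit', () => {
    try { unlinkSync(tmpPath); } catch {}
  });
  return fileBlob;
}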

@jimmywarting

This comment was marked as off-topic.

@jimmywarting

Any progress on this so far?

@jasnell
Member

jasnell commented Jan 2, 2023

Yep. We have an open PR that I need to get finished up this week.

@Aschen
Contributor

Aschen commented Mar 18, 2023

Just in case someone comes around looking for a workaround in the meantime: we made an implementation of this feature as a LazyBlob class compatible with the original Blob.

@jimmywarting

jimmywarting commented Mar 18, 2023

@Aschen Node.js v19.8 was just released with support for fs.openAsBlob(path, { type }), so we can maybe close this issue now.

And FYI: I already developed a more spec-compatible Blob implementation a while back: fetch-blob, which has this Blob/File-from-path option built in, both sync and async.
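
Usage of the shipped API looks like this (a quick sketch based on the documented signature fs.openAsBlob(path[, options]), available since Node.js 19.8):

import { openAsBlob } from 'node:fs';

const blob = await openAsBlob('./some-file.txt', { type: 'text/plain' });
console.log(blob.size);                       // size comes from a stat; no data read yet
const head = await blob.slice(0, 16).text();  // bytes are read from disk lazily here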
