Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ReadableStream.ReadAsync doesn't respect buffer lengths #3

Open
sschoener opened this issue Dec 9, 2023 · 8 comments
Open

ReadableStream.ReadAsync doesn't respect buffer lengths #3

sschoener opened this issue Dec 9, 2023 · 8 comments

Comments

@sschoener
Copy link

Hello there! I have attempted to use your FileSystemAccess library in Blazor -- thank you for your work on all this. I have noticed that the stream returned by File.StreamAsync() doesn't work in most scenarios I have tried (the file is acquired by using your APIs to show an "Open Window dialog" etc.). For example, when you have a 90MB file and just need to read 1KB of data, you would usually use

var buffer = new byte[1024];
int numberOfBytesRead = await stream.ReadAsync(buffer);

What this should do is read 1KB worth of data. What it actually does is throw an exception because the stream implementation doesn't respect the buffer length, as you can see from the implementation. Similarly, stream.Read() (reading a single byte) doesn't work.

Here is the typical callstack:

Unhandled Exception:
System.ArgumentException: Destination is too short. (Parameter 'destination')
   at KristofferStrube.Blazor.Streams.ReadableStream.ReadAsync(Memory`1 buffer, CancellationToken cancellationToken)
   at System.IO.StreamReader.ReadBufferAsync(CancellationToken cancellationToken)
   at System.IO.StreamReader.ReadLineAsyncInternal(CancellationToken cancellationToken)
   at BlazorLineFilter.Pages.Home.RunFilter() in D:\local-repositories\BlazorLineFilter\BlazorLineFilter\Pages\Home.razor:line 36
   at BlazorLineFilter.Pages.Home.DoFiltering() in D:\local-repositories\BlazorLineFilter\BlazorLineFilter\Pages\Home.razor:line 46
   at System.Threading.Tasks.Task.<>c.<ThrowAsync>b__128_1(Object state)
   at System.Threading.QueueUserWorkItemCallbackDefaultContext.Execute()
   at System.Threading.ThreadPoolWorkQueue.Dispatch()
   at System.Threading.ThreadPool.BackgroundJobHandler()

The Destiniation is too short message likely stems from the call to (await helper.InvokeAsync<byte[]>("byteArray", jSValue)).CopyTo(buffer);.

For similar reasons, using something like new StreamReader(stream).ReadLineAsync() doesn't work, unless you set the StreamReader buffer size to a value that sits above the size of the buffer that the browser has chosen. For example, I can get consistent failures when I set the buffer size of a StreamReader to 1MB but it works with 16MB. For now, that is, until Chrome changes its implementation and use larger buffers in the future :)

This is ultimately due to a difference between how JavaScript and C# choose to handle I/O: In C#, I get to choose the buffer size (the reader might still have its own buffer somewhere else for efficiency, but it'll handle whatever buffer size I give it). In Javascript, the browser gets to choose the buffer size and the read functions just spits out chunk in the sizes that the browser has chosen.

One option for fixing this could be to add an internally buffer to the ReadableStream itself or the reader types.

@KristofferStrube
Copy link
Owner

Hey @sschoener, Thank you so much for your interest in the library.

You have some very interesting points. I had not thought of this use case before as I have primarily seen people using the library to stream the entirety of a file.

I've read The documentation for the ReadAsync method and as you state it should either return some number equal to or less than the buffer size. And if no bytes were read, then return 0. Currently, it may read more than the desired size as you also state.

It is a good idea to use an internal buffer to save the rest of any read bytes so that they can be consumed later. I see it as a pretty straightforward solution but it also hides some behavior and doesn't make it directly obvious how many bytes were actually read from the stream which could make a user assume that the stream was in another state.

So before I decide on a solution I will check out whether it would be possible to read the stream in the desired way using a ReadableStreamBYOBReader as that might lead to a more transparent solution.

@sschoener
Copy link
Author

sschoener commented Dec 10, 2023

I have actually tried both now :) Find both versions below. The first needs the BYOB reader, the second the regular one. I have not included any of the members that aren't relevant.

This is the version with the BYOB reader:

class ReadStream : Stream
{
    IJSObjectReference JSReaderRef;
    IJSRuntime JSRuntime;
    long m_Size;

    public ReadStream(IJSRuntime jsRuntime, IJSObjectReference readerRef, long size)
    {
        JSReaderRef = readerRef;
        JSRuntime = jsRuntime;
        m_Size = size;
    }

    public override async ValueTask<int> ReadAsync(Memory<byte> buffer, CancellationToken cancellationToken = default)
    {
        if (MemoryMarshal.TryGetArray<byte>(buffer, out var arrSegment)) {
            var data = await JSRuntime.InvokeAsync<byte[]>("doRead", JSReaderRef, arrSegment.Array, arrSegment.Offset, arrSegment.Count);
            if (data != null)
            {
                data.CopyTo(buffer);
                return data.Length;
            }
            return 0;
        }
        throw new NotImplementedException();
    }
}

where I'm using this javascript:

<script>
    async function doRead(reader, buffer, offset, count) {
        const sub = new Uint8Array(buffer, offset, count);
        const { done, value } = (await reader.read(sub));
        if (done)
            return undefined;
        return value;
    }
</script>

Note here that the BYOB reader requires you to return the buffer that its read method returns. I think it'll only write to your buffer if it can't return a view on its internal buffer, so it tries to minimize copies... which doesn't help us in this case at all :)

Alternatively, here is the approach that just uses the buffer the regular reader returns:

class ReadStream : Stream
{
    IJSObjectReference JSReaderRef;
    IJSRuntime JSRuntime;
    long m_Size;
    int m_BufferOffset;
    byte[]? m_Buffer;

    public ReadStream(IJSRuntime jsRuntime, IJSObjectReference readerRef, long size)
    {
        JSReaderRef = readerRef;
        JSRuntime = jsRuntime;
        m_Size = size;
    }

    public override async ValueTask<int> ReadAsync(Memory<byte> buffer, CancellationToken cancellationToken = default)
    {
        int numWritten = 0;
        if (m_Buffer != null && m_BufferOffset < m_Buffer.Length)
        {
            int numToCopy = m_Buffer.Length - m_BufferOffset;
            if (numToCopy > buffer.Length)
                numToCopy = buffer.Length;
            m_Buffer.AsMemory(m_BufferOffset, numToCopy).CopyTo(buffer);
            numWritten += numToCopy;
            m_BufferOffset += numToCopy;
        }

        int leftToWrite = 0;
        while ((leftToWrite = buffer.Length - numWritten) > 0)
        {
            byte[] data = await JSRuntime.InvokeAsync<byte[]>("doRead", JSReaderRef);
            if (data == null || data.Length == 0)
                break;
            if (data.Length <= leftToWrite)
            {
                data.CopyTo(buffer.Slice(numWritten));
                numWritten += data.Length;
            }
            else
            {
                data.AsMemory(0, leftToWrite).CopyTo(buffer.Slice(numWritten));
                numWritten += leftToWrite;
                m_Buffer = data;
                m_BufferOffset = leftToWrite;
            }
        }
        return numWritten;
    }
}

with this Javascript:

async function doRead(reader) { return (await reader.read()).value; }

I will keep using the second version. I have looked at the performance implications of using either and the second one just makes more sense: The Javascript runtime already knows in what chunks it is reading a file, and we might as well just operate on those. In contrast, the BYOB version potentially uses smaller chunks than what the Javascript runtime used and thus incurs a bunch of overhead with all the back-and-forth between .NET and JS (plus additional copying).
For comparison, this code here opens a file, reads it line by line, and then writes it out again:

var reader = new StreamReader(inStream);
var writer = new StreamWriter(outStream);
string? line;
while ((line = await reader.ReadLineAsync()) != null) {
	await writer.WriteAsync(writeLine);
}
await writer.FlushAsync();

Using the second ReadStream is 10x faster than using the first one, because the buffer sizes are more sensible. If you use the BYOB based stream, you need to instead instruct the reader to use a more sensible buffer size, e.g. this makes the perf difference almost disappear:

var reader = new StreamReader(inStream, bufferSize: 4 * 1024 * 1024);
var writer = new StreamWriter(outStream);
string? line;
while ((line = await reader.ReadLineAsync()) != null) {
	await writer.WriteAsync(writeLine);
}
await writer.FlushAsync();

@sschoener
Copy link
Author

As an aside, the example-code above suffers not just from the buffering but also from the fact that frequent *Async calls are mre costly in the Mono-powered WASM than you'd think. So it's just an example :)

@KristofferStrube
Copy link
Owner

Really great work @sschoener 😁

It makes sense. It's great to get a headstart on how to make some alternative implementations. I will look into these as well and find the best solution. I think your focus on some sort of benchmarking is also a very valid point.

If we end up finding that there are trade-offs between correctness, convenience, and performance it might make sense to make it configurable which strategy it will use for stream reading.

Your comment regarding the async invocations is also very valid. For many of my libraries, I have synchronous versions of each class if they have any methods of properties that can be invoked synchronously. Blazor.Streams is no exception and I actually have an InProcess variant of the ReadableStream i.e. the ReadableStreamInProcess but I have not made any adaptions to override the ReadAsync method in this implementation. But I think it will add a lot of value as you also state. I don't think this should be considered in this issue, but I might create a separate issue for it once this one is closed.

@sschoener
Copy link
Author

sschoener commented Dec 10, 2023

I'm not actually sure how to get rid of the *Async overhead. On dotnet-proper, this overhead is very small, but on the Mono-WASM setup it seems to matter. I have no idea how the whole Mono-WASM interop works, but I have plenty of vanilla-Mono experience and more overhead there wouldn't surprise me. The problem that I see is that ultimately you are still going to call down into Javascript at some point to read/write, and those calls are always async -- the IJSRuntime doesn't seem to expose alternatives. You also can't just Task.Wait() on this runtime, so you're left with marking your entire callstack async...

While this is definitely an additional issue, I'm taking the opportunity to also inform you that you can save a good bit of time on the writing side by special-casing for arrays (that's the common case) and hence skip the expensive copy in buffer.ToArray() almost always:

public override async ValueTask WriteAsync(ReadOnlyMemory<byte> buffer, CancellationToken cancellationToken = default)
{
    if (System.Runtime.InteropServices.MemoryMarshal.TryGetArray(buffer, out var segment))
        await JSRuntime.InvokeVoidAsync("doWrite", JSStreamRef, segment.Array, segment.Offset, segment.Count);
    else
        await JSStreamRef.InvokeVoidAsync("write", buffer.ToArray());
}

where doWrite is

async function doWrite(stream, bytes, offset, count) {
    const sub = new Uint8Array(bytes.buffer, offset, count);
    await stream.write(sub);
}

@KristofferStrube
Copy link
Owner

If you want to read more about synchronous JSInterop then you might want to read this: IJSInProcessRuntime

@sschoener
Copy link
Author

Neat, thank you! I didn't know about that. I've really only started using Blazor yesterday, so it's all pretty new :). I'm not sure how much the in-process runtime would help here since the StreamWriter API on the JS side seems to be inherently asynchronous.

@KristofferStrube
Copy link
Owner

There are definitely some improvements possible like reading the value of the ReadableStreamReadResult, which can be further improved by using the synchronous IJSUnmarshalledRuntime type. This is though out of scope for this issue, so I have created a new issue where we can discuss those parts: #4

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants