
streams #30

Closed
davidmarkclements opened this issue May 22, 2014 · 50 comments

@davidmarkclements

Does the whole buffer have to be loaded into runtime memory, or is it possible with the underlying C library to implement a true streaming API, so that a file can be streamed from disk, through a sharp stream, to a network socket, file read stream or whatever else?

ref: http://nodejs.org/api/stream.html

Neither the GraphicsMagick nor ImageMagick wrappers are capable of streams - there's one module (gm, I think) that has a stream API, but it's somewhat faked: the module still has to load the full image into memory regardless. So a true streaming interface for sharp would be a massive USP.

@lovell
Owner

lovell commented May 22, 2014

Hi David, when the input image is on the filesystem, libvips only reads data as it is needed further down the processing pipeline. It's not "streaming" in the Node.js sense of the word but it does minimise memory usage.

True streaming of compressed input data from a network socket would probably require modification to libvips. Summoning the vastly superior knowledge of @jcupitt, who may be able to offer help here.

@jonathanong
Contributor

By the way, you'd want to change the .write() method to something else for future stream support - maybe toFile() or writeTo().

@jcupitt
Contributor

jcupitt commented May 23, 2014

Hi @davidmarkclements, as @lovell says, this is currently sort-of supported. When you process a large jpeg image (for example) libvips never decompresses the whole thing all at once, it just decompresses and then recompresses the area being processed as it works down the image.

If you want to work over unix pipes or network sockets, you'd need to add a new pair of load and save operations, something like vips_foreign_load_jpeg_fd( int fd, VipsImage **out, ... ) and vips_foreign_save_jpeg_fd( VipsImage *in, int fd, ... ). The saver would be trivial, you can just reuse the file saver, but the loader would be a bit more work, you'd need to make a new libjpeg input method.

You'd then need to expose that in sharp somehow, @lovell would be the expert there.

@jcupitt
Contributor

jcupitt commented May 23, 2014

If you're curious, the load-jpeg-from-buffer class is defined here:

https://github.com/jcupitt/libvips/blob/master/libvips/foreign/jpegload.c#L249

And it calls this set of functions to do the work:

https://github.com/jcupitt/libvips/blob/master/libvips/foreign/jpeg2vips.c#L1042

So you'd probably need to add under 500 lines of C.

@sebpiq

sebpiq commented Jun 8, 2014

👍 for streams!

@davidmarkclements
Author

Sorry for the delay on this, guys. I can happily get involved on the JS side of things - creating the stream interfaces and hooking them up to the bindings (although I'd have to research that; I haven't worked with bindings before). But actually editing the C code isn't my area and I wouldn't want to screw it up - if someone who's strong in C wants to team up, let me know and we'll get this done. I think it would be a huge win to have true stream image processing available in JS; I really don't know of anything else right now that offers that.

@jcupitt
Contributor

jcupitt commented Jun 9, 2014

OK, I'll try doing jpeg as an experiment. I've not done much with sockets, I guess I just read() and don't seek().

Would you always be reading from streams with a known type? Sniffing jpeg/png/etc. from a stream would be fiddly. I guess you'd read a MIME header (something like that?) first to get the file type, or perhaps you'd be streaming from a source you controlled.
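For what it's worth, sniffing the common formats only needs the first few magic bytes of the first chunk; a minimal sketch (the helper name `sniffFormat` is hypothetical, not part of sharp or libvips):

```javascript
// Hypothetical helper: identify the image format from the first chunk
// of a stream by checking well-known magic bytes, rather than trusting
// a declared content type.
function sniffFormat(chunk) {
  // JPEG always starts ff d8 ff
  if (chunk.length >= 3 &&
      chunk[0] === 0xff && chunk[1] === 0xd8 && chunk[2] === 0xff) {
    return 'jpeg';
  }
  // PNG has a fixed 8-byte signature
  var pngMagic = Buffer.from([0x89, 0x50, 0x4e, 0x47, 0x0d, 0x0a, 0x1a, 0x0a]);
  if (chunk.length >= 8 && chunk.slice(0, 8).equals(pngMagic)) {
    return 'png';
  }
  // WebP is a RIFF container with a WEBP fourcc at offset 8
  if (chunk.length >= 12 &&
      chunk.slice(0, 4).toString('ascii') === 'RIFF' &&
      chunk.slice(8, 12).toString('ascii') === 'WEBP') {
    return 'webp';
  }
  return null; // unknown - fall back to an explicitly declared type
}
```

The fiddly part is that the first chunk from a socket is not guaranteed to contain even these few bytes, so a real implementation would have to buffer until enough data has arrived.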

@jcupitt
Contributor

jcupitt commented Jun 9, 2014

I've made an experimental branch with a quick hack to support load-jpeg-from-stream:

https://github.com/jcupitt/libvips/tree/load-from-stream

For example, you can do this:

$ cat k2.jpg | vips jpegload_fd 0 x.v

Call from C like this:

/**
 * vips_jpegload_fd:
 * @fd: descriptor to load from
 * @out: image to write
 * @...: %NULL-terminated list of optional named arguments
 *
 * Optional arguments:
 *
 * @shrink: shrink by this much on load
 * @fail: fail on warnings
 *
 * Read a JPEG-formatted stream from @fd into a VIPS image. Exactly as
 * vips_jpegload(), but read from a descriptor. This operation does not seek()
 * and therefore can read from a network socket. 
 *
 * See also: vips_jpegload().
 *
 * Returns: 0 on success, -1 on error.
 */
int
vips_jpegload_fd( int descriptor, VipsImage **out, ... )

We should try hooking this up to sharp somehow and see how it behaves.

@jcupitt
Contributor

jcupitt commented Jun 9, 2014

I thought of a problem with this: it can only load a single image from a socket. If you want to send a series of images, you'll have to close and reopen the connection. I suppose this might be OK in some situations, but probably not over a slow network connection.

The reason is buffering. As you can see, it reads as many bytes as it can each time, up to the buffer capacity:

https://github.com/jcupitt/libvips/blob/load-from-stream/libvips/foreign/jpeg2vips.c#L1349

So it can read the end of one image and the first part of the next. When image1 is complete, we will have unused bytes buffered which would have to be saved somehow to be used for the next image.

It's unclear how this could be done neatly. vips does not support image streams (reading many images from one source) and adding it would be hard, so that's probably out. We'd probably need to add some sort of intermediate buffer object which supported "unread", ie. allowed unused bytes to be pushed back into it.

How does node tackle this sort of issue? Do they have a socket object that supports pushback?

@lovell
Owner

lovell commented Jun 9, 2014

Thanks @jcupitt, Node.js' net.Socket object is our friend here.

I'll try to make some time over the next few days to experiment with the new vips_jpegload_fd API and work out the most elegant way for sharp to expose it.

@davidmarkclements
Author

Using a socket would make the streams API super easy - essentially you just expose the socket directly, or add methods that act as proxies to the socket's methods. But is that really the right thing to do? The normative way, as I understand it, is to use C bindings (http://nodejs.org/api/addons.html) and then wrap a stream interface around the exposed parts of the underlying C lib. Using a socket will add unnecessary overhead IMO.

@davidmarkclements
Author

Sorry, I'm on mobile and didn't see @jcupitt's comment - what I said may or may not be relevant. By the way, pushback can be done when creating streams using the Stream module, but I don't know if that fits what we're talking about.

@lovell
Owner

lovell commented Jun 10, 2014

@jcupitt If libvips provided an API that accepted partial chunks of image data then that would fit with @davidmarkclements' suggestion about managing the Streams in JavaScript.

As soon as the chunks that comprise the image header data have arrived, processing can begin, but a thread may have to block waiting for further chunks of image data.

This is essentially the approach taken by the Node.js wrapper around LAME, which takes advantage of the underlying library's chunk-aware methods such as lame_encode_buffer.

Sockets and Streams both involve buffering data somewhere; something will block on a lack of data :)

Shall we trial the Sockets method first as that currently feels like the quickest way to a proof-of-concept?

@jcupitt
Contributor

jcupitt commented Jun 10, 2014

I was thinking about this again.

It would work if you're OK only being able to read a single image off a socket. That might be useful in some situations.

The problem is that vips (or rather the libjpeg read system) HAS to have a buffer, and therefore you can't send more than one image down a socket, since the end of the first image will eat the first part of the second.

We need to have a reusable intermediate read object to handle the buffering, and we need to be able to hand the libjpeg read system pointers into it. Therefore I think the buffer has to be a new vips object. In C, the API might look like:

int descriptor;

VipsInputStream *stream = vips_input_stream_new( descriptor );

for( i = 0; i < 10; i++ ) {
  VipsImage *x;

  if( vips_jpegload_stream( stream, &x, NULL ) )
    return( -1 );
  ... do something with image x
  g_object_unref( x );
}

g_object_unref( stream );

So I think we just need to get a file descriptor from the node.js stream.

@lovell
Owner

lovell commented Jun 18, 2014

After playing with Node.js Streams for the last couple of evenings I now believe the best approach is to avoid file descriptors and instead have sharp extend the stream.Transform class, buffering data in JavaScript-land first. This should then provide support for multiple images passing through the same pipe.

I've got some sample code working locally that needs tuning, documenting and benchmarking, but the early signs are that we should be able to support JPEG, PNG and WebP (but not TIFF) as both input and output streams without any modification to libvips. Hooray!

@lovell lovell added this to the v0.6.0 milestone Jun 18, 2014
@davidmarkclements
Author

YES! That's exactly the right approach


@davidmarkclements
Author

In terms of buffering the data in JS land, it would be great to ensure the transform stream honours the highWaterMark option (i.e. the capacity of the buffer) - that way a dev can control the trade-off between speed and memory usage.

That's one of the problems with the GraphicsMagick/ImageMagick wrappers - lack of control over memory usage leads to crashes on under-resourced servers. A stream with a configurable highWaterMark lets the dev operate on lower resources with slower delivery times; and vice versa, if they've plenty to spare they can up the watermark (as long as the CPU can keep up).

If there's any way I can help, let me know (peer review?)

@jcupitt
Contributor

jcupitt commented Jun 19, 2014

This sounds great, but I don't really understand how it can work :-( Please prove me wrong, of course.

libjpeg's read system just has a callback for "I need more data"; it does not tell you how much more data it would like. So you always read 4096 bytes (or whatever) and hand over that buffer.

After it has finished reading an image, there will be some unused bytes left in libjpeg's input buffer. These unused bytes will contain the beginning of the next image (perhaps part of a MIME header?). They can't be thrown away; they somehow have to be reused. They need to be extracted from libjpeg and given to the input system of whatever image reader you need next.

So ... I still think we'll need a VipsInputStream object or equivalent to be able to move input buffers between image readers.
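The "unread" idea can be sketched in JavaScript terms (UnreadBuffer is a hypothetical class, not a libvips or sharp API): an intermediate buffer that decoders pull from and can push surplus bytes back into:

```javascript
// Hypothetical intermediate read object supporting "unread": bytes a
// decoder consumed past the end of one image are pushed back to the
// front, ready for the next decoder.
class UnreadBuffer {
  constructor() {
    this.pending = Buffer.alloc(0);
  }
  feed(chunk) { // append newly arrived bytes from the socket
    this.pending = Buffer.concat([this.pending, chunk]);
  }
  read(n) { // hand out up to n bytes, as a decoder's fill callback would
    const out = this.pending.slice(0, n);
    this.pending = this.pending.slice(out.length);
    return out;
  }
  unread(chunk) { // return unused bytes to the front of the buffer
    this.pending = Buffer.concat([chunk, this.pending]);
  }
}
```

A VipsInputStream object would play the same role in C, with the extra wrinkle that libjpeg needs raw pointers into the buffer.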

@jcupitt
Contributor

jcupitt commented Jun 19, 2014

libpng's input system works in the same way, it asks for more data but won't tell you how much.

libtiff needs to be able to seek on the input, so it can't read from a stream, unfortunately.

webp, again, just asks for more data.

@lovell
Owner

lovell commented Jun 19, 2014

In summary, we need to buffer image data somewhere, be that within the underlying decompression libraries themselves, in libvips or in sharp.

The memory usage problems of *magick are usually due to them holding the entire uncompressed image in memory at once. If we buffer compressed image data only, that's a big improvement over existing solutions.

What I'm suggesting is that, to start with, we have sharp manage Buffer concatenation of compressed image data as chunks arrive (I may move memory management to the C++ part of sharp if Node.js Buffer performance is poor). Then, when libvips provides support for John's suggested VipsInputStream, we can switch to take advantage of that.

The external API of sharp will remain the same and can slowly get a bit more efficient over time, by pushing the responsibility for buffering upstream.

Here's a usage example of how elegant the Stream-based API of sharp should hopefully be, in this case resizing to 100px wide whilst maintaining the same format.

var transformer = sharp().resize(100);
inputStream.pipe(transformer).pipe(outputStream);

This unlocks the ability to create memory-efficient command line tools that pipe/transform image data from stdin to stdout.

@jcupitt
Contributor

jcupitt commented Jun 19, 2014

Ah, I see. Sure, sounds great. This is using vips_image_new_from_buffer(), I guess?

https://github.com/jcupitt/libvips/blob/master/libvips/iofuncs/image.c#L1916

I'll add a quick hack of the input stream stuff. You should be able to subclass VipsStream and hook it up to the node.js stream system, hopefully.

edit if you've not come across it, there's a write to buffer operation too you could maybe use for the output side:

https://github.com/jcupitt/libvips/blob/master/libvips/iofuncs/image.c#L2205

@davidmarkclements
Author

@lovell I love how you've totally grokked the awesomeness.

So. Something else we need to consider is handling backpressure. Since it's a transform stream it's going to need to do two things in order to be a stream that plays nice within a stream chain (and indeed within the larger OS environment).

  1. if incoming data is arriving too fast (e.g. from really, really fast storage) the write method should return false to apply backpressure to the input stream
  2. if the read part of the transform stream receives false from the write stream it's piping into (i.e. if the destination stream's write method returns false) then it needs to pause until the destination stream is ready - you can tell a write stream is ready again by listening for its drain event
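Those two behaviours can be sketched explicitly (pipe() implements exactly this for you; `copy` is just an illustrative name):

```javascript
// Manual backpressure, matching points 1 and 2 above: pause the source
// when the destination's write() returns false, resume on 'drain'.
function copy(src, dst) {
  src.on('data', function (chunk) {
    if (!dst.write(chunk)) {          // destination is saturated
      src.pause();                    // apply backpressure upstream
      dst.once('drain', function () { // destination has caught up
        src.resume();
      });
    }
  });
  src.on('end', function () {
    dst.end();
  });
}
```

This is roughly what Stream.prototype.pipe does internally, which is why a well-behaved Transform gets most of this for free as long as it respects write()'s return value.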

@jcupitt
Contributor

jcupitt commented Jun 19, 2014

@davidmarkclements I think your 1 and 2 should happen automatically. They do on Linux at least.

  1. VipsInput only does a read() when it needs more data, so the OS will apply backpressure for us when the amount of stuff queued up hits the high water mark.
  2. Again, if the write side is producing data too quickly, the OS will suspend it to prevent overbuffering.

@davidmarkclements
Author

@jcupitt But if the stream is being piped to another stream (say the other stream were a slow HTTP connection, for instance), the other stream may need to apply backpressure to the sharp stream. At a JavaScript level this should ideally be catered for, otherwise the streams it's piping to will be overwhelmed and data will begin to queue up in memory, which is of course contrary to the point of streams.

As for the writing side, we're talking about a source Node stream writing to the sharp Node stream, e.g. extending @lovell's example a little:

var fs = require('fs');
var http = require('http');
var sharp = require('sharp');

http.createServer(function (req, res) {
  // create a fresh read stream and transformer per request -
  // a stream can only be consumed once
  var inputStream = fs.createReadStream('/some/image.jpg');
  var transformer = sharp().resize(100);
  inputStream.pipe(transformer).pipe(res);
}).listen(3000);

So say the read stream from disk is coming in too fast; at a JS level we need to apply backpressure to let the inputStream know it needs to pause and wait for a drain event.

I don't see how the OS can suspend the stream in this case, because the sharp library isn't reading the file here, nor is it really operating at an OS level - the OS doesn't have control or knowledge of the streams in Node. That has to be managed at a JS level.

Further the input stream doesn't even need to be a file, it could be this scenario for instance:

var http = require('http');
var sharp = require('sharp');

http.createServer(function (req, res) {
  // http.get returns a ClientRequest; the readable stream is the
  // response passed to its callback
  http.get('http://some.image.com/image.png', function (imageRes) {
    var transformer = sharp().resize(100);
    imageRes.pipe(transformer).pipe(res);
  });
}).listen(3000);

Each stream implementation in Node has to manage and handle backpressure; if they expect the OS to do it, they'll quickly run into memory issues.

@lovell
Owner

lovell commented Jun 19, 2014

Given that, initially, we'll be placing input data into a memory buffer, I think we can rely on the OS for read-side back-pressure.

For write-side, sharp will have to manage any back-pressure applied to it by the OS. I've got a couple of ideas here, mostly involving reducing the output buffer data into exponentially smaller chunks.

The rather handy brake module can be used to emulate back-pressure when testing this.

@jcupitt
Contributor

jcupitt commented Jun 19, 2014

Ah, I thought these were network sockets and local pipes.

You're right - if you want it to work for JS software streams, you'd need a bit more logic. This could go in the VipsStream subclass, I guess (as you maybe said).

@lovell
Owner

lovell commented Jul 31, 2014

I think I've worked out how to correctly inherit from stream.Duplex with some minor API breakage.

The jpeg, png and webp methods need to become simple property setters and therefore can no longer trigger the processing pipeline without an explicit call to toFile, toBuffer or pipe.

Currently writing lots of tests.

@davidmarkclements
Author

Hey @lovell, sounds awesome. Just to confirm, in case there's anything else that needs hooking up: .on('data', fn) and/or stream.Duplex.prototype.read would also trigger the processing pipeline, right?

@lovell
Owner

lovell commented Jul 31, 2014

Yes, the pipeline will be triggered lazily only when _read is called internally to notify us that the consumer is ready. Sorry for any confusion.

@lovell
Owner

lovell commented Jul 31, 2014

Commit 6c9f3de in the streams2 branch implements the native stream.Duplex object.

The tests from line 272 onwards provide good examples of the various methods of input and output.

Comments welcome!

@lovell
Owner

lovell commented Aug 7, 2014

I've updated the streams2 branch to not (yet) break the backwards compatibility of accepting a callback with the png, jpeg and webp methods. They will still, for now, trigger the pipeline but also log a deprecation warning that this feature will be removed in the next major version.

Any final thoughts from @sebpiq, @jonathanong and @davidmarkclements before this hits master?

@lovell
Owner

lovell commented Aug 18, 2014

The streams2 branch is now merged to master and will be included in the forthcoming v0.6.0 release.

Thanks to everyone involved for all the help getting this far.

@davidmarkclements
Author

hey man, that's awesome - sorry for the lack of final thoughts

@lovell
Owner

lovell commented Aug 21, 2014

Closing this feature request - thank you everyone for all the suggestions and feedback - I look forward to hearing all about your use of Streams.

@lovell lovell closed this as completed Aug 21, 2014
@jcupitt
Contributor

jcupitt commented Oct 12, 2019

Hello, I've spent a few days implementing experimental support for true streaming:

libvips/libvips#1443

It seems to work. Any testing or feedback very welcome.
