
Callback APIs should come first. Evented APIs should be scaffolded around them #188

Closed
bjouhier opened this issue Dec 20, 2014 · 13 comments
Labels: stream (Issues and PRs related to the stream subsystem.)

@bjouhier
Contributor

I'm posting this as a follow-up to #92, #89 and #153. This issue is particularly acute around streams, but the layering of events on top of callbacks is actually a more general issue in node APIs (the connect event, for example, could/should be layered on top of a connect(cb) call).

Counter truth 1: Stream APIs must be evented.

Wrong: read(cb) is sufficient for a readable stream API and write(data, cb) is sufficient for a writable stream. With these APIs, pipe can be implemented in 3 lines of code.
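
Here is a minimal sketch of what such a pipe could look like (not the ez-streams implementation; it assumes read(cb) delivers undefined at end of stream and that writing undefined ends the writer; a real implementation would also guard against deep synchronous recursion):

function pipe(read, write, cb) {
  read(function(err, data) {
    if (err) return cb(err);
    write(data, function(err) {
      if (err) return cb(err);
      if (data === undefined) return cb(); // undefined marks end of stream
      pipe(read, write, cb);
    });
  });
}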

Counter truth 2: Counter truth 1 is true because a callback API cannot handle backpressure.

Wrong again: it is all a question of encapsulation: the low level resource you are interacting with may have an evented API (pause/resume on the reading side, drain on the writing side) but you can encapsulate this low-level event handling into a callback API (read(cb) and write(data, cb)).

Once you have the callback APIs you don't need to worry about backpressure. It will happen naturally through the event loop. You will reason in terms of buffering rather than backpressure. Think of it: if data comes in fast you'll have to buffer it and then pause the input. Then, when someone calls your read, you can resume the stream to refill the buffer. Backpressure is handled inside the read call; it does not need to be exposed to the consumers of your stream. In other words, what comes in must come out; if nobody is ready to read, all you can do is buffer, and you'd rather close the tap.
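
For illustration, here is a rough sketch of such an encapsulation around a classic evented readable (node's actual stream internals are more involved; the highWaterMark parameter and the pause/resume calls are just placeholders):

function callbackReader(stream, highWaterMark) {
  var buffer = [], pending = [], ended = false, error = null;
  stream.on('data', function(chunk) {
    if (pending.length) return pending.shift()(null, chunk);
    buffer.push(chunk);
    if (buffer.length >= highWaterMark) stream.pause(); // close the tap
  });
  stream.on('end', function() {
    ended = true;
    while (pending.length) pending.shift()(null, undefined);
  });
  stream.on('error', function(err) {
    error = err;
    while (pending.length) pending.shift()(err);
  });
  return function read(cb) {
    if (error) return cb(error);
    if (buffer.length) {
      var chunk = buffer.shift();
      if (buffer.length < highWaterMark) stream.resume(); // reopen the tap
      return cb(null, chunk);
    }
    if (ended) return cb(null, undefined);
    pending.push(cb); // nothing buffered yet: wait for the next 'data' event
  };
}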

Truth 1: A callback API is easy to document, an evented one is not.

With a callback API, you just need to document what the method does and what its parameters are; you don't even need to say that cb will be called as cb(err) if there is an error and cb(null, result) otherwise because this is a general rule in node. You don't even need to document that cb will only be called once when the operation completes (successfully or not) because this too is a general rule.

With an evented API you need to document all the events and their parameters but this is not all: you also need to document how the events are sequenced and what expectations the consumer of the API can make on their sequencing. If you are on the producer side (implementing a stream) you must make sure that you meet these sequencing rules. This is the part that gets really tricky and is the source of so many questions/issues.

Truth 2: It is very easy to scaffold an evented API around a callback API. The reverse is more difficult.

Proof:

function eventify(read) {
  return function(cb) {
    var self = this; // `this` is expected to be the EventEmitter being scaffolded
    read(function(err, data) {
      if (err) self.emit('error', err);
      else if (data === undefined) self.emit('end'); // undefined marks end of stream
      else self.emit('data', data);
      cb(err, data);
    });
  };
}
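
To drive this facade you still need something pumping the callback side. A hypothetical driver loop (emitter stands for whichever EventEmitter the facade is bound to):

var pump = eventify(read);
(function loop() {
  pump.call(emitter, function(err, data) {
    if (err || data === undefined) return; // 'error' or 'end' has already been emitted
    loop();
  });
})();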

Truth 3: Rigorous error handling is possible (even easy) with a callback API.

This is still tricky (but possible) with raw callbacks, and it becomes easy with promises, generators and some of the other async solutions.

The big difference between a callback API and an evented API is that pipe will naturally take a callback in a callback API. The signature will be reader.pipe(writer, cb). The callback is called when the pipe operation completes. If the pipe fails cb will receive the error.

Also, it is better to have separate transform and pipe calls. The transform calls do not take a callback; they just pass errors along the chain. Only the pipe call takes a callback; it is always at the end of the chain. So the chain looks like source.transform(t1).transform(t2).pipe(writer, cb);

No error is lost in such a chain. If something fails, the error will always end up in the pipe callback.
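
As a sketch of why no error gets lost (again assuming the read(cb) convention above, and a simplified one-in/one-out transform):

// non-reducer: wraps the reader's read and just forwards errors downstream
function transformReader(read, fn) {
  return function(cb) {
    read(function(err, data) {
      if (err || data === undefined) return cb(err, data);
      fn(data, cb); // fn may itself fail asynchronously and call cb(err)
    });
  };
}

// pipe(transformReader(transformReader(read, t1), t2), write, cb)
// any error raised by the source, t1, t2 or the writer surfaces in cb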

Truth 4: a stream API can be content agnostic

No need to distinguish a text mode, a binary mode and an object mode at the stream level. The core API can just handle data in an agnostic way. The only thing that's needed to keep the API simple is a watchdog value for the end of stream: undefined is the natural candidate.

Truth 5: a callback API lends itself naturally to a monadic, extensible API.

With a callback API, all it takes to implement a readable stream is a simple read(cb) call. All the fancier stream API can be scaffolded around this single call, in a monadic style.

The monadic API will combine two flavors of calls: reducer calls (like pipe) that terminate a chain and take a continuation callback parameter and non-reducer calls (like transform) that produce another stream and can be chained. A chain that only contains non-reducers does not pull anything; it just lazily sits there. The reducer at the end of the chain triggers the pull.
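
For instance (a sketch, not the ez-streams code): a non-reducer like filter is lazy and pulls nothing until a reducer such as toArray starts calling it.

// non-reducer: returns a new read function, does not pull by itself
function filterReader(read, predicate) {
  return function(cb) {
    read(function next(err, data) {
      if (err || data === undefined) return cb(err, data);
      if (predicate(data)) return cb(null, data);
      read(next); // skip non-matching items and keep pulling
    });
  };
}

// reducer: terminates the chain and drives the pull
function toArray(read, cb) {
  var result = [];
  read(function next(err, data) {
    if (err) return cb(err);
    if (data === undefined) return cb(null, result);
    result.push(data);
    read(next);
  });
}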

Wrapping-up

The tone is probably not right but I would like to shake the coconut tree. I just see all these discussions going on forever around streams being complex, error handling being problematic, etc., when there is a simple way to address all this.

I know that I'm touching a sensitive point and that this may likely get closed with a laconic comment.

If streams were simple and well understood and if error handling was not a problem any more I would not post this. But this is not the case and the debates are coming back (see recent discussions). I know that it is "very late" to change things but apparently io.js is there to shake things up and maybe take some new directions. So I'm trying one last time.

I have a working implementation of all this (https://github.com/Sage/ez-streams) and we are using it extensively in our product. So this is not a fantasy. My goal is not to have core take it literally, just to consider the concept behind it.

Note: there are lots of similarities with @dominictarr's event-stream (https://github.com/dominictarr/event-stream) and with lazy.js (http://danieltao.com/lazy.js/). The main difference is that all the functions that may be grafted onto the chain (transforms, filters, mappers) are async functions by default in ez-streams.

@jonathanong
Contributor

-1 the only change i want to see in stream's API is for it to use https://github.com/whatwg/streams

@chrisdickinson
Contributor

Counter truth 1: Stream APIs must be evented.

I feel that the current relationship between Streams and EventEmitters is inverted. So yep, no argument from me here.

Counter truth 2: Counter truth 1 is true because a callback API cannot handle backpressure.

min-stream, I think, fully backs up your point here.

Truth 1: A callback API is easy to document, an evented one is not.

Callback-based streams are still going to have to document the sequence of events -- or put another way: no matter what, there will be a state machine that needs to be documented.

Truth 2: It is very easy to scaffold an evented APIs around a callback API. The reverse is more difficult.

Agreed. No argument here!

Truth 3: Rigorous error handling is possible (even easy) with a callback API.

Determining when and where to forward errors is tricky no matter what. Basing streams on callbacks does not necessarily imply that .pipe would grow a callback parameter -- the two items are separate. The question at the moment is: where do errors flow, since there may be multiple readers or writers on either side of a given stream? Should they flow up the stream, even though upstream may be agnostic of the errored stream? Downstream to (potentially multiple) child streams, who may themselves have multiple sources?

Truth 4: a stream API can be content agnostic

Agreed, though I disagree with "watchdog value." All values should be in-alphabet, the normal/ended state of the stream should be kept as a property of the stream. A read from an empty, "ended" stream should signal the "end" event to the reader. A stream may be ended but still have data buffered, waiting to be read. Sentinel "out of alphabet" values are an enticing, but brittle, solution.

Truth 5: a callback API lends itself naturally to a monadic, extensible API.

Yes and no. There will always be underlying sources that push data at you -- TCP sockets, for example -- that will have to be handled.

I know that I'm touching a sensitive point and that this may likely get closed with a laconic comment.

Hopefully this isn't too laconic!

I think you would be pleasantly surprised by the whatwg stream spec that @jonathanong pointed out. The intent of core, as I understand it, is to move in that direction. The WHATWG streams, having the benefit of hindsight, pull apart a lot of the concerns of Node streams quite nicely -- especially with regards to content introspection, buffering, and backpressure.

That said, the path from where we are today to where we'd like to be isn't set in stone. There are a lot of packages that depend on streams. Somewhat more onerously, those streams depend on a specific .pipe protocol, which is implemented in terms of event emitter events. This is the means by which streams1 streams communicate with streams2 and streams3 streams, not to mention readable-stream-originated streams objects. Forward progress will have to be incremental, and I suspect that the best way to go about it will be to implement the old-style stream spec on top of whatever will eventually subsume and replace it.

To @jonathanong's point: it is unlikely that we will be able to flip the switch and go directly from node streams to whatwg streams. There will almost certainly be a middle ground state that is reached -- if nothing else, because promises would have to be accepted into core before the stream spec.

In any case, thanks for the well-written issue.

@bjouhier
Contributor Author

@chrisdickinson Thanks a lot for the detailed reply. It feels good to engage in constructive discussions.

I had looked at the whatwg specs and yes, it seems to clean up a number of things. But I still find it hairy and I think that it mixes two levels: the device level and the stream level. If we distinguish these two levels we can get to monadic paradise at the stream level. If instead we couple them (which is the case in node streams and in whatwg) this is much more difficult.

Probably sounds cryptic but I'll explain below:

Terminology

To clarify things I need a bit of terminology. I will stay away from whatwg and concentrate on node and ez-streams for now, to keep things relatively simple.

In node, everything is a stream. If I take a chain like:

source.pipe(t1).pipe(t2).pipe(destination)

and if I cut it in two as:

var mid = source.pipe(t1);
mid.pipe(t2).pipe(destination);

Everything is a stream: source, mid, destination, t1 and t2 are streams.

In ez-streams, the chain is slightly different:

source.transform(t1).transform(t2).pipe(destination, cb)

let's cut it:

var mid = source.transform(t1);
mid.transform(t2).pipe(destination, cb);

Now I can introduce a bit of terminology:

  • source is a reader stream. If it is really the origin of the stream, it is a reader device.
  • mid is a reader stream. It is not a device because a transform has been applied.
  • t1 and t2 are transforms, they are not streams.
  • destination is a writer stream. If it is really the ultimate destination of the stream (writer streams can be transformed too), it is a writer device.

Inheritance: a device is always a stream.

Some examples:

  • a file reader, a mongodb collection reader, a SQL query reader, a stdin reader are all reader devices.
  • a file writer, a mongodb collection writer, a SQL table inserter, a stdout writer are all writer devices.
  • a gzip inflater or deflater, a json, xml or csv parser, a filter, a projection are all transforms.
  • a reader which has been deflated and parsed is a reader stream but not a device.

Note: it is also possible to create synthetic devices. For example, the number reader that I used to illustrate ez-streams.

Monadic stream API

My claim is that it is important to distinguish the device and stream levels, and that, if we do so, we can have a very simple, callback based, monadic API for streams. I won't give many details here; the ez-streams readme explains it all.

Just an important point about this API: as I already said, it consists of reducers and non-reducers. pipe is a reducer but it is not the only one; there is a whole slew of reducers: forEach, reduce (of course), every, some, toArray, etc. Similarly, transform is not the only non-reducer; there is a slew of them: map, filter, skip, limit, while, until, tee, etc. This entire API is scaffolded around 2 essential read and write calls. It's all callback based.
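
To make the scaffolding concrete, here is a rough sketch (not the ez-streams source; just the shape of a monadic wrapper whose chainable methods all reduce to the single underlying read(cb)):

function reader(read) {
  return {
    read: read,
    map: function(fn) { // non-reducer: lazily returns another reader
      return reader(function(cb) {
        read(function(err, data) {
          if (err || data === undefined) return cb(err, data);
          cb(null, fn(data));
        });
      });
    },
    forEach: function(fn, cb) { // reducer: terminates the chain and triggers the pull
      read(function next(err, data) {
        if (err) return cb(err);
        if (data === undefined) return cb();
        fn(data);
        read(next);
      });
    }
  };
}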

Devices

In this approach the role of a device is to deal with the low level device-specific issues. This is where we interact with low level events and handle backpressure. The device exposes a monadic stream API (reader or writer) and it may also expose a device-specific API.

I think that it is important to decouple the device-specific API from the stream API. This way we can keep the stream API as simple as possible and monadic (KISS). The device specific interactions are handled out-of-band, outside of the stream pipeline.

Back to the discussion

This being said, I can get back to the discussion points:

Documentation: there is no sequence of events in the proposed stream API. There is a single read or write call (*). Event handling is encapsulated in the devices. Events may or may not be exposed as events through device specific APIs. The state machine, if any, is inside the device. The stream API itself is easy to document (and implement).

(*) The reader design is lacking an abort() method which would always propagate upwards to the device. I have this in my backlog but haven't had time to implement it yet (it does not feel too hard though). In our app we currently always read streams until we get either an EOF or an error, so we haven't had a real need for it yet, but I feel that this is coming.

Error handling: the distinction between reducers and non-reducers helps a lot here. Every chain must be terminated by a reducer, and this is where errors will be propagated. The reducer will not always be a pipe but this is where errors will be propagated in all cases. The fact that transform is not a reducer simplifies things because 1) it does not actively pump data and 2) it just propagates errors downstream.

Multiple sources and destinations are still a bit experimental in the ez-streams design and we haven't used them much in our application yet (just simple tee for logging and debugging). It does not seem to introduce too many conceptual difficulties though. In the current implementation errors always propagate downstream to all branches. abort() would propagate upwards, but the tee and fork methods should have a setting to inhibit this propagation from specific branches (we don't want a whole chain to abort because it has been teed to a debugging output chain that got closed abruptly).

I think that having a differentiated terminology (devices, readers, writers, transforms, mappers, reducers) and a rich set of methods (map, filter, transform, tee, pipe, reduce, forEach, etc.) helps a lot. It makes it easy to assign responsibilities. Once you have said that reducers are responsible for handling errors, you have said it all. In the current node.js design this is much fuzzier and difficult to explain.

Watchdog value: this is a more minor issue. As undefined is not serializable in JSON or in XML (there is xsi:nil but nothing for undefined), it felt natural to use it as the watchdog. This can be easily changed by having read return a { value: val, ended: bool } object, or by having it throw a special exception. I just did not feel that this was worth the trouble (KISS - there is also a bit of performance overhead).

If a reader has buffered data and receives an EOF it should deliver things in order: first the data that it has buffered and then the EOF. Whether the EOF is represented by undefined or by something else does not make much difference.
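
A small sketch of that alternative, written as an adapter over the sentinel-style read (hypothetical, not part of ez-streams):

// wraps a sentinel-style read(cb) into one that returns { value, ended } envelopes
function envelope(read) {
  return function(cb) {
    read(function(err, data) {
      if (err) return cb(err);
      cb(null, data === undefined ? { ended: true } : { value: data, ended: false });
    });
  };
}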

Anyway, thanks again Chris for taking the time to respond. At a high level, I feel that there are two possible directions here. The first one is towards whatwg streams; it is quite device centric and more aligned with node's current design. The second is more towards lazy.js, more application-centric; it leaves the device details out of the picture and is more centered on data processing (transforms, filter, map, reduce).

@sonewman
Contributor

There is a lot of interesting information in this issue. Streams with a smaller interface would be of huge benefit. The more I think about it, it would be pretty simple to create a wrapper around them so that they stay compatible with the current streams we know.

@algesten

I like the structured categorising of streams. The only term I dislike is "devices", but then I may have missed some neckbeard classic book or something. Is there a precedent for calling producing/consuming streams "devices"?

@rlidwka
Contributor

rlidwka commented Dec 23, 2014

I like this suggestion. Not sure if we could completely replace the current API with a callback-based one, but it is worth discussing.

How about a user-land module that wraps streams in the suggested way, exposing a callback API?

@bjouhier
Contributor Author

@sonewman @rlidwka
ez-streams does the wrapping in both directions. I described the technique in a blog article a long time ago.

ez-streams is currently implemented with streamline.js but this is just an implementation detail. You can use it directly from pure callback code (and from promises, BTW).

@algesten I chose device because they are the real inputs and outputs of stream chains. This is where the chains interface with the physical world (files, networks, consoles, databases). A bit like the /dev/xxx in UNIX pipes.

@loveencounterflow

i think i go with everything in the proposal (having worked with streams for almost a year now), including the sentinel value of undefined a.k.a. void 0. This feels so much cleaner than having a more specific value (highland.js is fighting with that). One alternative i see is making the stream instance's attributes available to transformers through the callback, say, as .transform( function ( value, done ) { if ( done.end ) ... } ), done.state === 'ended' or similar. IMHO using callbacks to get NodeJS going was a great decision because it puts the entire thing on a sound and well-understood basis; there may actually be too many event emitters in core already.

Frankly, what really bothers me with event emitters is that there's

  • no (standard) way of listening to all events,
  • no introspection to get all the possible event names, and
  • no audible failure of any kind if you inadvertently bind to emitter.on( 'nosuchthing' )—the code just fails and you will never know why without reading the docs very carefully and spell-checking all your event bindings.

Now that's one conspiracy to make debugging a pain and three good reasons for not using them.

@loveencounterflow

BTW i think undefined has several properties that make it a good choice as an end-of-stream sentinel:

  • it is not usually used 'actively'—null is there to signal 'no value';
  • it is easy to test for as x === undefined (more properly x === void 0 though);
  • it is not JSON-serializable, but in a JSON stream over the wire you would have to agree on a meta-format (message format) anyhow, because for sure in that situation not everything can be in-alphabet (phone numbers once were in-alphabet, but that relied on having a common language and a meta-talk with someone at the local exchange).

One undesirable property of undefined is that JS code is prone to spewing those values out when accessing undefined object attributes. Other suitable values would be hard to find, though; NaN and isNaN come to mind.

All in all i find the proposal very intriguing; right now in my streaming code i use .pipe( $ function( data, send )... a lot, where $ turns a function into an (event/through) stream and send can also be used as send.error if need be; the advantage is you don't have to write cb(null, data) all the time.

@sonewman
Contributor

@loveencounterflow it is interesting what you say about listening to every event. Other libraries have implemented this. It would be a useful feature, but it would introduce an extra step in the event emitter code, which is at present a hot piece of code no matter the program.

  • no audible failure of any kind if you inadvertently bind to emitter.on( 'nosuchthing' )—the code just fails and you will never know why without reading the docs very carefully and spell-check all your event bindings.

The problem with this is that we can't set a predefined list of events beforehand (and I am not sure that we should have to) because you are always going to want to start listening to an event before it happens.

All of these things can simply be applied in a small module wrapping an event emitter. I don't think it is justified for this stuff to be applied at a lower level, where performance matters more.

The idea of changing the stream EOF has been a topic of discussion and @chrisdickinson has some good ideas about this.

All in all i find the proposal very intriguing; right now in my streaming code i use .pipe( $ function( data, send )... a lot, where $ turns a function into an (event/through) stream and send can also be used as send.error if need be; the advantage is you don't have to write cb(null, data) all the time.

I don't think we should change the node callback pattern for convenience; these are unique API ideas and would create additional overhead, creating more state-containing objects for every callback, as well as further complexity in use.

Again, this can easily be done in a wrapping module (which I assume you have already implemented) to use this pattern.

@loveencounterflow

+1 for not changing the callback pattern cb( error, data ) in core; it's too valuable.

@Fishrock123 Fishrock123 added the discuss Issues opened for discussions and feedbacks. label Jan 30, 2015
@Fishrock123 Fishrock123 added stream Issues and PRs related to the stream subsystem. and removed discuss Issues opened for discussions and feedbacks. labels Feb 26, 2015
@Fishrock123
Contributor

Closing as there isn't much actionable here; discuss the streams points in https://github.com/iojs/readable-stream if necessary.

@piscisaureus
Contributor

Note that iojs is currently moving towards a "callbacks-only" internal API, and EventEmitters will be an option for use cases where they are more convenient.
