
Add guide: "Backpressuring in Streams" #1109

Merged
merged 6 commits into from
Apr 22, 2017
264 changes: 264 additions & 0 deletions locale/en/docs/guides/backpressuring-in-streams.md
---
title: Backpressuring in Streams
layout: docs.hbs
---

# Backpressuring in Streams

The purpose of this guide is to describe what backpressure is and the
problems in streams that it solves. The first part of this guide will cover
these questions. The second part of the guide will introduce suggested best
practices to ensure your application's code is safe and optimized.

Member:

will → is to

We assume a little familiarity with the general definition of
[`backpressure`][], [`Buffer`][] and [`EventEmitters`][] in Node.js, as well as
some experience with [`Streams`][]. If you haven't read through those docs,
it's not a bad idea to take a look at the API documentation first, as it will
help expand your understanding while reading this guide and for following along
with examples.

## The Problem with Streams
Member:

There is not a problem with streams. There is a problem that streams solve, which is backpressure, also called flow control in other communities. There are other solutions to backpressure. Unix pipes and TCP sockets solve this as well, and are the root of the following. Also, streams can be used for pull-based data handling.

The usual demo I do to show the problem is zipping a very large data file both in one go and with streams. As the first one crashes, the second one can work with data of any size. (the original credit goes to @davidmarkclements for that example). That shows the problem.

Contributor Author:

Thanks for this review @mcollina. So essentially there is not separation from streams & backpressure? As in ... data-handling is tricky, but streams solves it with backpressure?

Member:

As in "data handling is tricky, and streams is the solution that Node has adopted".

Contributor Author:

Syntactically, I am still a bit confused. Wikipedia describes backpressure in IT as

the build-up of data behind an I/O switch if the buffers are full and incapable of receiving any more data; the transmitting device halts the sending of data packets until the buffers have been emptied and are once more capable of storing information.

In this instance .. is backpressure the problem or the I/O switch that streams implements?

Member:

In this instance .. is backpressure the problem or the I/O switch that streams implements?

Backpressure itself is the problem, and the backpressure handling mechanisms in the streams classes are our solution to it. :)


In a computer system, data is transferred from one process to another through
pipes and signals. In Node.js, we find a similar mechanism called
Member (@TimothyGu, Feb 20, 2017):

maybe also mention sockets?

redundant "and"

[`Streams`][]. Streams are great! They do so much for Node.js and almost every
part of the internal codebase utilizes the `Streams` class. As a developer, you
are more than encouraged to use them too!
Member:

Streams isn't really a thing. There is no Streams module or Streams class. Make it clear what exactly you are referring to for each instance of [Streams][]:

  • the stream module
  • streams as a concept
  • Readable or Writable classes

For this paragraph, I'd say something like this:

In Node.js, we find a similar mechanism called streams in the [`stream`
module][]. Streams are great! They do so much for Node.js and almost every part
of the internal codebase utilizes that module.

Ditto below.

Member:

@TimothyGu we have the Stream base class, from which each of the others inherits.

Contributor Author:

^ This is something I'd like to clarify. Is it true to say we have streams as a few cases:

  • Streams base class (which has the standard functions in which all subclass streams share)
  • Streams module (which includes Readable, Writable, etc)
  • streams as a concept (which is, in the scope of this guide, the way in which data flow is delegated)


```javascript
const readline = require('readline');

// process.stdin and process.stdout are both instances of Streams
const rl = readline.createInterface({
  input: process.stdin,
  output: process.stdout
});

rl.question('Why should you use streams? ', (answer) => {
  console.log(`Maybe it's ${answer}, maybe it's because they are awesome! :)`);

  rl.close();
});
```


There are different functions to transfer data from one process to another. In
Node.js, there is an internal built-in function called [`.pipe()`][]. There are
[other packages][] out there you can use too! Ultimately though, at the basic
level of this process, we have two separate components: the _source_ of the
data and the _consumer_.

In Node.js the source is a [`Readable`][] stream and the consumer is the
[`Writable`][] stream (both of these may be interchanged with a [`Duplex`][]
stream, but that is out-of-scope for this guide).

Member:

or a Transform

## Too Much Data, Too Quickly

There are instances where a [`Readable`][] stream might give data to the
[`Writable`][] much too quickly, much more than the consumer can handle!

When that occurs, the consumer will begin to queue all the chunks of data for
later consumption. The write queue will get longer and longer, and because of
this more data must be kept in memory until the entire process has completed.

```javascript
const fs = require('fs');

const inputFile = fs.createReadStream('REALLY_BIG_FILE.x');
const outputFile = fs.createWriteStream('REALLY_BIG_FILE_DEST.x');

// Secretly the stream is saying: "whoa, whoa! hang on, this is way too much!"
inputFile.pipe(outputFile);
```

Member:

const

Member:

It's not immediately clear that writing to the disk is slower than reading. Either mention it explicitly ("on a computer where writing is slower than reading"), or use a slower writable stream (http? zlib?)

Member:

Writing to the disk is slower than reading from it. It might not be clear for a reader though, so I'm 👍 in adding a zlib compression before writing down.


This is why backpressure is important. If a backpressure system were not
present, the process would use up your system's memory, effectively slowing
down other processes, and monopolizing a large part of your system until
completion.

This results in a few things:

* Memory exhaustion
* A very overworked garbage collector
* Slowing down all other current processes

## Memory Exhaustion



## Garbage Collection



## Overall Dip in System Performance



## How Does Backpressure Resolve These Issues?

The moment that backpressure is triggered can be narrowed exactly to the return
value of a `Writable`'s [`.write()`][] function. This return value is
determined by a few conditions, of course.

In any scenario where the data buffer has exceeded the [`highWaterMark`][] or
the write queue is currently busy, [`.write()`][] will return `false`.

When a `false` value is returned, the backpressure system kicks in. It will
pause the incoming [`Readable`][] stream from sending any data and wait until
the consumer is ready again.

Once the queue is finished, backpressure will allow data to be sent again.
The space in memory that was being used will free itself up and prepare for the
next batch of data.

Member:

glob → blob / batch


This effectively allows a fixed amount of memory to be used at any given
time for a [`.pipe()`][] function. There will be no memory leakage, no
infinite buffering, and the garbage collector will only have to deal with
one area in memory!

Member:

indefinite → infinite

So, if backpressure is so important, why have you (probably) not heard of it?
Well, the answer is simple: Node.js does all of this automatically for you.
Member:

Make it be known somewhere that only .pipe() does this automatically (readable.on('data', buf => writable.write(buf)); doesn't use backpressure, for example). I know in the next section you are looking more closely at .pipe(), but it seems to be assumed that the reader knows all this is talking about .pipe().


That's so great! But also not so great when we are trying to understand how to
implement our own custom streams.

## Lifecycle of `.pipe()`

To achieve a better understanding of backpressure, here is a flow-chart on the
lifecycle of a [`Readable`][] stream being [piped][] into a [`Writable`][]
stream:

```
+===============+
| Your_Data |
+=======+=======+
|
+-------v-----------+ +-------------------+ +=================+
| Readable Stream | | Writable Stream +---------> .write(chunk) |
+-------+-----------+ +---------^---------+ +=======+=========+
| | |
| +======================+ | +------------------v---------+
+-----> .pipe(destination) >---+ | Is this chunk too big? |
+==^=======^========^==+ | Is the queue busy? |
^ ^ ^ +----------+-------------+---+
| | | | |
| | | > if (!chunk) | |
^ | | emit .end(); | |
^ ^ | > else | |
| ^ | emit .write(); +---v---+ +---v---+
| | ^----^-----------------< No | | Yes |
^ | +-------+ +---v---+
^ | |
| ^ emit .pause(); +=================+ |
| ^---^---------------------+ return false; <-----+---+
| +=================+ |
| |
^ when queue is empty +============+ |
^---^-----------------^---< Buffering | |
| |============| |
+> emit .drain(); | <Buffer> | |
+> emit .resume(); +------------+ |
| <Buffer> | |
+------------+ add chunk to queue |
| <--^-------------------<
+============+
```

Member:

No need for _ here
Member:

This diagram is not exactly clear. Why is everything going back to pipe()?

Contributor Author:

Good question. I assumed every event emitter was being fed back into pipe, and then pipe would delegate what to do next. Is that incorrect?

Member:

@jessicaquynh it is incorrect. pipe sets up some closures, who would keep working, but the method is not invoked anymore. The diagram should represent what pipe set up, but not show pipe at all: https://github.com/nodejs/node/blob/master/lib/_stream_readable.js#L473-L613

Member:

This diagram is making good progress! But I think you should split pipe() in its own step of the diagram, and not make it part of the "loop". The event handlers are, so we should show the distinction.

The syntax for events, e.g. drain(), is not common throughout the docs, so we should probably use on('drain', cb) instead.

also cc @lrlna who did some visualization on this in the past on paper.

Contributor Author:

Wondering if by this you meant something like:

```
                                                     +===================+
                         +-->  Piping functions  +--->   src.pipe(dest)  |
                         x     are set up during     |===================|
                         x     the .pipe method.     |  Event callbacks  |
+===============+        x                           |-------------------|
|   Your Data   |        x     They exist outside    | .on('close', cb)  |
+=======+=======+        x     the data flow, but    | .on('data', cb)   |
        |                x     importantly attach    | .on('drain', cb)  |
        |                x     events, and their     | .on('unpipe', cb) |
+-------v-----------+    x     respective callbacks. | .on('error', cb)  |
|  Readable Stream  +----+                           | .on('finish', cb) |
+-^-------^-------^-+    |                           | .on('end', cb)    |
  ^       |       ^      |                           +-------------------+
  |       |       |      |
  |       ^       |      |
  ^       ^       ^      |    +-------------------+         +=================+
  ^       |       ^      +---->  Writable Stream  +--------->  .write(chunk)  |
  |       |       |           +-------------------+         +=======+=========+
```

Or even have more explicit steps come in?

Member:

Yes, something like that is way better.

Comment:

hi! this is so so good, i love it!

in terms of the diagram, i find the loop a little weird to get around and understand, although i definitely see where you're coming from.

What about making it a bit more linear? I am not good with replicating the fancy thing you have going, but something like this on paper?

[image: stream-backpressure diagram]



_Note_: The `.pipe()` function is typically where backpressure is invoked. In
an isolated application, both [`Readable`][] and [`Writable`][] streams
should be present. If you are writing an application meant to accept a
[`Readable`][] stream, or pipes to a [`Writable`][] stream from another app,
you may omit this detail.
Member:

Not sure what "application" / "app" means here. A standalone program? A library? A function?


## Example App

Since [Node.js v0.10][], the [`Streams`][] class has offered the ability to
modify the behavior of [`.read()`][] or [`.write()`][] by using the
underscore version of these respective functions ([`._read()`][] and
[`._write()`][]).

Member:

I wouldn't say "overwrite", but rather "modify the behavior"

There are guidelines documented for [implementing Readable streams][] and
[implementing Writable streams][]. We will assume you've read these over.

This application will do something very simple: take the data source and
perform something fun to the data, and then write it to another file.

## Rules to Abide by When Writing Custom Streams

Recall that a [`.write()`][] may return true or false depending on some
conditions. Thus, when building our own [`Writable`][] stream, we must pay
close attention to those conditions and the return value:

* If the write queue is busy, [`.write()`][] will return false.
* If the data chunk is too large, [`.write()`][] will return false (the limit
is indicated by the variable, [`highWaterMark`][]).
Member:

If we are building a Writable directly, this is not needed, as it is handled by the stream state machine. However, it is needed to use a Writable  directly, see my example in the Node docs: https://github.com/nodejs/node/blob/master/doc/api/stream.md#writablewritechunk-encoding-callback


_Note_: In most machines, there is a byte size that determines when a buffer
is full (which will vary across different machines). Node.js allows you to set
your own custom [`highWaterMark`][], but commonly, the default is the optimal
value for the system running the application. In instances where you might
want to raise that value, go for it, but do so with caution!

## Build a Writable Stream
Member:

"Writable Stream" (note the space) for consistency


Let's extend `stream.Writable` with a custom [`._write()`][] method, which is then used internally by [`.write()`][]:
Member:

We are not extending .write() here, but extending stream.Writable with a custom ._write() method, which is then used internally by .write().


```javascript
const Writable = require('stream').Writable;

class MyWritable extends Writable {
  constructor(options) {
    super(options);
  }

  _write(chunk, encoding, callback) {
    if (chunk.toString().indexOf('a') >= 0) {
      callback(new Error('chunk is invalid'));
    } else {
      callback();
    }
  }
}
```


## Conclusion

Streams are an often-used module in Node.js. They are important to the internal
structure, and for developers, to expand and connect across the Node.js modules
ecosystem.

Hopefully, you will now be able to troubleshoot, safely code your own
[`Writable`][] streams with backpressure in mind, and share your knowledge with
colleagues and friends.

Be sure to read up more on [`Streams`][] and [`EventEmitters`][] to help
improve your knowledge when building applications with Node.js.
Member:

Not sure what "for other EventEmitters" is trying to say...




[`Streams`]: https://nodejs.org/api/stream.html
[`Buffer`]: https://nodejs.org/api/buffer.html
[`EventEmitters`]: https://nodejs.org/api/events.html
[`Writable`]: https://nodejs.org/api/stream.html#stream_writable_streams
[`Readable`]: https://nodejs.org/api/stream.html#stream_readable_streams
[`Duplex`]: https://nodejs.org/api/stream.html#stream_duplex_and_transform_streams
[`.drain()`]: https://nodejs.org/api/stream.html#stream_event_drain
[`.read()`]: https://nodejs.org/docs/latest/api/stream.html#stream_readable_read_size
[`.write()`]: https://nodejs.org/api/stream.html#stream_writable_write_chunk_encoding_callback
[`._read()`]: https://nodejs.org/docs/latest/api/stream.html#stream_readable_read_size_1
[`._write()`]: https://nodejs.org/docs/latest/api/stream.html#stream_writable_write_chunk_encoding_callback_1

[implementing Writable streams]: https://nodejs.org/docs/latest/api/stream.html#stream_implementing_a_writable_stream
[implementing Readable streams]: https://nodejs.org/docs/latest/api/stream.html#stream_implementing_a_readable_stream

[other packages]: https://github.com/sindresorhus/awesome-nodejs#streams
[`backpressure`]: https://en.wikipedia.org/wiki/Back_pressure#Back_pressure_in_information_technology
[Node.js v0.10]: https://nodejs.org/docs/v0.10.0/
[`highWaterMark`]: https://nodejs.org/api/stream.html#stream_buffering

[`.pipe()`]: https://nodejs.org/docs/latest/api/stream.html#stream_readable_pipe_destination_options
[piped]: https://nodejs.org/docs/latest/api/stream.html#stream_readable_pipe_destination_options