-
Notifications
You must be signed in to change notification settings - Fork 6.3k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add guide: "Backpressuring in Streams" #1109
Conversation
cc: @nodejs/streams |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The work is solid, I think the intro should be changed to explain what is the problem that streams solve.
data and the _consumer_. | ||
|
||
In Node.js the source is a [`Readable`][] stream and the consumer is the | ||
[`Writable`][] stream (both of these may be interchanged with a [`Duplex`][] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
or a Transform
help expand your understanding while reading this guide and for following along | ||
with examples. | ||
|
||
## The Problem with Streams |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There is not a problem with streams. There is a problem that streams solve, which is backpressure, also called flow control in other communities. There are other solutions to backpressure. Unix pipes and TCP sockets solves this as well, and are the root of the following. Also, streams can be used for pull-based data handling.
The usual demo I do to show the problem is zipping a very large data file both in one go and with streams. As the first one crashes, the second one can work with data of any size. (the original credit goes to @davidmarkclements for that example). That shows the problem.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for this review @mcollina. So essentially there is not separation from streams & backpressure? As in ... data-handling is tricky, but streams solves it with backpressure?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As in "data handling is tricky, and streams is the solution that Node has adopted".
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Syntactically, I am still a bit confused. Wikipedia describes backpressure in IT as
the build-up of data behind an I/O switch if the buffers are full and incapable of receiving any more data; the transmitting device halts the sending of data packets until the buffers have been emptied and are once more capable of storing information.
In this instance .. is backpressure the problem or the I/O switch that streams implements?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In this instance .. is backpressure the problem or the I/O switch that streams implements?
Backpressure itself is the problem, and the backpressure handling mechanisms in the streams
classes are our solution to it. :)
+------------+ add chunk to queue | | ||
| <--^-------------------< | ||
+============+ | ||
``` |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This diagram is not exactly clear. Why everything is going back to pipe()
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good question. I assumed every event emitter was being found back into pipe, and then pipe would delegate what would do next. Is that incorrect?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@jessicaquynh it is incorrect. pipe
sets up some closures, who would keep working, but the method is not invoked anymore. The diagram should represent what pipe
set up, but not show pipe
at all: https://github.com/nodejs/node/blob/master/lib/_stream_readable.js#L473-L613
|
||
* If the write queue is busy, [`.write()`][] will return false. | ||
* If the data chunk is too large, [`.write()`][] will return false (the limit | ||
is indicated by the variable, [`highWaterMark`][]). |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If we are building a Writable
directly, this is not needed, as it is handled by the stream state machine. However, it is needed to use a Writable
directly, see my example in the Node docs: https://github.com/nodejs/node/blob/master/doc/api/stream.md#writablewritechunk-encoding-callback
@jessicaquynh can you incorporate Matteo's comments? |
@eljefedelrodeodeljefe yes definitely! sorry I've been mia on this issue .. will submit changes tomorrow night! |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nice work, and I'm excited to read the currently unfinished sections.
One thing I noticed is the emphasis on writable streams. Writing readable streams that stop when asked to do so is just as important as teaching the writable stream to say stop when needed IMO.
|
||
Once the the queue is finished, backpressure will allow data to be sent again. | ||
The space in memory that was being used will free itself up and prepare for the | ||
next glob of data. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
glob → blob / batch
|
||
# Backpressuring in Streams | ||
|
||
The purpose of this guide will describe what backpressure is and the |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
will → is to
|
||
This effectively allows an fixed amount of memory to be used at any given | ||
time for a [`.pipe()`][] function. There will be no memory leakage, no | ||
indefinite buffering, and the garbage collector will only have to deal with |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
indefinite → infinite
|
||
## Build a WritableStream | ||
|
||
Let's extend the prototypical function of [`.write()`][]: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We are not extending .write()
here, but extending stream.Writable
with a custom ._write()
method, which is then used internally by .write()
.
value for what system is running the application. In instances where you might | ||
want to raise that value, go for it, but do so with caution! | ||
|
||
## Build a WritableStream |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
"Writable Stream" (note the space) for consistency
var fs = require('fs'); | ||
|
||
var inputFile = fs.createReadStream('REALLY_BIG_FILE.x'); | ||
var outputFile = fs.createWriteStream('REALLY_BIG_FILE_DEST.x'); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
const
var inputFile = fs.createReadStream('REALLY_BIG_FILE.x'); | ||
var outputFile = fs.createWriteStream('REALLY_BIG_FILE_DEST.x'); | ||
|
||
// Secretly the stream is saying: "whoa, whoa! hang on, this is way too much!" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's not immediately clear that writing to the disk is slower than reading. Either mention it explicitly ("on a computer where writing is slower than reading"), or use a slower writable stream (http? zlib?)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Writing to the disk is slower than reading from it. It might not be clear for a reader though, so I'm 👍 in adding a zlib compression before writing down.
one area in memory! | ||
|
||
So, if backpressure is so important, why have you (probably) not heard of it? | ||
Well the answer is simple: Node.js does all of this automatically for you. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Make it be known somewhere that only .pipe()
does this automatically (readable.on('data', buf => writable.write(buf));
doesn't use backpressure, for example). I know in the next section you are looking more closely at .pipe()
, but it seems to be assumed that the reader knows all this is talking about .pipe()
.
an isolate application, both [`Readable`][] and [`Writable`][] streams | ||
should be present. If you are writing an application meant to accept a | ||
[`Readable`][] stream, or pipes to a [`Writable`][] stream from another app, | ||
you may omit this detail. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not sure what "application" / "app" means here. A standalone program? A library? A function?
## Example App | ||
|
||
Since [Node.js v0.10][], the [`Streams`][] class has offered the ability to | ||
overwrite the functionality of [`.read()`][] or [`.write()`][] by using the |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I wouldn't say "overwrite", but rather "modify the behavior"
As we have a backpressure (or flow-control) mechanism in place, the maximum amount of data that can flow is controlled by downstream, hence the focus on |
Sorry it took so long to update the guide. Hope I interpreted the change requests correctly. Ultimately, I am still not 100% what constitutes best practice for building streams. Would it be fair to say that best practice is also non-contradictory practice? As in, "you have a good stream if you did not do A, B, C"? |
I am almost always dropping 50-70% of the "best practices" for Streams. There are certain rules to follow, all the rest can be overruled. However, I would not recommend anyone to try writing something like https://github.com/mcollina/syncthrough without a great experience in the internals.
Yes! The golden rule of streams is to always respect backpressure, A) never The third rule: C) streams changes between different node versions, and the library you use. Be careful and test things. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Wow, this is great work. We are getting close, thanks so much for the effort in this. There is one thing that is a must to fix, and the rest is bonus content.
+------------+ add chunk to queue | | ||
| <---^---------------------< | ||
+============+ | ||
``` |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This diagram is making good progress! But I think you should split pipe()
in its own step of the diagram, and not make it part of the "loop". The event handlers are, so we should show the distinction.
The syntax for events, e.g. drain()
, is not common throughout the docs, so we should probaly use on('drain', cb)
instead.
also cc @lrlna who did some visualization on this in the past on paper.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Wondering if by this you meant something like:
+===================+
+--> Piping functions +---> src.pipe(dest) |
x are set up during |===================|
x the .pipe method. | Event callbacks |
+===============+ x |-------------------|
| Your Data | x They exist outside | .on('close', cb) |
+=======+=======+ x the data flow, but | .on('data', cb) |
| x importantly attach | .on('drain', cb) |
| x events, and their | .on('unpipe', cb) |
+-------v-----------+ x respective callbacks. | .on('error', cb) |
| Readable Stream +----+ | .on('finish', cb) |
+-^-------^-------^-+ | | .on('end', cb) |
^ | ^ | +-------------------+
| | | |
| ^ | |
^ ^ ^ | +-------------------+ +=================+
^ | ^ +----> Writable Stream +---------> .write(chunk) |
| | | +-------------------+ +=======+=========+
Or even have more explicit steps come in?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, something like that is way better.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
hi! this is so so good, i love it!
in terms of the diagram, i find the loop a little weird to get around and understand, although i definitely see where you're coming from.
What about making it a bit more linear? I am not good with replicating the fancy thing you have going, but something like this on paper?
an isolate application, both [`Readable`][] and [`Writable`][] streams | ||
should be present. If you are writing an application meant to accept a | ||
[`Readable`][] stream, or pipes to a [`Writable`][] stream from another | ||
component, you may omit this detail. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We should add a paragraph or a section here to discuss streams pipelines. What happens when a transform is in place?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Heh this might be rhetorical question, but I will answer just in case ... when Transform is in the pipeline, it takes in the Readable output and does the ._transform()
and pipes that into the destination Writable? Do you mean to explicate this process in the guide?
Edit: Jw, do Transform Streams or Duplex Streams have special conditions to respect backpressure as well?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A Transform
has both an outgoing and an incoming highWaterMark
. Most people write Transform
to setup pipelines, so it's important to explain how it all plays out when you chain multiple streams.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So, these are the true facts say about a Transform
stream:
It is important because its used often to setup pipelines. In it, there are _write()
and _read()
calls that must be implemented with the same care as with Readable
and Writable
(with respect to the return values .push()
and .write()
. And also HWM may change along the pipeline.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It is not correct. Transform
implementors should add _transform
, as it is a Duplex
that implement both _read
and _write
, i.e. Transform
has those method implemented.
|
||
} | ||
} | ||
``` |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We need an equal section on 'Write a Readable Stream'.
We should mention that [readable-stream
][https://github.com/nodejs/readable-stream] is the best way to build a stream application across multiple versions of Node.js, and the browser too! This will make your application have a stable base, and an implementation that changes with its own lifecycle compared to node core.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As a general question, is the primarily Node.js-based Readable
stream still the best option in the browser? In my experience, the most often cause of using Node.js Readable
in the browser is as a shim when creating Browserify bundle. At the same time, readable-stream
IIRC depends on additional polyfills like one for events
module. Plus, I know WHATWG has been working on their version of streams.
Regarding using the userland module in Node.js, definitely. You may also want to link to @rvagg's blog post
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
readable-stream
is the way to go in the browser. readable-stream
exports Readable
& co, so all that is written here is valid there as well.
By using streams in the browser, I mean mostly to leverage code already written elsewhere for the purpose, i.e. the streams ecosystem.
const inp = fs.createReadStream('The.Matrix.1080p.mkv'); | ||
const out = fs.createWriteStream('The.Matrix.1080p.mkv.gz'); | ||
|
||
inp.pipe(gzip).pipe(out); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
we should add a note here and refer to pump
. It's a must have, as this example does not handle error correctly.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is the vital difference between chaining pipes together vs pump have to do with error handling? As in chaining pipes together can be erroneous due to the fact there is no error handling? Heh, sorry for my ignorance!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes. the problem is that in case of errors the underlining file descriptor on the other side might not get closed.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Great to see how this is coming along. Apologies for the number of nits, just want to make sure those don't get forgotten when this is applied.
``` | ||
|
||
While that will take a few minutes to compress, in another shell we may run | ||
a script that takes Node.js' module [`Zlib`][], that wraps around another compression tool, [`gzip(1)`][]. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
"zlib
" is how the documentation refers to itself
inp.pipe(gzip).pipe(out); | ||
``` | ||
|
||
While the first `zip` function will ultimately fail, Node will be able to handle |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not sure what the "zip
function" is referring to. The command run in the shell?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes! I was referencing the example mcollina was describing. I hope I did it right, should this be more explicit?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The "function" was throwing me off a bit. And also why would it fail?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am not entirely sure. My layman's understanding of its failure is because of the lack of backpressure mechanism. There is a lot of data that is being compressed and the .zip file gets corrupted. But if @mcollina might be able to shed a technical light on this, it'd be much appreciated!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The reason why I asked is that I'm fairly certain zip
will not encounter this problem. Unlike Node.js, zip
and other UNIX utilities operate synchronously. What that means is that they only read()
as much as they need at one time, and immediately write()
it out afterwards. Since these two system calls are synchronous, there would not be any backpressure problems since by definition they only appear in an asynchronous system.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Interesting. So to be clear, synchronous systems do not encounter backpressure buildup like async does?
I also did test out the zip
tool and it does, in fact, corrupt the large file, though I am not sure why, if those are the mechanics behind data flow. (Though gzip
on my UNIX command line does not corrupt the large file.)
Any thoughts? My only guess could be because the filetype I am using is video!
* A very overworked garbage collector | ||
* Memory exhaustion | ||
|
||
## Overall Dip in System Performance |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Remove this heading maybe. The following paragraph seems to directly follow the list above.
`1.52 gb`. | ||
|
||
Without streams in place to delegate the backpressure, there is a memory space | ||
with an entire degree of magnitude greater that is being allocated. That is a |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think "order of magnitude" is the more common expression
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Tbh I went back to edit this many times and couldn't figure out why it sounded so off .. hehe thanks!
``` | ||
|
||
The maximum byte size occupied by virtual memory turns out to be approximately | ||
`1.52 gb`. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ditto
|
||
## Example App | ||
|
||
Since [Node.js v0.10][], the [`Streams`][] class has offered the ability to |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Stream
class
|
||
} | ||
} | ||
``` |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As a general question, is the primarily Node.js-based Readable
stream still the best option in the browser? In my experience, the most often cause of using Node.js Readable
in the browser is as a shim when creating Browserify bundle. At the same time, readable-stream
IIRC depends on additional polyfills like one for events
module. Plus, I know WHATWG has been working on their version of streams.
Regarding using the userland module in Node.js, definitely. You may also want to link to @rvagg's blog post
colleagues and friends. | ||
|
||
Be sure to read up more on [`Streams`][] for other [`EventEmitters`][] to help | ||
improve your knowledge when building applications with Node.js. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not sure what "for other EventEmitters
" is trying to say...
// stream is ready or not. | ||
readable.on('data', data => | ||
writable.write(data); | ||
); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Might want to add braces around writable.write(data)
just to be clear on
doesn't care about the return value of write()
.
is full (which will vary across different machines). Node.js allows you to set | ||
your own custom [`highWaterMark`][], but commonly, the default is the optimal | ||
value for what system is running the application. In instances where you might | ||
want to raise that value, go for it, but do so with caution! |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This section only deals with rules for writing to a Writable stream from outside the stream. We also need rules for writing to a Readable stream from inside that Readable stream (by respecting the return value of this.push()
).
Sorry for the static on this! I took a look back on this, realized some grammar was off and fixed some formatting. Added in a few pointers mentioned earlier in the thread, though they didn't necessarily close off all the reviews. Skimmed out the example application because I felt that was redundant since there are a rich supply of examples found in the documentation, and I didn't want to reinvent the wheel. I only have one question and that's in regards to the implementation of a Writable stream. Are there any mistakes one can make in Should the guide extend its scope and talk about creating a custom writable wtream from scratch then? As in:
Thanks for all the help! Hope to land this soon! |
@jessicaquynh ws.cork()
ws.write('hello ')
ws.write('world ')
ws.uncork()
ws.cork()
ws.write('from ')
ws.write('Matteo')
ws.uncork() This will cause two transfers into C++ realm, making the use of cork() a bit useless the correct pattern is: ws.cork()
ws.write('hello ')
ws.write('world ')
process.nextTick(doUncork, ws)
ws.cork()
ws.write('from ')
ws.write('Matteo')
process.nextTick(doUncork, ws)
// as a global function
function doUncork (stream) {
stream.uncork()
}
|
@mcollina Thank you for your input and help! I've added the changes to the writable section! |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM with a nit
is full (which will vary across different machines). Node.js allows you to set | ||
your own custom [`highWaterMark`][], but commonly, the default is the optimal | ||
value for what system is running the application. In instances where you might | ||
want to raise that value, go for it, but do so with caution! |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think you should write here the defaults, it usually helps people
LGTM |
Great, thank you! |
This is was an issue opened on the nodejs/node repository before guides were transferred directly to node/nodejs.org.
This PR contains a first go at a guide on backpressures, and missing some vital points that I do not feel equipped to answer. I have left some empty spaces for them! I would appreciate anyone's input on the content, structure, and any additions to the guide.
Some points to expand on would be the problems that streams pose which are solved by backpressure. I think I hit most of the points, but I don't think I'm versed enough to go into detail.
There is also space for an example app, but again, I am not familiar on the current best practices on building streams and for implementing your own backpressure.
Thanks!
Refs: nodejs/node#10766