
Stream performance improvement ideas #117

Open
benjamingr opened this issue Sep 21, 2023 · 32 comments

@benjamingr
Member

cc @anonrig from our chat

  • Reduce memory overhead by using bitmaps instead of many booleans for state (see the sketch after this list)
  • Use a deque instead of a linked list for the BufferList
  • Optimize for a single pipe rather than an array of pipes, and fall back to an array of pipes
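
A minimal sketch of the bitmap idea (the flag names and class are illustrative, not the actual ReadableState fields):

  // Illustrative only: pack boolean state fields into one integer instead
  // of one property per flag; accessors keep call sites unchanged.
  const kEnded = 1 << 0;
  const kFlowing = 1 << 1;
  const kObjectMode = 1 << 2;

  class State {
    constructor() {
      this.flags = 0; // replaces this.ended, this.flowing, this.objectMode, ...
    }
    get ended() { return (this.flags & kEnded) !== 0; }
    set ended(v) {
      this.flags = v ? (this.flags | kEnded) : (this.flags & ~kEnded);
    }
    // ...same pattern for the other flags
  }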

More wild ideas:

  • Split Readable into several classes and "bail out" from fast to slow ones (e.g. objectMode and bytes, multiple readers vs. one, flowing vs. non-flowing) based on most streams not changing their type during their lifetime.
  • Don't necessarily require a "full" event emitter for a stream unless the user is listening to an event, and "fast-path" if a stream is piped

cc @ronag @nodejs/streams

@ronag
Member

ronag commented Sep 21, 2023

cc @anonrig from our chat

  • Reduce memory overhead by using bitmaps instead of many booleans for state

Not sure whether this is worth the effort. Is memory usage a problem? Number of streams in a typical application * state size should be quite low, relatively speaking? I tried this once and the performance impact was negligible. I was thinking that being able to keep more of the state in a single cache line would make an impact, but I was wrong. Might be different now...

  • Use a deque instead of a linked list for the BufferList

I think adding unshift to FixedQueue would be one way to go?
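
For illustration, on a fixed-size circular buffer (a simplified model of one FixedQueue segment, not the actual internals), unshift is just a write one slot behind the read index:

  // Simplified ring buffer; size must be a power of two so the bitmask can
  // wrap the indices. Capacity checks omitted for brevity.
  class Ring {
    constructor(size) {
      this.list = new Array(size);
      this.head = 0; // next write slot
      this.tail = 0; // next read slot
    }
    push(item) {
      this.list[this.head] = item;
      this.head = (this.head + 1) & (this.list.length - 1);
    }
    shift() {
      const item = this.list[this.tail];
      if (item === undefined) return null;
      this.list[this.tail] = undefined;
      this.tail = (this.tail + 1) & (this.list.length - 1);
      return item;
    }
    // The proposed addition: put an item back at the front in O(1).
    unshift(item) {
      this.tail = (this.tail - 1) & (this.list.length - 1);
      this.list[this.tail] = item;
    }
  }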

  • Optimize for a single pipe rather than an array of pipes, and fall back to an array of pipes

Not sure what this is? We already optimized for single pipe? What am I missing?

More wild ideas:

  • Split Readable into several classes and "bail out" from fast to slow ones (e.g. objectMode and bytes, multiple readers vs. one, flowing vs. non-flowing) based on most streams not changing their type during their lifetime.

Difficult.

  • Don't necessarily require a "full" event emitter for a stream unless the user is listening to an event, and "fast-path" if a stream is piped

Skeptical if this is worth it.

Other stuff:

  • Add support in pipeline for a sync transformer function
  • Rewrite transform stream as a Readable which implements the Writable trait, avoiding the double buffering from Duplex.
  • Try to avoid nextTick calls. I have a PR for this.
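
For context, pipeline already accepts generator functions as transform steps; the sync-transformer idea would skip the async machinery per chunk (the second form below is hypothetical):

  const { pipeline, Readable } = require('node:stream');

  // Works today: an async generator function as the transform step.
  pipeline(
    Readable.from(['hello', ' ', 'world']),
    async function* (chunks) {
      for await (const chunk of chunks) yield chunk.toUpperCase();
    },
    process.stdout,
    (err) => { if (err) console.error(err); }
  );

  // The idea: accept a plain sync function per chunk (hypothetical, not
  // supported today), avoiding generator/promise overhead:
  // pipeline(src, (chunk) => chunk.toUpperCase(), dst, cb);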

@ronag
Member

ronag commented Sep 21, 2023

Do we have any actual profile data to back our efforts?

@ronag
Member

ronag commented Sep 21, 2023

TBH, I think the biggest impact we can have is to try and increase the highWaterMark across the board; e.g. increasing from 16k to 128k has a theoretical perf improvement of 8x in the context of per-chunk streams overhead.
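
For reference, that's just the per-stream highWaterMark option; the per-chunk overhead is amortized over however much each read buffers:

  const { Readable } = require('node:stream');

  // Byte streams default to a 16 KiB highWaterMark; raising it to 128 KiB
  // means each _read()/push() round trip moves 8x more data.
  const readable = new Readable({
    highWaterMark: 128 * 1024,
    read() {
      // ...fill the buffer with up to 128 KiB per call
    },
  });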

@benjamingr
Member Author

Not sure whether this is worth the effort. Is memory usage a problem? Number of streams in a typical application * state size should be quite low, relatively speaking? I tried this once and the performance impact was negligible. I was thinking that being able to keep more of the state in a single cache line would make an impact, but I was wrong. Might be different now...

I'll open a PR and we can check, but since streams are so omnipresent I would suspect so, especially when people construct smaller streams (e.g. to use the iterator helpers). I feel like doing the work and measuring would be easier, and I feel like I got sucked into a lot more talking than coding recently, so that would be fun.

I think adding unshift to FixedQueue would be one way to go?

That would work

Not sure what this is? We already optimized for single pipe? What am I missing?

We always store a .pipes array even though most streams only pipe to one place; we could make the default behavior (one pipe) faster and bail out on multiple pipes.
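
A sketch of that shape (illustrative, not the actual ReadableState code): keep a single destination in the common case and only allocate an array for the second pipe:

  // Illustrative: state.pipes is null, a single stream, or an array.
  function addPipe(state, dest) {
    if (state.pipes === null) {
      state.pipes = dest; // common case: exactly one destination
    } else if (Array.isArray(state.pipes)) {
      state.pipes.push(dest); // already bailed out
    } else {
      state.pipes = [state.pipes, dest]; // second pipe: bail out to an array
    }
  }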

Do we have any actual profile data to back our efforts?

No, I went through the V8 system analyzer with --log-all and noticed some stuff (optimizing streams wasn't the meeting topic, V8 was)

@ronag
Member

ronag commented Sep 21, 2023

We always store a .pipes array even though most streams only pipe to one place; we could make the default behavior (one pipe) faster and bail out on multiple pipes.

I don't see how this affects any hot path.

@benjamingr
Member Author

Basically all streams become slightly smaller and the lookup for the pipe becomes slightly faster, which would have a very small impact on "many big streams" but a measurable one on "many small streams".

@benjamingr
Member Author

Said PR nodejs/node#49745

@mcollina
Member

The main performance bottleneck in streams comes from the cost of buffering at each step of a stream pipeline. From both an algorithmic and memory perspective, this is heavy. Worse, we do this double buffering in http, because both the socket and the IncomingMessage are Readable.

We should internalize a library such as syncthrough or minipass (or one built with the same principles) that does not perform any buffering. Look at the benchmarks.
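
For reference, syncthrough's usage (as I recall from its README, so treat the details as approximate) looks like a Transform but invokes the function inline, with no internal buffering:

  const syncthrough = require('syncthrough');
  const fs = require('node:fs');

  fs.createReadStream(__filename)
    .pipe(syncthrough(function (chunk) {
      // The return value goes straight to the destination with no
      // intermediate buffering; returning null drops the chunk.
      return chunk.toString().toUpperCase();
    }))
    .pipe(process.stdout);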

@benjamingr
Member Author

@mcollina we can explore/add a BYOB fast mode where different steps of the transform mutate the data in place instead of copying it, which should ideally mean zero copies
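
A sketch of the intent with today's Transform (pushing the same buffer downstream instead of copying); it only stays safe if no other consumer can observe the chunk, which is exactly the locking problem raised below:

  const { Transform } = require('node:stream');

  // Sketch: mutate the incoming buffer in place and pass the same
  // allocation downstream, so this step performs zero copies.
  const xorMask = new Transform({
    transform(chunk, encoding, callback) {
      for (let i = 0; i < chunk.length; i++) chunk[i] ^= 0xff;
      callback(null, chunk); // same buffer, no copy
    },
  });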

@mcollina
Member

we can explore/add a BYOB fast mode where different steps of the transform mutate the data in place instead of copying it, which should ideally mean zero copies

That is impossible with the current API. We did a full design for this a long time ago, but there was no budget to make it happen. Using syncthrough/minipass where you'd use a Transform is a good intermediate step (you'd need this anyway for a BYOB solution).

@benjamingr
Member Author

@mcollina why? We keep the same design and just emit the buffers that were passed to us, keeping a pointer to what we're currently using, etc.?

@benjamingr
Member Author

@ronag trying the FixedQueue suggestion in nodejs/node#49754

@mcollina
Member

For a BYOB strategy to work, you need to "lock" the two stream sides so that others cannot interject and inspect the data. Our current stream implementation is designed as a one-to-many abstraction.

@benjamingr
Member Author

For a BYOB strategy to work, you need to "lock" the two stream sides so that others cannot interject and inspect the data.

Correct. We could, for example, have a base class that is not an event emitter and only supports pipelining (and composing etc.), where listening to events would throw or some such. It would need a lot of thought, but I think it can happen?

@mcollina
Member

The design for that was already done in https://github.com/Fishrock123/bob. I think this was 2017-2018 work.

@Uzlopak
Contributor

Uzlopak commented Sep 23, 2023

In Readable and ReadableState we do a stream instanceof DuplexStream check. Same with Writable.

Is there a way to avoid the instanceof? Maybe using a symbol attribute like kIsDuplexStream, set to true in the DuplexStream constructor and checked by Writable and Readable.
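
A sketch of that pattern (kIsDuplexStream is the name proposed above, not an existing internal symbol):

  const { Duplex } = require('node:stream');

  const kIsDuplexStream = Symbol('kIsDuplexStream'); // hypothetical

  class MyDuplex extends Duplex {
    constructor(options) {
      super(options);
      // Branded once at construction. Note: this runs after super(), so
      // any check inside the base constructors would not see it yet.
      this[kIsDuplexStream] = true;
    }
  }

  // Then, instead of `stream instanceof Duplex` (a prototype-chain walk),
  // a single monomorphic property read:
  function isDuplex(stream) {
    return stream != null && stream[kIsDuplexStream] === true;
  }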

@benjamingr
Member Author

@Uzlopak sure, though that would need to be a "fast path" while we still support instanceof for ecosystem compat.

@mcollina bob was a different thing than I had in mind, and a complete overhaul, though?

@Uzlopak
Contributor

Uzlopak commented Sep 23, 2023

I checked it locally, and it would break for Writable.

@benjamingr
Member Author

@Uzlopak why?

@Uzlopak
Contributor

Uzlopak commented Sep 23, 2023

I don't know. But if you check... Writable also doesn't have the following check:

  if (!(this instanceof Writable))
    return new Writable(options);

@vweevers

Some object-mode Readable implementations could benefit from a new pushv API: those that get data from their source in bulk and currently do for (const chunk of chunks) this.push(chunk) could instead do this.pushv(chunks) to skip repeated state checks.
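
A hypothetical sketch of that shape (pushv does not exist today, and fetchBatch stands in for any bulk source):

  const { Readable } = require('node:stream');

  class BulkSource extends Readable {
    constructor(fetchBatch) {
      super({ objectMode: true });
      this.fetchBatch = fetchBatch; // e.g. a database cursor returning arrays
    }
    _read() {
      const chunks = this.fetchBatch();
      // Today, one state check per chunk:
      //   for (const chunk of chunks) this.push(chunk);
      // Proposed, one state check for the whole batch:
      this.pushv(chunks);
    }
  }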

@ronag
Member

ronag commented Sep 23, 2023

Another thing we could look into is merging the Readable from undici into core. It has some smart tricks to reduce overhead.

https://github.com/nodejs/undici/blob/main/lib/api/readable.js

@benjamingr
Member Author

Sitting together with Ronag, he pointed out some more important/useful ideas for the bitmap:

  • errored -> move to flags (and mutate the shape on error, making the non-error path faster)
  • decoder -> kHasDecoder flag and optimize for the fast path
  • encoding -> optimize for utf-8 or buffer, and mutate the shape if it's different
  • defaultEncoding -> similarly (utf8 or buffer or different)
  • pipes -> optimize for a single pipe? Why the heck do we keep a ref to the pipe?
  • flowing -> convert to 2 bits (true/false/null; see the sketch below)
  • kPaused -> can just be a flag in the bitmap since that's private

We should also optimize for non-legacy streams and keep compat in a slow path.
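
For the flowing item, a sketch of folding the tri-state (true/false/null) into two bits of the same flags field (bit positions illustrative):

  // flowing is currently true/false/null; two bits can encode all three.
  const kFlowingSet = 1 << 4; // has flowing been decided at all?
  const kFlowingVal = 1 << 5; // if decided, is it flowing?

  function getFlowing(flags) {
    if ((flags & kFlowingSet) === 0) return null;
    return (flags & kFlowingVal) !== 0;
  }

  function setFlowing(flags, value) {
    if (value === null) return flags & ~(kFlowingSet | kFlowingVal);
    return value
      ? flags | kFlowingSet | kFlowingVal
      : (flags | kFlowingSet) & ~kFlowingVal;
  }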

@Uzlopak
Contributor

Uzlopak commented Sep 27, 2023

I know it sounds ignorant... Why do we call destroy on the next tick? All the other operations are done in the current loop. What makes destroy so special?

@mcollina
Member

I know it sounds ignorant... Why do we call destroy on the next tick? All the other operations are done in the current loop. What makes destroy so special?

To support this flow:

  const net = require('net');

  const server = net.createServer((socket) => {
    socket.on('error', ...)
  })

If we did not emit on the next tick, we could have streams that error before the user has a chance to attach a handler. Unfortunately this pattern is pervasive in our codebase and the ecosystem. A possible optimization in this regard is to not nextTick if there is a listener.
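
A sketch of that optimization (illustrative, not the actual destroy internals): only defer when nobody is listening yet:

  // Illustrative: emit synchronously when a handler is already attached,
  // and keep the nextTick only for the attach-after-construction pattern.
  // Caveat: error delivery becomes sometimes-sync (see the ordering
  // concern a few comments below).
  function emitErrorMaybeSync(stream, err) {
    if (stream.listenerCount('error') > 0) {
      stream.emit('error', err); // fast path
    } else {
      process.nextTick(() => stream.emit('error', err));
    }
  }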

@Uzlopak
Contributor

Uzlopak commented Sep 27, 2023

Then we should consider this. Actually, nextTick is the sole reason that destroy is slower than the other ops.

@benjamingr
Member Author

nextTick isn't that expensive (though its queue should use a deque if it doesn't already, assuming we can't move to V8's queueMicrotask for timing reasons), but the nextTick is also there to provide consistent ordering: code is never sometimes-synchronous, it always fires listeners asynchronously.

@ronag
Member

ronag commented Oct 4, 2023

Current state of recent improvements vs 20.7.0:

streams/readable-async-iterator.js sync='no' n=100000                                           ***      8.98 %       ±2.22%  ±3.09%  ±4.31%
streams/readable-async-iterator.js sync='yes' n=100000                                                   9.42 %       ±9.77% ±13.45% ±18.45%
streams/readable-bigread.js n=1000                                                                *      2.85 %       ±2.41%  ±3.44%  ±5.01%
streams/readable-bigunevenread.js n=1000                                                        ***     18.86 %       ±5.99%  ±8.44% ±12.03%
streams/readable-boundaryread.js type='buffer' n=2000                                            **     -2.81 %       ±1.83%  ±2.60%  ±3.78%
streams/readable-boundaryread.js type='string' n=2000                                             *     -1.24 %       ±1.04%  ±1.42%  ±1.94%
streams/readable-from.js n=10000000                                                             ***     -5.65 %       ±2.69%  ±3.69%  ±5.03%
streams/readable-readall.js n=5000                                                              ***      1.53 %       ±0.72%  ±1.01%  ±1.43%
streams/readable-unevenread.js n=1000                                                                    0.62 %       ±0.70%  ±0.96%  ±1.31%
streams/pipe-object-mode.js n=5000000                                                            **      7.60 %       ±4.26% ±5.85% ±7.97%
streams/pipe.js n=5000000                                                                         *     -3.14 %       ±2.61% ±3.58% ±4.88%
streams/writable-manywrites.js len=1024 callback='no' writev='no' sync='no' n=100000            ***      6.28 %       ±2.66%  ±3.72%  ±5.26%
streams/writable-manywrites.js len=1024 callback='no' writev='no' sync='yes' n=100000           ***     20.64 %       ±3.17%  ±4.37%  ±6.00%
streams/writable-manywrites.js len=1024 callback='no' writev='yes' sync='no' n=100000                    4.72 %       ±4.95%  ±6.96%  ±9.88%
streams/writable-manywrites.js len=1024 callback='no' writev='yes' sync='yes' n=100000          ***     31.86 %       ±3.09%  ±4.24%  ±5.81%
streams/writable-manywrites.js len=1024 callback='yes' writev='no' sync='no' n=100000                    5.64 %       ±5.93%  ±8.49% ±12.41%
streams/writable-manywrites.js len=1024 callback='yes' writev='no' sync='yes' n=100000          ***     11.81 %       ±3.08%  ±4.29%  ±5.98%
streams/writable-manywrites.js len=1024 callback='yes' writev='yes' sync='no' n=100000           **      5.64 %       ±3.65%  ±5.10%  ±7.15%
streams/writable-manywrites.js len=1024 callback='yes' writev='yes' sync='yes' n=100000                 20.52 %      ±25.07% ±34.52% ±47.41%
streams/writable-manywrites.js len=32768 callback='no' writev='no' sync='no' n=100000           ***      9.54 %       ±3.76%  ±5.28%  ±7.49%
streams/writable-manywrites.js len=32768 callback='no' writev='no' sync='yes' n=100000          ***    298.42 %       ±5.77%  ±7.97% ±11.00%
streams/writable-manywrites.js len=32768 callback='no' writev='yes' sync='no' n=100000           **     11.73 %       ±6.77%  ±9.53% ±13.57%
streams/writable-manywrites.js len=32768 callback='no' writev='yes' sync='yes' n=100000         ***    263.39 %       ±8.58% ±11.95% ±16.72%
streams/writable-manywrites.js len=32768 callback='yes' writev='no' sync='no' n=100000           **      7.93 %       ±5.03%  ±7.20% ±10.53%
streams/writable-manywrites.js len=32768 callback='yes' writev='no' sync='yes' n=100000         ***    258.94 %       ±8.28% ±11.86% ±17.33%
streams/writable-manywrites.js len=32768 callback='yes' writev='yes' sync='no' n=100000          **      9.24 %       ±5.66%  ±7.80% ±10.71%
streams/writable-manywrites.js len=32768 callback='yes' writev='yes' sync='yes' n=100000        ***    233.27 %       ±4.76%  ±6.74%  ±9.71%
streams/transform-manywrites.js len=1024 callback='no' sync='no' n=100000                       ***     13.15 %       ±2.51%  ±3.45%  ±4.74%
streams/transform-manywrites.js len=1024 callback='no' sync='yes' n=100000                      ***     22.76 %       ±4.67%  ±6.51%  ±9.09%
streams/transform-manywrites.js len=1024 callback='yes' sync='no' n=100000                              14.59 %      ±17.30% ±24.48% ±35.12%
streams/transform-manywrites.js len=1024 callback='yes' sync='yes' n=100000                             17.09 %      ±22.63% ±31.08% ±42.50%
streams/transform-manywrites.js len=32768 callback='no' sync='no' n=100000                      ***     14.05 %       ±2.87%  ±4.00%  ±5.61%
streams/transform-manywrites.js len=32768 callback='no' sync='yes' n=100000                     ***    188.51 %      ±27.29% ±39.20% ±57.63%
streams/transform-manywrites.js len=32768 callback='yes' sync='no' n=100000                     ***     11.64 %       ±5.04%  ±6.97%  ±9.64%
streams/transform-manywrites.js len=32768 callback='yes' sync='yes' n=100000                    ***    190.91 %       ±8.86% ±12.44% ±17.62%
streams/creation.js kind='duplex' n=5000000                                                     ***     29.24 %       ±5.16%  ±7.31% ±10.50%
streams/creation.js kind='readable' n=5000000                                                   ***     27.02 %       ±1.11%  ±1.53%  ±2.09%
streams/creation.js kind='transform' n=5000000                                                  ***     39.30 %       ±9.82% ±13.57% ±18.74%
streams/creation.js kind='writable' n=5000000                                                   ***     49.21 %       ±2.42%  ±3.33%  ±4.54%
streams/destroy.js kind='duplex' n=100000                                                       ***     34.21 %       ±4.67%  ±6.46%  ±8.93%
streams/destroy.js kind='readable' n=100000                                                     ***     21.11 %      ±10.58% ±14.68% ±20.41%
streams/destroy.js kind='transform' n=100000                                                    ***     29.88 %       ±7.28% ±10.02% ±13.76%
streams/destroy.js kind='writable' n=100000                                                              6.64 %       ±7.53% ±10.73% ±15.58%

@ronag
Member

ronag commented Oct 4, 2023

@benjamingr @mcollina ☝️

@benjamingr
Member Author

That's great

@anonrig
Member

anonrig commented Oct 4, 2023

Good work!

@benjamingr
Member Author

We can:

  • Still make the impl a lot more V8-friendly
  • Figure out how to buffer less for people who pipeline things
    • Figure out efficient allocation and memory layout more generally
    • Stop using linked lists for like anything ever
  • Figure out if we can move legacy compatibility for certain things to userland
  • Ideally make our web streams as fast as regular streams (doable)
