
worker.postMessage performance #38780

Closed

ronag opened this issue May 23, 2021 · 12 comments
Labels
performance Issues and PRs related to the performance of Node.js. question Issues that look for answers. worker Issues and PRs related to Worker support.

Comments

@ronag
Member

ronag commented May 23, 2021

I'm using worker.postMessage(myString) to send data to my worker. However, I'm seeing significant performance overhead from postMessage: 16% of my total CPU time is spent there. The strings are usually quite short but can be up to 256 chars.

Where does the overhead of postMessage mostly come from, and are there ways to improve it? I might be interested in trying to improve it if someone could point me in the right direction.

I've been considering using a SharedArrayBuffer, writing the strings into the buffer, and then using Atomics.wait/notify. However, that feels wrong, as it basically just re-implements what postMessage could/should be doing anyway. Is MessageChannel faster?
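For reference, a minimal sketch of what I have in mind. The layout (a ready flag and a length word in front of up to 256 UTF-16 code units) is purely illustrative, and this single-slot version has no backpressure:

```js
// main.js
const { Worker } = require('worker_threads');

const sab = new SharedArrayBuffer(8 + 256 * 2); // [flag, length] + chars
const header = new Int32Array(sab, 0, 2);
const chars = new Uint16Array(sab, 8, 256);

const worker = new Worker('./worker.js', { workerData: sab });

function send(str) {
  // NOTE: a real version must not overwrite a message the worker
  // hasn't consumed yet (e.g. wait for the flag to return to 0).
  for (let i = 0; i < str.length; i++) chars[i] = str.charCodeAt(i);
  Atomics.store(header, 1, str.length); // publish the length
  Atomics.store(header, 0, 1);          // mark the slot as full
  Atomics.notify(header, 0);            // wake the worker blocked in Atomics.wait
}

send('hello');
```

```js
// worker.js
const { workerData } = require('worker_threads');
const header = new Int32Array(workerData, 0, 2);
const chars = new Uint16Array(workerData, 8, 256);

for (;;) {
  Atomics.wait(header, 0, 0); // block while the flag is still 0
  const len = Atomics.load(header, 1);
  const str = String.fromCharCode(...chars.subarray(0, len));
  Atomics.store(header, 0, 0); // mark the slot as free again
  // ... handle str ...
}
```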

@ronag ronag added question Issues that look for answers. worker Issues and PRs related to Worker support. labels May 23, 2021
@ronag
Member Author

ronag commented May 23, 2021

@jasnell @addaleax

@addaleax addaleax added the performance Issues and PRs related to the performance of Node.js. label May 23, 2021
@addaleax
Member

addaleax commented May 23, 2021

> Where does the overhead of postMessage mostly come from, and are there ways to improve it? I might be interested in trying to improve it if someone could point me in the right direction.

I did a quick flamegraph of the benchmark for this (`env NODE_RUN_BENCHMARK_FN=1 0x --kernel-tracing -o -- ./node benchmark/worker/messageport.js payload=string style=eventemitter n=3000000`); you can look at it here.

You can see that (expectedly) the V8 serialization and deserialization machinery comes with some overhead, and that calling into Node.js from C++ when a message is received also carries considerable overhead.

In terms of improving this, I have looked at that a few times, and… I think the low-hanging fruit here has been picked. But a lot of this is still worth putting effort into, because things like the call-into-JS machinery affect all of Node.js, not just MessagePort performance.

For example, the v8::Global allocation in AsyncHooks::push_async_context accounts for 2% of the benchmark here – if we avoided that and used v8::Local when we can and v8::Global only when we need to, we could save that cost for all async calls from C++ into JS, which would be great, but it would also make the async resource stack implementation even more complicated than it already is.

We could make the group_mutex_ of SiblingGroup an rwlock instead of a plain mutex – that would probably help BroadcastChannel performance more than generic MessageChannel, but even there it would speed up the common case a bit, because sending messages is much more common than creating or destroying channels. (→ #38783)

We could pass the context argument explicitly to the inner MessagePort::PostMessage overload, because the JS bindings look that up as well; that would save a GetCreationContext() call, which takes up about 1% of the benchmark. (→ #38784)

In the end, most things we can do here only shave off a single percent or two. I don’t know if that’s the kind of thing we’re looking for.

> I've been considering using a SharedArrayBuffer, writing the strings into the buffer, and then using Atomics.wait/notify. However, that feels wrong, as it basically just re-implements what postMessage could/should be doing anyway.

If you are in a situation where you can use Atomics.wait, then that’s not a bad idea. We do that in piscina as well – we use Atomics.wait in combination with worker_threads.receiveMessageOnPort, which lets us skip the overhead of calling back (including async tracking) into JS on the receiving side. And if you know that your data is a string, then writing it into an SAB is also going to reduce the overhead a bit.
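Roughly, the receiving side of that pattern looks like this. This is a sketch, not the actual piscina code; the counter layout and the handle() function are made up for illustration:

```js
// worker.js: `port` (one end of a MessageChannel, transferred in via
// workerData) and `signalBuffer` (a SharedArrayBuffer) are assumptions
// about how the two sides were wired up. The sending side does
// port.postMessage(msg), then Atomics.add()/Atomics.notify() on slot 0.
const { workerData, receiveMessageOnPort } = require('worker_threads');
const { port, signalBuffer } = workerData;
const signal = new Int32Array(signalBuffer);

function handle(message) { /* application logic */ }

let seen = 0;
for (;;) {
  Atomics.wait(signal, 0, seen); // sleep while the counter hasn't moved
  seen = Atomics.load(signal, 0);
  let entry;
  // Drain the port synchronously: no 'message' event, no async tracking.
  while ((entry = receiveMessageOnPort(port)) !== undefined) {
    handle(entry.message);
  }
}
```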

> Is MessageChannel faster?

Faster than … ? If you’re comparing it against SAB + Atomics, then no, that’s never going to be the case.

@ronag
Member Author

ronag commented May 23, 2021

> If you are in a situation where you can use Atomics.wait, then that’s not a bad idea. We do that in piscina as well – we use Atomics.wait in combination with worker_threads.receiveMessageOnPort, which lets us skip the overhead of calling back (including async tracking) into JS on the receiving side.

The receiving side is not that big of a problem. I'm mostly concerned about the overhead on the main thread.

@ronag
Member Author

ronag commented May 23, 2021

> You can see that (expectedly) the V8 serialization and deserialization machinery comes with some overhead, and that calling into Node.js from C++ when a message is received also carries considerable overhead.

Can the serialization overhead be reduced in some way? I would have hoped that passing a string would have low serialization overhead (i.e. just a malloc + memcpy). Does passing some form of typed/shared buffer have better perf?

@benjamingr
Member

> Does passing some form of typed/shared buffer have better perf?

Most likely, yes. I'd assume it also depends on the string itself (for example, whether it's still a tree internally or already flattened).
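For illustration, a sketch of the three options (`./worker.js` is assumed to exist):

```js
const { Worker } = require('worker_threads');
const worker = new Worker('./worker.js');

// Copy: the string goes through the V8 serializer on both sides.
worker.postMessage('x'.repeat(256));

// Transfer: the ArrayBuffer's backing store is moved to the worker;
// `ab` is detached (byteLength 0) on this side afterwards.
const ab = new ArrayBuffer(256);
worker.postMessage(ab, [ab]);

// Share: a SharedArrayBuffer needs neither a copy nor a transfer list;
// both threads see the same memory.
const sab = new SharedArrayBuffer(256);
worker.postMessage(sab);
```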

@addaleax
Member

@ronag If you’re worried about the sending side… I don’t know, I guess we could create a fast path for people who only transfer strings/typed arrays that skips the serialization steps and writes the data a bit more directly. It is true that the V8 serializer adds somewhat undue overhead in that case. This would definitely make the code quite a bit more complex, though.

We can also see if we can avoid std::shared_ptr<Message> in favor of std::unique_ptr<Message> for the non-broadcast cases; that might also help a bit (but, again, it makes the code more complex).

@jasnell
Member

jasnell commented May 23, 2021

> We could make the group_mutex_ of SiblingGroup an rwlock instead of a plain mutex...

That's probably just a good idea in general.

> For example, the v8::Global allocation in AsyncHooks::push_async_context accounts for 2% of the benchmark here – if we avoided that and used v8::Local when we can and v8::Global only when we need to, we could save that cost for all async calls from C++ into JS, which would be great, but it would also make the async resource stack implementation even more complicated than it already is.

Yeah, I was looking at that a while back and decided against doing anything, precisely because of the additional complexity. What I kept coming back to is the idea that the most helpful thing might actually be to teach V8 how to handle (and optimize) the async context itself, so that we wouldn't necessarily have to allocate anything extra per async call into JS. But that's a larger discussion than just postMessage.

Fiddling around with a few percentage points is not going to make a huge difference, no, but I do think it's worthwhile where it's not too much effort.

addaleax added a commit to addaleax/node that referenced this issue May 23, 2021
Since it is much more common to send messages than to add or
remove ports from a sibling group, using a rwlock is appropriate here.

Refs: nodejs#38780 (comment)
addaleax added a commit to addaleax/node that referenced this issue May 23, 2021
@ronag
Member Author

ronag commented May 23, 2021

> @ronag If you’re worried about the sending side… I don’t know, I guess we could create a fast path for people who only transfer strings/typed arrays that skips the serialization steps and writes the data a bit more directly. It is true that the V8 serializer adds somewhat undue overhead in that case. This would definitely make the code quite a bit more complex, though.
>
> We can also see if we can avoid std::shared_ptr<Message> in favor of std::unique_ptr<Message> for the non-broadcast cases; that might also help a bit (but, again, it makes the code more complex).

I think a fast path for buffers and strings would make sense.

@ronag
Member Author

ronag commented May 23, 2021

> We can also see if we can avoid std::shared_ptr<Message> in favor of std::unique_ptr<Message> for the non-broadcast cases; that might also help a bit (but, again, it makes the code more complex).

A little off topic, but looking at the source code I'm curious why we would even need a unique_ptr. Can't we just make Message movable and use move semantics? The extra allocation seems unnecessary.

Also, most of the lists in Message are probably quite short, so we could probably get away with a stack-based allocator for std::vector that falls back to dynamic allocation when the size outgrows the initial capacity (a bit like how std::string works behind the scenes).

Also, most of the fields in Message are unused most of the time, so we could alternatively lower the overhead a bit by allocating them lazily:

```cpp
struct Data {
  std::vector<std::shared_ptr<v8::BackingStore>> array_buffers_;
  std::vector<std::shared_ptr<v8::BackingStore>> shared_array_buffers_;
  std::vector<std::unique_ptr<TransferData>> transferables_;
  std::vector<v8::CompiledWasmModule> wasm_modules_;
};

MallocedBuffer<char> main_message_buf_;
std::optional<Data> data_;  // or std::unique_ptr<Data> data_;
```

@ronag
Member Author

ronag commented May 23, 2021

Using flattened strings seems to be significantly faster.
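For context, what I mean by a flattened string (the loop is just a contrived way to build up a rope, i.e. a "cons string", in V8):

```js
const { Worker } = require('worker_threads');
const flatstr = require('flatstr');

const worker = new Worker('./worker.js'); // './worker.js' assumed to exist

let s = '';
for (let i = 0; i < 1000; i++) s += 'x'; // each += grows the rope
flatstr(s); // forces V8 to flatten the rope into contiguous memory
worker.postMessage(s); // the serializer no longer has to walk a tree
```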

@addaleax
Member

> We can also see if we can avoid std::shared_ptr<Message> in favor of std::unique_ptr<Message> for the non-broadcast cases; that might also help a bit (but, again, it makes the code more complex).
>
> A little off topic, but looking at the source code I'm curious why we would even need a unique_ptr. Can't we just make Message movable and use move semantics? The extra allocation seems unnecessary.

Well, yes, but we do want std::shared_ptr when the message actually has multiple recipients (rather than duplicating the message). Fwiw, for most of the points you mention, std::variant is really the proper solution here, hence #38788.

> Using flattened strings seems to be significantly faster.

I don’t think that’s a surprise, although I would expect that serializing them is also a flattening operation.

@ronag
Member Author

ronag commented May 24, 2021

> I don’t think that’s a surprise, although I would expect that serializing them is also a flattening operation.

Yes, but in my case I was doing something like:

```js
worker.postMessage(`ADD/REMOVE_${data}`)
```

where data is a cached string that has been flattened with flatstr.

However, by doing some magic I managed to change it to:

```js
worker.postMessage(data) // TOGGLE ON/OFF instead of ADD/REMOVE, and hope we don't have any state bugs.
```

It's a bit unfortunate that I don't have a performant way to send meta with a payload.

Before this I also did:

```js
worker.postMessage({ data, type: 'add/remove' })
```

Improving this could also be relevant for piscina. I guess the most common use of piscina is to offload work from the main thread.
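One possible workaround for the meta problem, sticking with the SAB approach from earlier in the thread: carry the meta as an integer tag in an extra header slot instead of prefixing the string. The names and layout here are made up, and the worker side (Atomics.wait plus reading the tag) is omitted:

```js
const TYPE_ADD = 1, TYPE_REMOVE = 2; // hypothetical message types

const sab = new SharedArrayBuffer(12 + 256 * 2); // [flag, length, tag] + chars
const header = new Int32Array(sab, 0, 3);
const chars = new Uint16Array(sab, 12, 256);

function sendTagged(type, str) {
  for (let i = 0; i < str.length; i++) chars[i] = str.charCodeAt(i);
  Atomics.store(header, 2, type);       // the meta travels as one integer
  Atomics.store(header, 1, str.length);
  Atomics.store(header, 0, 1);          // mark the slot as full
  Atomics.notify(header, 0);
}

sendTagged(TYPE_ADD, data); // `data` is the cached, flattened string
```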

jasnell pushed a commit that referenced this issue May 25, 2021
Since it is much more common to send messages than to add or
remove ports from a sibling group, using a rwlock is appropriate here.

Refs: #38780 (comment)

PR-URL: #38783
Reviewed-By: James M Snell <[email protected]>
Reviewed-By: Joyee Cheung <[email protected]>
targos pushed a commit that referenced this issue Jun 11, 2021
Refs: #38780 (comment)

PR-URL: #38784
Reviewed-By: Gireesh Punathil <[email protected]>
Reviewed-By: Michaël Zasso <[email protected]>
Reviewed-By: James M Snell <[email protected]>
Reviewed-By: Colin Ihrig <[email protected]>
Reviewed-By: Minwoo Jung <[email protected]>