
worker.postMessage performance #38780

Closed

ronag opened this issue May 23, 2021 · 12 comments
Labels
performance Issues and PRs related to the performance of Node.js. question Issues that look for answers. worker Issues and PRs related to Worker support.

Comments

@ronag
Member

ronag commented May 23, 2021

I'm using worker.postMessage(myString) to send data to my worker. However, I'm seeing significant performance overhead from postMessage: 16% of my total CPU time is spent there. The strings are usually quite short but can be up to 256 chars.

Where does the overhead of postMessage mostly come from, and are there ways to improve it? I might be interested in trying to improve it if someone could point me in the right direction.

I've been considering using a SharedArrayBuffer, writing the strings into the buffer, and then using Atomics.wait/notify. However, that feels wrong, as it basically just re-implements what postMessage could/should be doing anyway. Is MessageChannel faster?
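For reference, a minimal sketch of what I have in mind. The layout (a ready flag and a length word in front of up to 256 UTF-16 code units) is purely illustrative, and this single-slot version has no backpressure:

```js
// main.js
const { Worker } = require('worker_threads');

const sab = new SharedArrayBuffer(8 + 256 * 2); // [flag, length] + chars
const header = new Int32Array(sab, 0, 2);
const chars = new Uint16Array(sab, 8, 256);

const worker = new Worker('./worker.js', { workerData: sab });

function send(str) {
  // NOTE: a real version must not overwrite a message the worker
  // hasn't consumed yet (e.g. wait for the flag to return to 0).
  for (let i = 0; i < str.length; i++) chars[i] = str.charCodeAt(i);
  Atomics.store(header, 1, str.length); // publish the length
  Atomics.store(header, 0, 1);          // mark the slot as full
  Atomics.notify(header, 0);            // wake the worker blocked in Atomics.wait
}

send('hello');
```

```js
// worker.js
const { workerData } = require('worker_threads');
const header = new Int32Array(workerData, 0, 2);
const chars = new Uint16Array(workerData, 8, 256);

for (;;) {
  Atomics.wait(header, 0, 0); // block while the flag is still 0
  const len = Atomics.load(header, 1);
  const str = String.fromCharCode(...chars.subarray(0, len));
  Atomics.store(header, 0, 0); // mark the slot as free again
  // ... handle str ...
}
```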

@ronag ronag added question Issues that look for answers. worker Issues and PRs related to Worker support. labels May 23, 2021
@ronag
Member Author

ronag commented May 23, 2021

@jasnell @addaleax

@addaleax addaleax added the performance Issues and PRs related to the performance of Node.js. label May 23, 2021
@addaleax
Member

addaleax commented May 23, 2021

> Where does the overhead of postMessage mostly come from, and are there ways to improve it? I might be interested in trying to improve it if someone could point me in the right direction.

I did a quick flamegraph of the benchmark for this (`env NODE_RUN_BENCHMARK_FN=1 0x --kernel-tracing -o -- ./node benchmark/worker/messageport.js payload=string style=eventemitter n=3000000`); you can look at it here.

You can see that (expectedly) the V8 serialization and deserialization machinery comes with some overhead, and that calling into Node.js from C++ when a message is received also carries considerable overhead.

In terms of improving this, I have looked at that a few times, and… I think the low-hanging fruit here has been picked. But a lot of this is still worth putting effort into, because things like the call-into-JS machinery affect all of Node.js, not just MessagePort performance.

For example, the v8::Global allocation in AsyncHooks::push_async_context accounts for 2% of the benchmark here – if we avoided that and used v8::Local when we can and v8::Global only when we need to, we could save that cost for all async calls from C++ into JS, which would be great, but it would also make the async resource stack implementation even more complicated than it already is.

We could make the group_mutex_ of SiblingGroup an rwlock instead of a plain mutex – that would probably help BroadcastChannel performance more than generic MessageChannel, but even there it would speed up the common case a bit, because sending messages is much more common than creating or destroying channels. (→ #38783)

We could pass the context argument explicitly to the inner MessagePort::PostMessage overload, because the JS bindings look that up as well; that would save a GetCreationContext() call, which takes up about 1% of the benchmark. (→ #38784)

In the end, most things we can do here only shave off a single percent or two. I don’t know if that’s the kind of thing we’re looking for.

> I've been considering using a SharedArrayBuffer, writing the strings into the buffer, and then using Atomics.wait/notify. However, that feels wrong, as it basically just re-implements what postMessage could/should be doing anyway.

If you are in a situation where you can use Atomics.wait, then that’s not a bad idea. We do that in piscina as well – we use Atomics.wait in combination with worker_threads.receiveMessageOnPort, which lets us skip the overhead of calling back (including async tracking) into JS on the receiving side. And if you know that your data is a string, then writing it into an SAB is also going to reduce the overhead a bit.
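Roughly, the receiving side of that pattern looks like this. This is a sketch, not the actual piscina code; the counter layout and the handle() function are made up for illustration:

```js
// worker.js: `port` (one end of a MessageChannel, transferred in via
// workerData) and `signalBuffer` (a SharedArrayBuffer) are assumptions
// about how the two sides were wired up. The sending side does
// port.postMessage(msg), then Atomics.add()/Atomics.notify() on slot 0.
const { workerData, receiveMessageOnPort } = require('worker_threads');
const { port, signalBuffer } = workerData;
const signal = new Int32Array(signalBuffer);

function handle(message) { /* application logic */ }

let seen = 0;
for (;;) {
  Atomics.wait(signal, 0, seen); // sleep while the counter hasn't moved
  seen = Atomics.load(signal, 0);
  let entry;
  // Drain the port synchronously: no 'message' event, no async tracking.
  while ((entry = receiveMessageOnPort(port)) !== undefined) {
    handle(entry.message);
  }
}
```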

> Is MessageChannel faster?

Faster than … ? If you’re comparing it against SAB + Atomics, then no, that’s never going to be the case.

@ronag
Member Author

ronag commented May 23, 2021

> If you are in a situation where you can use Atomics.wait, then that’s not a bad idea. We do that in piscina as well – we use Atomics.wait in combination with worker_threads.receiveMessageOnPort, which lets us skip the overhead of calling back (including async tracking) into JS on the receiving side.

The receiving side is not that big of a problem. I'm mostly concerned about the overhead on the main thread.

@ronag
Member Author

ronag commented May 23, 2021

> You can see that (expectedly) the V8 serialization and deserialization machinery comes with some overhead, and that calling into Node.js from C++ when a message is received also carries considerable overhead.

Can the serialization overhead be reduced in some way? I would have hoped that passing a string would have low serialization overhead (i.e. just a malloc + memcpy). Does passing some form of typed/shared buffer have better perf?

@benjamingr
Member

> Does passing some form of typed/shared buffer have better perf?

Most likely, yes. I'd assume it also depends on the string itself (for example, whether it's still a tree internally or already flattened).
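For illustration, a sketch of the three options (`./worker.js` is assumed to exist):

```js
const { Worker } = require('worker_threads');
const worker = new Worker('./worker.js');

// Copy: the string goes through the V8 serializer on both sides.
worker.postMessage('x'.repeat(256));

// Transfer: the ArrayBuffer's backing store is moved to the worker;
// `ab` is detached (byteLength 0) on this side afterwards.
const ab = new ArrayBuffer(256);
worker.postMessage(ab, [ab]);

// Share: a SharedArrayBuffer needs neither a copy nor a transfer list;
// both threads see the same memory.
const sab = new SharedArrayBuffer(256);
worker.postMessage(sab);
```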

@addaleax
Member

@ronag If you’re worried about the sending side… I don’t know, I guess we could create a fast path for people who only transfer strings/typed arrays that skips the serialization steps and writes the data a bit more directly. It is true that the V8 serializer adds somewhat undue overhead in that case. This would definitely make the code quite a bit more complex, though.

We can also see if we can avoid std::shared_ptr<Message> in favor of std::unique_ptr<Message> for the non-broadcast cases; that might also help a bit (but, again, it makes the code more complex).

@jasnell
Member

jasnell commented May 23, 2021

> We could make the group_mutex_ of SiblingGroup an rwlock instead of a plain mutex...

That's probably just a good idea in general.

> For example, the v8::Global allocation in AsyncHooks::push_async_context accounts for 2% of the benchmark here – if we avoided that and used v8::Local when we can and v8::Global only when we need to, we could save that cost for all async calls from C++ into JS, which would be great, but it would also make the async resource stack implementation even more complicated than it already is.

Yeah, I was looking at that a while back and decided against doing anything, precisely because of the additional complexity. What I kept coming back to is the idea that the most helpful thing might actually be to teach V8 how to handle (and optimize) the async context itself, so that we wouldn't necessarily have to allocate anything extra per async call into JS. But that's a larger discussion than just postMessage.

Fiddling around with a few percentage points is not going to make a huge difference, no, but I do think it's worthwhile where it's not too much effort.

addaleax added a commit to addaleax/node that referenced this issue May 23, 2021
Since it is much more common to send messages than to add or
remove ports from a sibling group, using a rwlock is appropriate here.

Refs: nodejs#38780 (comment)
addaleax added a commit to addaleax/node that referenced this issue May 23, 2021
@ronag
Member Author

ronag commented May 23, 2021

> @ronag If you’re worried about the sending side… I don’t know, I guess we could create a fast path for people who only transfer strings/typed arrays that skips the serialization steps and writes the data a bit more directly. It is true that the V8 serializer adds somewhat undue overhead in that case. This would definitely make the code quite a bit more complex, though.
>
> We can also see if we can avoid std::shared_ptr<Message> in favor of std::unique_ptr<Message> for the non-broadcast cases; that might also help a bit (but, again, it makes the code more complex).

I think a fast path for buffers and strings would make sense.

@ronag
Member Author

ronag commented May 23, 2021

> We can also see if we can avoid std::shared_ptr<Message> in favor of std::unique_ptr<Message> for the non-broadcast cases; that might also help a bit (but, again, it makes the code more complex).

A little off topic, but looking at the source code I'm curious why we would even need a unique_ptr. Can't we just make Message movable and use move semantics? The extra allocation seems unnecessary.

Also, most of the lists in Message are probably quite short, so we could probably get away with a stack-based allocator for std::vector that falls back to dynamic allocation when the size outgrows the initial capacity (a bit like how std::string works behind the scenes).

Also, most of the fields in Message are unused most of the time, so we could alternatively lower the overhead a bit by allocating them lazily:

```cpp
struct Data {
  std::vector<std::shared_ptr<v8::BackingStore>> array_buffers_;
  std::vector<std::shared_ptr<v8::BackingStore>> shared_array_buffers_;
  std::vector<std::unique_ptr<TransferData>> transferables_;
  std::vector<v8::CompiledWasmModule> wasm_modules_;
};

MallocedBuffer<char> main_message_buf_;
std::optional<Data> data_;  // or std::unique_ptr<Data> data_;
```

@ronag
Member Author

ronag commented May 23, 2021

Using flattened strings seems to be significantly faster.
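For context, what I mean by a flattened string (the loop is just a contrived way to build up a rope, i.e. a "cons string", in V8):

```js
const { Worker } = require('worker_threads');
const flatstr = require('flatstr');

const worker = new Worker('./worker.js'); // './worker.js' assumed to exist

let s = '';
for (let i = 0; i < 1000; i++) s += 'x'; // each += grows the rope
flatstr(s); // forces V8 to flatten the rope into contiguous memory
worker.postMessage(s); // the serializer no longer has to walk a tree
```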

@addaleax
Member

> We can also see if we can avoid std::shared_ptr<Message> in favor of std::unique_ptr<Message> for the non-broadcast cases; that might also help a bit (but, again, it makes the code more complex).
>
> A little off topic, but looking at the source code I'm curious why we would even need a unique_ptr. Can't we just make Message movable and use move semantics? The extra allocation seems unnecessary.

Well, yes, but we do want std::shared_ptr when the message actually has multiple recipients (rather than duplicating the message). Fwiw, for most of the points you mention, std::variant is really the proper solution here, hence #38788.

> Using flattened strings seems to be significantly faster.

I don’t think that’s a surprise, although I would expect that serializing them is also a flattening operation.

@ronag
Member Author

ronag commented May 24, 2021

> I don’t think that’s a surprise, although I would expect that serializing them is also a flattening operation.

Yes, but in my case I was doing something like:

```js
worker.postMessage(`ADD/REMOVE_${data}`)
```

where data is a cached string that has been flattened with flatstr.

However, by doing some magic I managed to change it to:

```js
worker.postMessage(data) // TOGGLE ON/OFF instead of ADD/REMOVE, and hope we don't have any state bugs.
```

It's a bit unfortunate that I don't have a performant way to send meta with a payload.

Before this I also did:

```js
worker.postMessage({ data, type: 'add/remove' })
```

Improving this could also be relevant for piscina. I guess the most common use of piscina is to offload work from the main thread.
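One possible workaround for the meta problem, sticking with the SAB approach from earlier in the thread: carry the meta as an integer tag in an extra header slot instead of prefixing the string. The names and layout here are made up, and the worker side (Atomics.wait plus reading the tag) is omitted:

```js
const TYPE_ADD = 1, TYPE_REMOVE = 2; // hypothetical message types

const sab = new SharedArrayBuffer(12 + 256 * 2); // [flag, length, tag] + chars
const header = new Int32Array(sab, 0, 3);
const chars = new Uint16Array(sab, 12, 256);

function sendTagged(type, str) {
  for (let i = 0; i < str.length; i++) chars[i] = str.charCodeAt(i);
  Atomics.store(header, 2, type);       // the meta travels as one integer
  Atomics.store(header, 1, str.length);
  Atomics.store(header, 0, 1);          // mark the slot as full
  Atomics.notify(header, 0);
}

sendTagged(TYPE_ADD, data); // `data` is the cached, flattened string
```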

jasnell pushed a commit that referenced this issue May 25, 2021
Since it is much more common to send messages than to add or
remove ports from a sibling group, using a rwlock is appropriate here.

Refs: #38780 (comment)

PR-URL: #38783
Reviewed-By: James M Snell <[email protected]>
Reviewed-By: Joyee Cheung <[email protected]>
targos pushed a commit that referenced this issue Jun 11, 2021
Refs: #38780 (comment)

PR-URL: #38784
Reviewed-By: Gireesh Punathil <[email protected]>
Reviewed-By: Michaël Zasso <[email protected]>
Reviewed-By: James M Snell <[email protected]>
Reviewed-By: Colin Ihrig <[email protected]>
Reviewed-By: Minwoo Jung <[email protected]>