
WebSockets Next: performance improvements #39148

Open
mkouba opened this issue Mar 4, 2024 · 31 comments
@mkouba (Contributor) commented Mar 4, 2024

Description

Follow-up of #39142.

Implementation ideas

@mkouba (Contributor, Author) commented Apr 23, 2024

#40183 is related to this issue.

@mkouba (Contributor, Author) commented Jun 25, 2024

For this one, we'll need some benchmarks (executed automatically). Ideally, we'd compare WebSockets Next with quarkus-websockets and pure Vert.x.

CC @franz1981

@franz1981 (Contributor) commented:

I'm not aware of websocket benchmark suites sadly...

@mkouba (Contributor, Author) commented Jun 25, 2024

I'm not aware of websocket benchmark suites sadly...

Me neither. We'll need to come up with something... ;-)

@franz1981 (Contributor) commented:

Let's talk this week in a call and we can think about something

@mkouba (Contributor, Author) commented Jun 27, 2024

NOTE: This extension contains a bunch of lambdas. We should consider rewriting those lambdas to anonymous classes.
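For illustration, the rewrite suggested in the note above looks roughly like this (a generic handler sketch with hypothetical names, not the extension's actual code):

```java
import java.util.function.Consumer;

class Handlers {
    // Lambda version: translated to an invokedynamic call site; the
    // implementing class is generated at runtime on first use
    static Consumer<StringBuilder> lambdaHandler() {
        return sb -> sb.append("handled");
    }

    // Anonymous-class version: the class exists at build time, which can make
    // startup behaviour and ahead-of-time/native-image analysis more predictable
    static Consumer<StringBuilder> anonymousHandler() {
        return new Consumer<StringBuilder>() {
            @Override
            public void accept(StringBuilder sb) {
                sb.append("handled");
            }
        };
    }
}
```

Both forms behave identically at the call site; the trade-off is purely in how and when the implementing class is materialized.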

@cescoffier (Member) commented:

We need to think about scenarios to test the performances. Response time is not a meaningful metric. The number of messages and connections are more sensitive in this case. (Of course, memory is important too).

@franz1981 (Contributor) commented Oct 2, 2024

Yeah, although achieving 10K msg/sec with a single outlier at 10 seconds of latency is something you'd want to know about.

I have contacted the Jetty team (Simone Bordet), and they have rolled out a coordinated-omission-free, distributed (if required) load generator for WebSocket; let's see what we can do in Hyperfoil or by reusing theirs, which is built for this exact purpose.

@cescoffier (Member) commented:

Oh, that would be good!

@franz1981 (Contributor) commented Oct 2, 2024

And clearly it does not cover WebSocket, see https://github.com/jetty-project/jetty-load-generator :'(

Which means that we should prioritize supporting websockets for Hyperfoil or find a different benchmarking tool (which is coordinated-omission free - not easy)
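For readers unfamiliar with coordinated omission: a closed-loop load generator that only sends the next request after the previous response arrives silently drops the latency of requests that *would* have been issued during a stall. A toy simulation (all numbers invented for illustration) shows the effect:

```java
// Toy model of coordinated omission: one 1000 ms server stall among 1 ms
// requests, with an intended send interval of 10 ms.
class CoordinatedOmission {
    static final long[] SERVICE_MS = {1, 1, 1, 1000, 1, 1, 1};
    static final long INTERVAL_MS = 10;

    // Closed-loop measurement: latency == service time, so only the stalled
    // request itself looks slow
    static int slowSamplesClosedLoop() {
        int n = 0;
        for (long s : SERVICE_MS) if (s > 100) n++;
        return n;
    }

    // Corrected measurement: latency is counted from the *intended* send time,
    // so requests queued behind the stall also record their waiting time
    static int slowSamplesCorrected() {
        long clock = 0;
        int n = 0;
        for (int i = 0; i < SERVICE_MS.length; i++) {
            long intendedStart = i * INTERVAL_MS;
            long start = Math.max(clock, intendedStart);
            long end = start + SERVICE_MS[i];
            if (end - intendedStart > 100) n++;
            clock = end;
        }
        return n;
    }
}
```

With these numbers the closed-loop view reports 1 slow sample while the corrected view reports 4, which is why a coordinated-omission-free generator matters for latency percentiles.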

@mkouba (Contributor, Author) commented Oct 2, 2024

Which means that we should prioritize supporting websockets for Hyperfoil or find a different benchmarking tool (which is coordinated-omission free - not easy)

For load tests where we don't care about coordinated omission and throughput, we could try the Gatling WebSocket protocol or even a simple Vert.x client.

@cescoffier (Member) commented:

I used Gatling in the past.

@mkouba (Contributor, Author) commented Oct 7, 2024

So I started to play with a simple Vert.x client, Gatling, etc. here: https://github.com/mkouba/ws-next-perf.

And it seems that at moderate load the performance of quarkus-websockets and quarkus-websockets-next is more or less the same. However, under heavy load (in my test, 10,000 concurrent users each sending 1,000 and receiving 1,000 messages) the performance of quarkus-websockets degrades significantly. I did some CPU profiling and didn't find anything obvious in the WS next code.

Apparently the biggest problem is switching to a worker thread because the tested @OnTextMessage callback has a blocking signature. If we switch to Uni<String> (i.e. callback executed on the event loop) then the performance is significantly better but still not better than the legacy extension. However, the blocking signature is probably what most users will use anyway...

@franz1981 Could you pls take a look at the attached flamegraph?

cpu-profile.html.zip

@franz1981 (Contributor) commented Oct 7, 2024

I see there's a fun problem with synchronizers, which can hurt scalability and performance pretty badly (and RSS, because "inflated" monitors increase RSS on HotSpot), i.e. io/vertx/core/http/impl/ServerWebSocketImpl.tryHandshake (and io/vertx/core/net/impl/ConnectionBase.queueForWrite as well), which protects the handshake via a synchronized guard.
You can confirm this by collecting profiling data using -e lock -t (add t as well to see which threads are competing to enter the lock).
The suggestion here is to have fewer worker threads competing with each other, but they will likely still compete with the I/O threads (or not - we need the profiling via -e lock -t).
I believe the check performed in io/vertx/core/http/impl/ServerWebSocketImpl.tryHandshake can be improved via some volatile guard to avoid the lock - and we will fly.
I can create a specific Vert.x microbenchmark for this in Vert.x itself.
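The volatile-guard idea is essentially a check-then-lock fast path; a minimal sketch (a hypothetical `HandshakeGuard` class, not Vert.x's actual code) could look like:

```java
// Sketch: once the handshake has completed, later callers read a volatile
// flag and never touch the monitor, so steady-state traffic is lock-free.
class HandshakeGuard {
    private volatile boolean handshakeDone;
    private int handshakeCount; // guarded by the monitor, for illustration

    void ensureHandshake() {
        if (handshakeDone) {
            return; // fast path: no monitor acquired after the first handshake
        }
        synchronized (this) {
            if (!handshakeDone) {      // re-check under the lock
                handshakeCount++;      // stand-in for the real handshake work
                handshakeDone = true;  // publish: subsequent callers skip the lock
            }
        }
    }

    int handshakeCount() {
        synchronized (this) {
            return handshakeCount;
        }
    }
}
```

The volatile read keeps the happy path cheap while the double-checked block still guarantees the handshake runs exactly once.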

important note:
I'm at devoxx and I didn't yet look at the bench itself but 2 things after looking at the data:

  1. collect profiling data after application warm-up completes: I can see C2 frames, meaning that compilation is still going on (after ~10K or fewer invocations things will smooth out)
  2. it looks like it is intentionally a CPU-bound computation: is that what we expect? I would, instead, add a parametrized fake blocking call (Thread::sleep(configuredFakeBlockingWork)) to get some really blocking behaviour when we run things on the worker thread pool - this will make it more realistic.

The last point is key to understanding that if users make use of the worker thread pool they are supposed to perform blocking operations (in the form of 10/100 ms of work each), which guarantees two effects:

  1. less contention over the synchronized parts - likely
  2. less "oversubscription" of worker threads vs the available cores, i.e. the worker thread pool has Math::min(cores * 8, 200) threads, which means they need to interleave to make progress... if you make them CPU-bound performing little non-blocking work, this can stress a LOT of other OS mechanisms which won't be stressed in the real world

As usual, I love that you're so proactive and quick to react, @mkouba - thanks again for taking both the time and effort for the test + collecting data: this will make it so much easier for me to help! ❤

@franz1981 (Contributor) commented:

In addition, this is another low-hanging fruit I can help with:

[Image: screenshot of the WebSocketImplBase code linked below]

https://github.com/eclipse-vertx/vert.x/blob/916ae9911dbb2a8cf818eee6b5390f62f37fce00/vertx-core/src/main/java/io/vertx/core/http/impl/WebSocketImplBase.java#L478-L493

But I need to check more carefully: I'm adding this note for my future self.

@mkouba (Contributor, Author) commented Oct 7, 2024

I see there's a fun problem with synchronizers, which can impact pretty bad scalability and perf (and RSS because "inflated" monitors increase RSS on Hotspot) i.e. io/vertx/core/http/impl/ServerWebSocketImpl.tryHandshake (and io/vertx/core/net/impl/ConnectionBase.queueForWrite as well) which is protecting the handshake via a synchronized guard.

Yes, I noticed this part as well.

I can create a specific vertx microbenchmark for this in vertx itself

That would be great.

  1. collect profiling data after application warm-up completes: I can see C2 frames, meaning that compilation is still going on (after ~10K or fewer invocations things will smooth out)

cpu-profile.html_02.zip

  2. it looks like it is intentionally a CPU-bound computation: is that what we expect? I would, instead, add a parametrized fake blocking call (Thread::sleep(configuredFakeBlockingWork)) to get some really blocking behaviour when we run things on the worker thread pool - this will make it more realistic.

It depends. I don't think that all callbacks with a blocking signature will execute code that would block the thread. But for sure, we need more scenarios that would cover all common use cases. Currently, we only call String.toLowerCase() 🤷 .

Thanks Franz!

@mkouba (Contributor, Author) commented Oct 8, 2024

FYI I've just noticed the following sentence in the javadoc of io.vertx.core.http.impl.ServerWebSocketImpl: "This class is optimised for performance when used on the same event loop. However it can be used safely from other threads.".

And also "The internal state is protected using the synchronized keyword. If always used on the same event loop, then we benefit from biased locking which makes the overhead of synchronized near zero.".

So obviously, it's not optimized for the blocking use case ;-).

@franz1981 (Contributor) commented Oct 9, 2024

@mkouba Yep, and ideally this could be improved in Vert.x 5, but there is still some low-hanging fruit in Vert.x 4 - which we can easily explore if it's worthwhile, i.e. franz1981/vert.x@3ca72f8.
If you want to try this, apply the commit to the right vertx branch and give it a shot in your benchmark.

What it is doing is fairly simple, and it is based on the analysis I've performed for https://github.com/franz1981/java-puzzles/blob/583d468a58a6ecaa5e7c7c300895392638f688dd/src/main/java/red/hat/puzzles/concurrent/LockCoarsening.java#L76-L85, which is the motivation behind the Vert.x 5 changes in this regard.

@vietj commented Oct 9, 2024

FYI: this part has been rewritten in Vert.x 5, so this analysis does not hold there.

@mkouba (Contributor, Author) commented Oct 9, 2024

If you want to try this or apply this commit to the right vertx branch you could give it a shot in your benchmark

Unfortunately, it does not seem to be an easy task to switch the vertx-core version used in Quarkus. You cannot simply change vertx.version in the BOM because it's used for other Vert.x dependencies (vertx-web, etc.), and you cannot set an explicit version in the quarkus-vertx runtime module because you get dependency convergence errors.

@cescoffier @vietj Any tip how to try this out?

@mkouba (Contributor, Author) commented Oct 9, 2024

FYI : this part in Vertx 5 has been rewritten, so this analysis does not hold for it

Hey Julien, do you have some benchmarks in Vert.x to test the performance of WebSockets server/client?

@franz1981 (Contributor) commented Oct 9, 2024

Unfortunately, it does not seem to be an easy task to switch the vertx-core version used in Quarkus.

What I would do is cherry-pick the commit onto the right vertx tag, run mvn install, and either replace the jar in the lib directory of Quarkus or hope that the local Maven repo will do the right thing (TM).

I have found another good improvement to fix the buffer copies too - which I can send to Vert.x 5 regardless.
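The local-patch workflow described above could look roughly like this (the tag name is a guess and must match the Vert.x version your Quarkus build actually uses; the commit hash is the one referenced earlier in the thread):

```shell
# Sketch: build a locally patched vertx-core and let the local Maven repo pick it up
git clone https://github.com/eclipse-vertx/vert.x
cd vert.x
git checkout 4.5.10          # hypothetical tag matching Quarkus' vertx.version
git cherry-pick 3ca72f8      # franz1981's synchronized-guard change
mvn -q -DskipTests install   # installs the patched vertx-core into ~/.m2
```

After this, rebuilding the benchmark application locally should resolve the patched jar from `~/.m2` as long as the version string is unchanged.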

@mkouba (Contributor, Author) commented Oct 9, 2024

Unfortunately, it does not seem to be an easy task to switch the vertx-core version used in Quarkus.

What I would do is cherry-pick the commit onto the right vertx tag, run mvn install, and either replace the jar in the lib directory of Quarkus or hope that the local Maven repo will do the right thing (TM).

Ah, ofc. This worked. And quick and dirty results seem to be much better, comparable to quarkus-websockets.

@franz1981 (Contributor) commented:

@mkouba OK, so this seems a painless change if @vietj and @cescoffier agree and you see benefits.
I spent some time analysing the weird synchronized behaviour of the vertx code pattern, so, sadly, I know these "workarounds" can be very effective.

@cescoffier (Member) commented:

Do you have a link to the commit to cherry-pick?

@mkouba (Contributor, Author) commented Oct 9, 2024

Do you have a link to the commit to cherry-pick?

@cescoffier franz1981/vert.x@3ca72f8

@cescoffier (Member) commented Oct 9, 2024

The commit looks good. It avoids entering synchronized blocks.

I'm not sure of the various assertions.

Let's see what @vietj says.

@mkouba (Contributor, Author) commented Oct 9, 2024

The committee looks good. It avoids entering synchronized blocks.

I'm not sure of the various assertions.

Let's see what @vietj says.

@cescoffier What committee? 😆

@franz1981 (Contributor) commented Oct 9, 2024

Yep @cescoffier - the asserts should be enabled in both the Quarkus and Vert.x Maven Surefire tests to make sure the new methods are not misused, while still not impacting the hot path at runtime (asserts are fully removed there).

@franz1981 (Contributor) commented Oct 10, 2024

I have created franz1981/vert.x@9a0f516 to fix the buffer problem seen a few comments earlier too.

@mkouba (Contributor, Author) commented Oct 15, 2024

FYI I'm working on a pull request to disable CDI request context activation for endpoint callbacks unless it's really needed, i.e. when an endpoint has a @RequestScoped dependency or is secured.
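A rough sketch of that optimization (all names hypothetical; in the real pull request the condition would be computed at build time and the context would come from ArC):

```java
import java.util.function.Supplier;

// Sketch: skip CDI request context activation when the endpoint doesn't need it.
class ContextGate {
    private final boolean needsRequestContext;
    int activations; // counts activate/terminate cycles, for illustration only

    ContextGate(boolean hasRequestScopedDependency, boolean isSecured) {
        // In the real PR this condition is derived from the endpoint's metadata
        this.needsRequestContext = hasRequestScopedDependency || isSecured;
    }

    <T> T invoke(Supplier<T> callback) {
        if (!needsRequestContext) {
            return callback.get(); // fast path: no context activation at all
        }
        activate();
        try {
            return callback.get();
        } finally {
            terminate();
        }
    }

    // Stand-ins for the container's request-context activate()/terminate() calls
    private void activate() { activations++; }
    private void terminate() { }
}
```

The point is that endpoints without @RequestScoped dependencies pay nothing for a context they never use, while secured or request-scoped endpoints keep the full activate/terminate cycle.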

Status: In Progress
No branches or pull requests
4 participants