
WebSockets Next: performance improvements #39148

Open
mkouba opened this issue Mar 4, 2024 · 31 comments
@mkouba (Contributor) commented Mar 4, 2024

Description

Follow-up of #39142.

Implementation ideas

@mkouba (Contributor, Author) commented Apr 23, 2024

#40183 is related to this issue.

@mkouba (Contributor, Author) commented Jun 25, 2024

For this one, we'll need some benchmarks (executed automatically). Ideally, we'd compare WebSockets Next with quarkus-websockets and pure Vert.x.

CC @franz1981

@franz1981 (Contributor) commented:

I'm not aware of websocket benchmark suites sadly...

@mkouba (Contributor, Author) commented Jun 25, 2024

I'm not aware of websocket benchmark suites sadly...

Me neither. We'll need to come up with something... ;-)

@franz1981 (Contributor) commented:

Let's talk this week in a call and we can think about something

@mkouba (Contributor, Author) commented Jun 27, 2024

NOTE: This extension contains a bunch of lambdas. We should consider rewriting those lambdas to anonymous classes.
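For illustration, the rewrite suggested in the note above looks roughly like this (a generic handler sketch with hypothetical names, not the extension's actual code):

```java
import java.util.function.Consumer;

class Handlers {
    // Lambda version: translated to an invokedynamic call site; the
    // implementing class is generated at runtime on first use
    static Consumer<StringBuilder> lambdaHandler() {
        return sb -> sb.append("handled");
    }

    // Anonymous-class version: the class exists at build time, which can make
    // startup behaviour and ahead-of-time/native-image analysis more predictable
    static Consumer<StringBuilder> anonymousHandler() {
        return new Consumer<StringBuilder>() {
            @Override
            public void accept(StringBuilder sb) {
                sb.append("handled");
            }
        };
    }
}
```

Both forms behave identically at the call site; the trade-off is purely in how and when the implementing class is materialized.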

@cescoffier (Member) commented:

We need to think about scenarios to test the performances. Response time is not a meaningful metric. The number of messages and connections are more sensitive in this case. (Of course, memory is important too).

@franz1981 (Contributor) commented Oct 2, 2024

Yeah, although achieving 10K msg/sec with a single outlier at 10 seconds of latency is something you'd want to know about.

I have contacted the Jetty team (Simone Bordet), and they have rolled out a coordinated-omission-free, distributed (if required) load generator for WebSocket; let's see what we can do in Hyperfoil or by reusing theirs, which is built for this exact purpose.

@cescoffier (Member) commented:

Oh, that would be good!

@franz1981 (Contributor) commented Oct 2, 2024

And clearly it does not cover WebSocket, see https://github.com/jetty-project/jetty-load-generator :'(

Which means that we should prioritize supporting websockets for Hyperfoil or find a different benchmarking tool (which is coordinated-omission free - not easy)
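For readers unfamiliar with coordinated omission: a closed-loop load generator that only sends the next request after the previous response arrives silently drops the latency of requests that *would* have been issued during a stall. A toy simulation (all numbers invented for illustration) shows the effect:

```java
// Toy model of coordinated omission: one 1000 ms server stall among 1 ms
// requests, with an intended send interval of 10 ms.
class CoordinatedOmission {
    static final long[] SERVICE_MS = {1, 1, 1, 1000, 1, 1, 1};
    static final long INTERVAL_MS = 10;

    // Closed-loop measurement: latency == service time, so only the stalled
    // request itself looks slow
    static int slowSamplesClosedLoop() {
        int n = 0;
        for (long s : SERVICE_MS) if (s > 100) n++;
        return n;
    }

    // Corrected measurement: latency is counted from the *intended* send time,
    // so requests queued behind the stall also record their waiting time
    static int slowSamplesCorrected() {
        long clock = 0;
        int n = 0;
        for (int i = 0; i < SERVICE_MS.length; i++) {
            long intendedStart = i * INTERVAL_MS;
            long start = Math.max(clock, intendedStart);
            long end = start + SERVICE_MS[i];
            if (end - intendedStart > 100) n++;
            clock = end;
        }
        return n;
    }
}
```

With these numbers the closed-loop view reports 1 slow sample while the corrected view reports 4, which is why a coordinated-omission-free generator matters for latency percentiles.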

@mkouba (Contributor, Author) commented Oct 2, 2024

Which means that we should prioritize supporting websockets for Hyperfoil or find a different benchmarking tool (which is coordinated-omission free - not easy)

For load tests where we don't care about coordinated omission and throughput, we could try the Gatling WebSocket protocol or even a simple Vert.x client.

@cescoffier (Member) commented:

I used Gatling in the past.

@mkouba (Contributor, Author) commented Oct 7, 2024

So I started to play with a simple Vert.x client, Gatling, etc. here: https://github.com/mkouba/ws-next-perf.

And it seems that at moderate load the performance of quarkus-websockets and quarkus-websockets-next is more or less the same. However, under heavy load (in my test, 10,000 concurrent users each sending 1,000 and receiving 1,000 messages) the performance of quarkus-websockets degrades significantly. I did some CPU profiling and didn't find anything obvious in the WS next code.

Apparently the biggest problem is switching to a worker thread because the tested @OnTextMessage callback has a blocking signature. If we switch to Uni<String> (i.e. callback executed on the event loop) then the performance is significantly better but still not better than the legacy extension. However, the blocking signature is probably what most users will use anyway...

@franz1981 Could you pls take a look at the attached flamegraph?

cpu-profile.html.zip

@franz1981 (Contributor) commented Oct 7, 2024

I see there's a fun problem with synchronizers, which can hurt scalability and performance pretty badly (and RSS, because "inflated" monitors increase RSS on HotSpot), i.e. io/vertx/core/http/impl/ServerWebSocketImpl.tryHandshake (and io/vertx/core/net/impl/ConnectionBase.queueForWrite as well), which protects the handshake via a synchronized guard.
You can confirm this by collecting profiling data using -e lock -t (add t as well to see which threads are competing to enter the lock).
The suggestion here is to have fewer worker threads competing with each other, but they will likely still compete with the I/O threads (or not - we need the profiling via -e lock -t).
I believe the check performed in io/vertx/core/http/impl/ServerWebSocketImpl.tryHandshake can be improved via some volatile guard to avoid the lock - and we will fly.
I can create a specific Vert.x microbenchmark for this in Vert.x itself.
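The volatile-guard idea is essentially a check-then-lock fast path; a minimal sketch (a hypothetical `HandshakeGuard` class, not Vert.x's actual code) could look like:

```java
// Sketch: once the handshake has completed, later callers read a volatile
// flag and never touch the monitor, so steady-state traffic is lock-free.
class HandshakeGuard {
    private volatile boolean handshakeDone;
    private int handshakeCount; // guarded by the monitor, for illustration

    void ensureHandshake() {
        if (handshakeDone) {
            return; // fast path: no monitor acquired after the first handshake
        }
        synchronized (this) {
            if (!handshakeDone) {      // re-check under the lock
                handshakeCount++;      // stand-in for the real handshake work
                handshakeDone = true;  // publish: subsequent callers skip the lock
            }
        }
    }

    int handshakeCount() {
        synchronized (this) {
            return handshakeCount;
        }
    }
}
```

The volatile read keeps the happy path cheap while the double-checked block still guarantees the handshake runs exactly once.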

important note:
I'm at devoxx and I didn't yet look at the bench itself but 2 things after looking at the data:

  1. collect profiling data after application warm-up completes: I can see C2 frames, meaning that compilation is still going on (after ~10K or fewer invocations things will smooth out)
  2. it looks like it is intentionally a CPU-bound computation: is that what we expect? I would, instead, add a parametrized fake blocking call (Thread::sleep(configuredFakeBlockingWork)) to get some really blocking behaviour when we run things on the worker thread pool - this will make it more realistic.

The last point is key to understanding that if users make use of the worker thread pool they are supposed to perform blocking operations (in the form of 10/100 ms of work each), which guarantees two effects:

  1. less contention over the synchronized parts - likely
  2. less "oversubscription" of worker threads vs the available cores, i.e. the worker thread pool has Math::min(cores * 8, 200) threads, which means they need to interleave to make progress... if you make them CPU-bound performing little non-blocking work, this can stress a LOT of other OS mechanisms which won't be stressed in the real world

As usual, I love that you're so proactive and quick to react, @mkouba - thanks again for taking both the time and effort for the test + collecting data: this will make it so much easier for me to help! ❤

@franz1981 (Contributor) commented:

In addition, this is another low-hanging fruit I can help with:

[Image: screenshot of the WebSocketImplBase code linked below]

https://github.com/eclipse-vertx/vert.x/blob/916ae9911dbb2a8cf818eee6b5390f62f37fce00/vertx-core/src/main/java/io/vertx/core/http/impl/WebSocketImplBase.java#L478-L493

But I need to check more carefully: I'm adding this note for my future self.

@mkouba (Contributor, Author) commented Oct 7, 2024

I see there's a fun problem with synchronizers, which can impact pretty bad scalability and perf (and RSS because "inflated" monitors increase RSS on Hotspot) i.e. io/vertx/core/http/impl/ServerWebSocketImpl.tryHandshake (and io/vertx/core/net/impl/ConnectionBase.queueForWrite as well) which is protecting the handshake via a synchronized guard.

Yes, I noticed this part as well.

I can create a specific vertx microbenchmark for this in vertx itself

That would be great.

  1. collect profiling data after application warm-up completes: I can see C2 frames, meaning that compilation is still going on (after ~10K or fewer invocations things will smooth out)

cpu-profile.html_02.zip

  2. it looks like it is intentionally a CPU-bound computation: is that what we expect? I would, instead, add a parametrized fake blocking call (Thread::sleep(configuredFakeBlockingWork)) to get some really blocking behaviour when we run things on the worker thread pool - this will make it more realistic.

It depends. I don't think that all callbacks with a blocking signature will execute code that would block the thread. But for sure, we need more scenarios that would cover all common use cases. Currently, we only call String.toLowerCase() 🤷 .

Thanks Franz!

@mkouba (Contributor, Author) commented Oct 8, 2024

FYI I've just noticed the following sentence in the javadoc of io.vertx.core.http.impl.ServerWebSocketImpl: "This class is optimised for performance when used on the same event loop. However it can be used safely from other threads.".

And also "The internal state is protected using the synchronized keyword. If always used on the same event loop, then we benefit from biased locking which makes the overhead of synchronized near zero.".

So obviously, it's not optimized for the blocking use case ;-).

@franz1981 (Contributor) commented Oct 9, 2024

@mkouba Yep, and ideally this could be improved in Vert.x 5, but there is still some low-hanging fruit in Vert.x 4 - which we can easily explore if it's worthwhile, i.e. franz1981/vert.x@3ca72f8.
If you want to try this, apply the commit to the right vertx branch and give it a shot in your benchmark.

What it is doing is fairly simple, and it is based on the analysis I've performed for https://github.com/franz1981/java-puzzles/blob/583d468a58a6ecaa5e7c7c300895392638f688dd/src/main/java/red/hat/puzzles/concurrent/LockCoarsening.java#L76-L85, which is the motivation behind the Vert.x 5 changes in this regard.

@vietj commented Oct 9, 2024

FYI: this part has been rewritten in Vert.x 5, so this analysis does not hold there.

@mkouba (Contributor, Author) commented Oct 9, 2024

If you want to try this or apply this commit to the right vertx branch you could give it a shot in your benchmark

Unfortunately, it does not seem to be an easy task to switch the vertx-core version used in Quarkus. You cannot simply change vertx.version in the BOM because it's used for other Vert.x dependencies (vertx-web, etc.), and you cannot set an explicit version in the quarkus-vertx runtime module because you get dependency convergence errors.

@cescoffier @vietj Any tip how to try this out?

@mkouba (Contributor, Author) commented Oct 9, 2024

FYI : this part in Vertx 5 has been rewritten, so this analysis does not hold for it

Hey Julien, do you have some benchmarks in Vert.x to test the performance of WebSockets server/client?

@franz1981 (Contributor) commented Oct 9, 2024

Unfortunately, it does not seem to be an easy task to switch the vertx-core version used in Quarkus.

What I would do is cherry-pick the commit onto the right vertx tag, run mvn install, and either replace the jar in the lib directory of Quarkus or hope that the local Maven repo will do the right thing (TM).

I have found another good improvement to fix the buffer copies too - which I can send to Vert.x 5 regardless.
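The local-patch workflow described above could look roughly like this (the tag name is a guess and must match the Vert.x version your Quarkus build actually uses; the commit hash is the one referenced earlier in the thread):

```shell
# Sketch: build a locally patched vertx-core and let the local Maven repo pick it up
git clone https://github.com/eclipse-vertx/vert.x
cd vert.x
git checkout 4.5.10          # hypothetical tag matching Quarkus' vertx.version
git cherry-pick 3ca72f8      # franz1981's synchronized-guard change
mvn -q -DskipTests install   # installs the patched vertx-core into ~/.m2
```

After this, rebuilding the benchmark application locally should resolve the patched jar from `~/.m2` as long as the version string is unchanged.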

@mkouba (Contributor, Author) commented Oct 9, 2024

Unfortunately, it does not seem to be an easy task to switch the vertx-core version used in Quarkus.

What I would do is cherry-pick the commit onto the right vertx tag, run mvn install, and either replace the jar in the lib directory of Quarkus or hope that the local Maven repo will do the right thing (TM).

Ah, ofc. This worked. And quick and dirty results seem to be much better, comparable to quarkus-websockets.

@franz1981 (Contributor) commented:

@mkouba OK, so this seems a painless change if @vietj and @cescoffier agree and you see benefits.
I spent some time analysing the weird synchronized behaviour of the vertx code pattern, so, sadly, I know these "workarounds" can be very effective.

@cescoffier (Member) commented:

Do you have a link to the commit to cherry-pick?

@mkouba (Contributor, Author) commented Oct 9, 2024

Do you have a link to the commit to cherry-pick?

@cescoffier franz1981/vert.x@3ca72f8

@cescoffier (Member) commented Oct 9, 2024

The commit looks good. It avoids entering synchronized blocks.

I'm not sure of the various assertions.

Let's see what @vietj says.

@mkouba (Contributor, Author) commented Oct 9, 2024

The committee looks good. It avoids entering synchronized blocks.

I'm not sure of the various assertions.

Let's see what @vietj says.

@cescoffier What committee? 😆

@franz1981 (Contributor) commented Oct 9, 2024

Yep @cescoffier - the asserts should be enabled in both the Quarkus and Vert.x Maven Surefire tests to make sure the new methods are not misused, while still not impacting the hot path at runtime (asserts are fully removed there).

@franz1981 (Contributor) commented Oct 10, 2024

I have created franz1981/vert.x@9a0f516 to fix the buffer problem seen a few comments earlier too.

@mkouba (Contributor, Author) commented Oct 15, 2024

FYI I'm working on a pull request to disable CDI request context activation for endpoint callbacks unless it's really needed, i.e. when an endpoint has a @RequestScoped dependency or is secured.
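A rough sketch of that optimization (all names hypothetical; in the real pull request the condition would be computed at build time and the context would come from ArC):

```java
import java.util.function.Supplier;

// Sketch: skip CDI request context activation when the endpoint doesn't need it.
class ContextGate {
    private final boolean needsRequestContext;
    int activations; // counts activate/terminate cycles, for illustration only

    ContextGate(boolean hasRequestScopedDependency, boolean isSecured) {
        // In the real PR this condition is derived from the endpoint's metadata
        this.needsRequestContext = hasRequestScopedDependency || isSecured;
    }

    <T> T invoke(Supplier<T> callback) {
        if (!needsRequestContext) {
            return callback.get(); // fast path: no context activation at all
        }
        activate();
        try {
            return callback.get();
        } finally {
            terminate();
        }
    }

    // Stand-ins for the container's request-context activate()/terminate() calls
    private void activate() { activations++; }
    private void terminate() { }
}
```

The point is that endpoints without @RequestScoped dependencies pay nothing for a context they never use, while secured or request-scoped endpoints keep the full activate/terminate cycle.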

Status: In Progress
No branches or pull requests
4 participants