Dropped WebSocket messages due to race condition in WebSocket frame handling #11081

darnap · 2023-12-18T17:11:30Z

Jetty version(s)
11.0.18

Java version/vendor (use: java -version)
11.0.17

OS type/version
Windows 11/Linux CentOS 8

Description
After migrating a CometD-based application from Jetty 9 to Jetty 11, some of our automated tests began failing intermittently when using WebSockets. We traced the failures to CometD messages that were lost and never delivered to the server.

Further analysis showed that the content of multiple WebSocket frames ends up packed into a single onMessage event on the application side, which is unexpected. Indeed, CometD expects multiple messages to be delivered as an array, not as back-to-back objects in a single message, so it only parses the first one found, while all subsequent ones are discarded.
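
To make the symptom concrete, here is a minimal sketch of what a consumer sees when an accumulator is not reset between two complete messages. The `Sink` class, its `accept` signature and the JSON payloads are hypothetical stand-ins for illustration, not Jetty or CometD code.

```java
import java.util.ArrayList;
import java.util.List;

class SharedAccumulatorDemo
{
    // Hypothetical stand-in for a text message sink; this is NOT Jetty's StringMessageSink.
    static class Sink
    {
        private StringBuilder out;
        final List<String> delivered = new ArrayList<>();

        // Appends the payload; on FIN, delivers the accumulated text and optionally resets.
        void accept(String payload, boolean fin, boolean resetAfterDelivery)
        {
            if (out == null)
                out = new StringBuilder();
            out.append(payload);
            if (fin)
            {
                delivered.add(out.toString());
                if (resetAfterDelivery)
                    out = null;
            }
        }
    }

    // Simulates two single-frame messages where the reset after the first one is missed.
    static List<String> simulate()
    {
        Sink sink = new Sink();
        sink.accept("[{\"id\":\"1\"}]", true, false); // reset skipped, as if the race was lost
        sink.accept("[{\"id\":\"2\"}]", true, true);
        return sink.delivered;
    }

    public static void main(String[] args)
    {
        // The second delivery contains both arrays back-to-back.
        simulate().forEach(System.out::println);
    }
}
```

In this sketch the second delivery is two JSON arrays concatenated in one string, so a parser that stops after the first array never sees the message with id 2.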

Annotated application logs are attached. They show the same Utf8StringBuilder instance in StringMessageSink being used by two separate threads even though both are delivering a FIN frame, which should cause immediate delivery to the application and a builder reset.
The logs were obtained using the distributed Jetty 11.0.18 with only the following modification to the org.eclipse.jetty.websocket.core.internal.messages.StringMessageSink class to add debugging traces:

// StringMessageSink.java
@@ -14,17 +14,22 @@
 package org.eclipse.jetty.websocket.core.internal.messages;
 
 import java.lang.invoke.MethodHandle;
+import java.util.Objects;
 
 import org.eclipse.jetty.util.Callback;
 import org.eclipse.jetty.util.Utf8StringBuilder;
 import org.eclipse.jetty.websocket.core.CoreSession;
 import org.eclipse.jetty.websocket.core.Frame;
 import org.eclipse.jetty.websocket.core.exception.MessageTooLargeException;
+import org.slf4j.Logger;
+import org.slf4j.LoggerFactory;
 
 public class StringMessageSink extends AbstractMessageSink
 {
     private Utf8StringBuilder out;
     private int size;
+    
+    private static Logger LOG = LoggerFactory.getLogger(StringMessageSink.class);
 
     public StringMessageSink(CoreSession session, MethodHandle methodHandle)
     {
@@ -46,7 +51,14 @@
             }
 
             if (out == null)
-                out = new Utf8StringBuilder(session.getInputBufferSize());
+            {
+                out = new Utf8StringBuilder(session.getInputBufferSize());                
+                LOG.info("NEW Utf8StringBuilder {} created", System.identityHashCode(out));
+            }
+            else
+            {
+                LOG.warn("Utf8StringBuilder {} reused with {} bytes for {} with FIN {}", System.identityHashCode(out), out.length(), callback, frame.isFin());
+            }
 
             out.append(frame.getPayload());
             if (frame.isFin())

How to reproduce?
Not reproducible systematically.

analysis on possibile race condition on StringMessageSink.txt

joakime · 2023-12-19T13:39:48Z

Opened PR #11084

Issue #11081 - fix race condition in WebSocket FrameHandlers

lachlan-roberts · 2023-12-20T01:08:58Z

@darnap we merged a PR which should fix this.
If you want you can build from the 11.0.x branch and confirm this fixed it for you.

sbordet · 2023-12-20T09:22:45Z

@darnap from your analysis, I can see you have a custom CometD WebSocketTransport called AppWebSocketTransport.

Can you detail if it is based on the Jetty APIs or the standard Jakarta APIs?

Also, what are the reasons for using a custom transport?
I ask because perhaps we can accommodate the features in the CometD transports, rather than you having to write your own.

darnap · 2023-12-20T10:35:51Z

@lachlan-roberts Thanks for the prompt fix. We will re-run tests with this change and see what happens. Would it be enough to cherry-pick just this change onto the 11.0.18 tag of websocket-jetty-common? This would simplify our test deployment.

Just to understand how the fix is supposed to work: what prevents the same race from occurring since onTextFrame() simply resets the activeMessageSink to the same textSink instance if it's null?

@sbordet The custom transport is based on the Jetty APIs. We needed to override the transport and Endpoint in order to:

Collect client timing statistics and implement keep-alive logic (using ping-pong frames) to avoid timeouts on strict proxies without relying on client-side logic.
Collect data transfer statistics: actual bytes on the wire and time from enqueuing to sending, to detect client/network congestion.
Override the serialization process for messages and replies. Several of our messages already contain a UTF-8 representation, so we avoid the overhead of converting it back to a String only to have it serialized again into UTF-8. We use a JsonGenerator to write the raw JSON to a ByteBuffer, then send it as a TEXT frame directly.
Use pooling for the ByteBuffers used in serialization.
Prioritize Reply messages to avoid cases in which we had client-side timeouts due to large message payloads.
Apply back-pressure on the server side when we detect that the client is congested, using writeComplete to keep track of messages waiting to be sent. (We find this is more efficient than the ACK extension, as it avoids an extra RTT and we keep fewer pending messages in memory).
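
The last item, back-pressure driven by write completion, could look roughly like the sketch below. This is an assumption about the general technique, not the AppWebSocketTransport code, which is not shown in this thread. `BackPressureSketch`, `send`, `writeComplete`, `isCongested` and the `wire` list standing in for the network are all hypothetical names.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.List;
import java.util.Queue;

class BackPressureSketch
{
    private final int maxInFlight;
    private final Queue<String> waiting = new ArrayDeque<>();
    final List<String> wire = new ArrayList<>(); // stands in for the network
    private int inFlight;

    BackPressureSketch(int maxInFlight)
    {
        this.maxInFlight = maxInFlight;
    }

    // Writes immediately while under the in-flight limit; otherwise queues the message.
    // Returns true if the message was written, false if it was queued.
    synchronized boolean send(String message)
    {
        if (inFlight < maxInFlight)
        {
            inFlight++;
            wire.add(message);
            return true;
        }
        waiting.add(message);
        return false;
    }

    // To be called when a previous write has completed: frees a slot and drains one queued message.
    synchronized void writeComplete()
    {
        if (inFlight > 0)
            inFlight--;
        String next = waiting.poll();
        if (next != null)
        {
            inFlight++;
            wire.add(next);
        }
    }

    // The client is considered congested while messages are waiting for a free slot.
    synchronized boolean isCongested()
    {
        return !waiting.isEmpty();
    }
}
```

A producer could check `isCongested()` before generating more messages, which keeps the number of pending messages bounded without an extra acknowledgement round trip.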

If there's any better way to accomplish these goals we'd gladly avoid using a custom transport.

sbordet · 2023-12-20T10:55:25Z

> Would it be enough to cherry-pick just this change onto the 11.0.18 tag of websocket-jetty-common?

Yes it would be enough.

> what prevents the same race from occurring since onTextFrame() simply resets the activeMessageSink to the same textSink instance if it's null?

Good point.

@lachlan-roberts I think the problem is the finally block, not the nulling of the activeMessageSink.

Thread T1 in StringMessageSink.accept() does callback.succeeded() + session.demand(1), which means another thread T2 can enter StringMessageSink.accept() before T1 executes the finally block.

I'm surprised there are no NPEs!

ByteBufferMessageSink and ByteArrayMessageSink have the same (wrong) finally pattern.

The demand should be after the nulling of the out field.
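
The ordering problem can be sketched deterministically, without real threads, by letting a hypothetical `demand()` hand over the next frame immediately, which is the earliest moment another thread could enter the sink. The `Sink` class below is an illustrative stand-in and not Jetty's code. Every queued payload is assumed to be a complete single-frame (FIN) message.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

class DemandOrderingSketch
{
    // Hypothetical sink, NOT Jetty's StringMessageSink.
    static class Sink
    {
        private final Deque<String> frames = new ArrayDeque<>();
        final List<String> delivered = new ArrayList<>();
        private final boolean resetBeforeDemand;
        private StringBuilder out;

        Sink(boolean resetBeforeDemand, List<String> payloads)
        {
            this.resetBeforeDemand = resetBeforeDemand;
            frames.addAll(payloads);
        }

        // Stand-in for demanding one more frame: the next frame arrives right away.
        void demand()
        {
            String next = frames.poll();
            if (next != null)
                accept(next);
        }

        void accept(String payload)
        {
            if (out == null)
                out = new StringBuilder();
            out.append(payload);
            delivered.add(out.toString());
            if (resetBeforeDemand)
            {
                // Reset first, then demand: the next frame starts from a fresh accumulator.
                out = null;
                demand();
            }
            else
            {
                // Demand first, reset in finally: the next frame sees the stale accumulator.
                try
                {
                    demand();
                }
                finally
                {
                    out = null;
                }
            }
        }
    }

    static List<String> run(boolean resetBeforeDemand)
    {
        Sink sink = new Sink(resetBeforeDemand, List.of("[A]", "[B]"));
        sink.demand();
        return sink.delivered;
    }

    public static void main(String[] args)
    {
        System.out.println("reset in finally:    " + run(false));
        System.out.println("reset before demand: " + run(true));
    }
}
```

With the reset in the finally block, the second delivery carries the first payload as well. With the reset before the demand, each message is delivered on its own.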

…bSocket frame handling. Now the reset of the MessageSink internal accumulators happens before the demand. This avoids the race, since as soon as there is demand another thread could enter the MessageSink, but the accumulator has already been reset. Signed-off-by: Simone Bordet <[email protected]>

sbordet · 2023-12-20T11:35:00Z

@darnap would you be able to try #11090?

darnap · 2023-12-20T11:38:56Z

@sbordet I will, thanks.

darnap · 2023-12-21T16:35:41Z

@sbordet The issue did not occur again in the last run of tests after building with this change in place. Hopefully it's solved, thanks.

Issue #11081 - fix race condition in WebSocket FrameHandlers (jetty-12)

darnap added the Bug For general bugs on Jetty side label Dec 18, 2023

joakime assigned lachlan-roberts Dec 18, 2023

joakime added this to Jetty 10.0.20 / 11.0.20 FROZEN (Post EOCD Release) Dec 18, 2023

joakime modified the milestones: 10.0.x, 11.0.x Dec 18, 2023

lachlan-roberts added a commit that referenced this issue Dec 19, 2023

Issue #11081 - fix race condition for WebSocket FrameHandlers

7b3029d

Signed-off-by: Lachlan Roberts <[email protected]>

lachlan-roberts added a commit that referenced this issue Dec 19, 2023

Issue #11081 - fix race condition in WebSocket FrameHandlers

e69f1d2

Signed-off-by: Lachlan Roberts <[email protected]>

joakime linked a pull request Dec 19, 2023 that will close this issue

Issue #11081 - fix race condition in WebSocket FrameHandlers #11084

Merged

lachlan-roberts closed this as completed in #11084 Dec 20, 2023

github-project-automation bot moved this to ✅ Done in Jetty 10.0.20 / 11.0.20 FROZEN (Post EOCD Release) Dec 20, 2023

lachlan-roberts added a commit that referenced this issue Dec 20, 2023

Merge pull request #11084 from jetty/jetty-10.0.x-11081-websocketRace

1fb3f31

Issue #11081 - fix race condition in WebSocket FrameHandlers

sbordet reopened this Dec 20, 2023

sbordet mentioned this issue Dec 20, 2023

Issue #11081 - fix race condition in WebSocket FrameHandlers (jetty-12) #11086

Merged

sbordet linked a pull request Dec 20, 2023 that will close this issue

Fixes #11081 - Dropped WebSocket messages due to race condition in WebSocket frame handling. #11090

Merged

sbordet mentioned this issue Dec 20, 2023

WebSocket transport improvements cometd/cometd#1585

Open

sbordet added a commit that referenced this issue Dec 29, 2023

Merge pull request #11086 from jetty/jetty-12.0.x-11081-websocketRace

2505d0e

Issue #11081 - fix race condition in WebSocket FrameHandlers (jetty-12)

sbordet closed this as completed in #11090 Jan 2, 2024
