Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Dropped WebSocket messages due to race condition in WebSocket frame handling #11081

Closed
darnap opened this issue Dec 18, 2023 · 8 comments · Fixed by #11084 or #11090
Closed

Dropped WebSocket messages due to race condition in WebSocket frame handling #11081

darnap opened this issue Dec 18, 2023 · 8 comments · Fixed by #11084 or #11090
Assignees
Labels
Bug For general bugs on Jetty side
Milestone

Comments

@darnap
Copy link

darnap commented Dec 18, 2023

Jetty version(s)
11.0.18

Java version/vendor (use: java -version)
11.0.17

OS type/version
Windows 11/Linux CentOS 8

Description
After migrating a CometD-based application from Jetty 9 to Jetty 11, we started finding that some automated tests started randomly failing when using WebSockets. We traced the issue to some CometD messages being lost/never delivered to the server.

Further analysis showed that the content of multiple WebSocket frames ends up packed into a single onMessage event on the application side, which is unexpected. Indeed, CometD expects multiple messages to be delivered as an array, not as back-to-back objects in a single message, so it only parses the first one found, while all subsequent ones are discarded.

Annotated application logs are attached to show that the same Utf8StringBuilder instance in StringMessageSink is used by 2 separate threads even though both are delivering a FIN frame, which should case immediate delivery to the application and a builder reset.
The logs were obtained using the distributed Jetty 11.0.18 with only the following modification to the org.eclipse.jetty.websocket.core.internal.messages.StringMessageSink class to add debugging traces:

// StringMessageSink.java
@@ -14,17 +14,22 @@
 package org.eclipse.jetty.websocket.core.internal.messages;
 
 import java.lang.invoke.MethodHandle;
+import java.util.Objects;
 
 import org.eclipse.jetty.util.Callback;
 import org.eclipse.jetty.util.Utf8StringBuilder;
 import org.eclipse.jetty.websocket.core.CoreSession;
 import org.eclipse.jetty.websocket.core.Frame;
 import org.eclipse.jetty.websocket.core.exception.MessageTooLargeException;
+import org.slf4j.Logger;
+import org.slf4j.LoggerFactory;
 
 public class StringMessageSink extends AbstractMessageSink
 {
     private Utf8StringBuilder out;
     private int size;
+    
+    private static Logger LOG = LoggerFactory.getLogger(StringMessageSink.class);
 
     public StringMessageSink(CoreSession session, MethodHandle methodHandle)
     {
@@ -46,7 +51,14 @@
             }
 
             if (out == null)
-                out = new Utf8StringBuilder(session.getInputBufferSize());
+            {
+                out = new Utf8StringBuilder(session.getInputBufferSize());                
+                LOG.info("NEW Utf8StringBuilder {} created", System.identityHashCode(out));
+            }
+            else
+            {
+                LOG.warn("Utf8StringBuilder {} reused with {} bytes for {} with FIN {}", System.identityHashCode(out), out.length(), callback, frame.isFin());
+            }
 
             out.append(frame.getPayload());
             if (frame.isFin())

How to reproduce?
Not reproducible systematically.

analysis on possibile race condition on StringMessageSink.txt

@darnap darnap added the Bug For general bugs on Jetty side label Dec 18, 2023
@joakime joakime modified the milestones: 10.0.x, 11.0.x Dec 18, 2023
lachlan-roberts added a commit that referenced this issue Dec 19, 2023
lachlan-roberts added a commit that referenced this issue Dec 19, 2023
@joakime
Copy link
Contributor

joakime commented Dec 19, 2023

Opened PR #11084

@lachlan-roberts
Copy link
Contributor

@darnap we merged a PR which should fix this.
If you want you can build from the 11.0.x branch and confirm this fixed it for you.

@sbordet
Copy link
Contributor

sbordet commented Dec 20, 2023

@darnap from your analysis, I can see you have a custom CometD WebSocketTransport called AppWebSocketTransport.

Can you detail if it is based on the Jetty APIs or the standard Jakarta APIs?

Also, what are the reasons for using a custom transport?
I ask because perhaps we can accommodate the features in the CometD transports, rather than you having to write your own.

@darnap
Copy link
Author

darnap commented Dec 20, 2023

@lachlan-roberts Thanks for the prompt fix. We will re-run tests with this change and see what happens. Would it be enough to cherry-pick just this change onto the 11.0.18 tag of websocket-jetty-common? This would simplify our test deployment. Just to understand how the fix is supposed to work: what prevents the same race from occurring since onTextFrame() simply resets the activeMessageSink to the same textSink instance if it's null?

@sbordet The custom transport is based on the Jetty APIs. We needed to override the transport and Endpoint in order to:

  • Collect client timing statistics and implement keep-alive logic (using ping-pong frames) to avoid timeouts on strict proxies without relying on client-side logic.
  • Collect data transfer statistics: actual bytes on the wire and time from enqueuing to sending, to detect client/network congestion.
  • Override the serialization process for messages and replies. We have several messages that already contain an UTF8 representation so we avoid the overhead of converting it back to String only to have it serialized again into UTF8. We use a JsonGenerator to write the raw JSON to a ByteBuffer, then send it as a TEXT frame directly.
  • Use pooling for ByteBuffers used in serialization
  • Prioritize Reply messages to avoid cases in which we had client-side timeouts due to large message payloads.
  • Apply back-pressure on the server side when we detect that the client is congested, using writeComplete to keep track of messages waiting to be sent. (We find this is more efficient than the ACK extension, as it avoids an extra RTT and we keep fewer pending messages in memory).

If there's any better way to accomplish these goals we'd gladly avoid using a custom transport.

@sbordet
Copy link
Contributor

sbordet commented Dec 20, 2023

Would it be enough to cherry-pick just this change onto the 11.0.18 tag of websocket-jetty-common?

Yes it would be enough.

what prevents the same race from occurring since onTextFrame() simply resets the activeMessageSink to the same textSink instance if it's null?

Good point.

@lachlan-roberts I think the problem is the finally block, not the nulling of the activeMessageSink.

Thread T1 in StringMessageSink.accept() does callback.succeded() + session.demand(1), which means another thread T2 can enter StringMessageSink.accept() before T1 executes the finally block.

I'm surprised there are no NPEs!

ByteBufferMessageSink and ByteArrayMessageSink have the same (wrong) finally pattern.

The demand should be after the nulling of the out field.

@sbordet sbordet reopened this Dec 20, 2023
sbordet added a commit that referenced this issue Dec 20, 2023
…bSocket frame handling.

Now the reset of the MessageSink internal accumulators happens before the demand.

This avoids the race, since as soon as there is demand another thread could enter the MessageSink, but the accumulator has already been reset.

Signed-off-by: Simone Bordet <[email protected]>
@sbordet
Copy link
Contributor

sbordet commented Dec 20, 2023

@darnap would you be able to try #11090?

@darnap
Copy link
Author

darnap commented Dec 20, 2023

@sbordet I will, thanks.

@darnap
Copy link
Author

darnap commented Dec 21, 2023

@sbordet The issue did not occur again in the last run of tests after building with this change in place. Hopefully it's solved, thanks.

sbordet added a commit that referenced this issue Dec 29, 2023
Issue #11081 - fix race condition in WebSocket FrameHandlers (jetty-12)
sbordet added a commit that referenced this issue Jan 2, 2024
…bSocket frame handling. (#11090)

Now the reset of the MessageSink internal accumulators happens before the demand.

This avoids the race, since as soon as there is demand another thread could enter the MessageSink, but the accumulator has already been reset.

Signed-off-by: Simone Bordet <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Bug For general bugs on Jetty side
Projects
No open projects
4 participants