
Connection.send() never returns with the asyncio client #1563

Open · parmentelat opened this issue Dec 8, 2024 · 4 comments

@parmentelat (Contributor)

hiya there

some background for starters:
recently, as part of an upgrade to Fedora 41, I came across release 14 of websockets, so I took that chance to upgrade my code to the new API.
it turns out this code is used in a monitoring process that acts as a ws client to push its results upstream; and I am now facing a nasty, sneaky issue with it: after some unpredictable amount of time - typically hours - the monitor stops looking at some nodes in a seemingly random way; that is to say, its focus shrinks over time, and it reports the status of fewer and fewer nodes.

admittedly there have been many moving pieces since the last time this code was working fine.
however, after a week of patiently adding debugging instructions, I have been able to narrow it down a bit,
and I have evidence that in my application it is the call to
websockets.asyncio.client.ClientConnection.send()
that essentially never returns; as in, never ever, even after hours.
specifically, I am using 14.1 and the websockets.asyncio.client flavour.

so, that's all I have as hard evidence for now.
I am still miles away from being able to reproduce this on a simple example - I can't even do it reliably in my complex app; I only know to wait for the bug to manifest itself...
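
a guard along these lines is one way to at least surface the hang - a minimal sketch, not my actual code; connection and the 60-second threshold are placeholders:

    import asyncio

    async def send_with_timeout(connection, message, timeout=60):
        # Wrap send() so that a hang shows up as an exception in the logs
        # instead of silently blocking the monitoring loop forever.
        try:
            await asyncio.wait_for(connection.send(message), timeout)
        except asyncio.TimeoutError:
            print(f"send() did not return within {timeout}s - probable hang")
            raise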

at this point I have opted to roll back to websockets.legacy instead, still with 14.1,
and so far this has been performing all right for a full day, so I'm optimistic that my issue is fixed with that older API

other than that:
as far as I can tell - but that's probably not that far - the first suspect could be this pair of lines:

    while self.fragmented_send_waiter is not None:
        await asyncio.shield(self.fragmented_send_waiter)

but of course I don't know that for sure; I reckon this is just a wild guess
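
to spell out why that pattern could hang - a toy model of my guess, not the library's actual logic; here the waiter future is simply never resolved:

    import asyncio

    async def main():
        loop = asyncio.get_running_loop()
        waiter = loop.create_future()  # stand-in for fragmented_send_waiter
        try:
            # the suspect pattern: wait (shielded against cancellation)
            # until the waiter resolves - forever, if nobody ever sets it
            await asyncio.wait_for(asyncio.shield(waiter), timeout=2)
        except asyncio.TimeoutError:
            print("waiter never resolved - send() would hang here for good")

    asyncio.run(main())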

so, that's where I am right now; I thought this was still worth reporting, so here I am

thanks for any hint or insight wrt this situation

@aaugustin (Member)

Oh, how annoying :-(

Let's rule out unlikely causes:

  1. You aren't sending fragmented messages, i.e. calling send() with an iterable or async iterable, are you? (See the sketch after this list.)
  2. You kept close_timeout at its default value or set it to a reasonable value, didn't you?
  3. You aren't sending large enough amounts of data to create a risk of clogging the connection, are you?
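
For reference, this is what the fragmented case in point 1 looks like - a minimal sketch; connection stands for your ClientConnection:

    # Fragmented message: passing an iterable (or async iterable) makes
    # send() emit a single message split across several frames, and send()
    # only returns once every fragment has been written.
    await connection.send(["chunk 1, ", "chunk 2"])

    # Regular message: a single str (or bytes) goes out as one frame and
    # never takes the fragmentation path.
    await connection.send("a complete message")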

Then, if I had to bet:

  1. async with self.send_context(): ... could be blocking on exit from the context manager. In that case, when sending a text message, the line that actually sends the frame would execute, but you'd never exit the context manager and move on to the next line.
  2. You could be losing the TCP connection in a way that websockets doesn't handle well. In order to prevent leaking TCP connections, websockets waits for the TCP connection to be lost before returning. It took a long time to iron out all potential issues in the original implementation; I wouldn't be surprised if the rewrite had some bugs: the I/O layer is brand new and more difficult to get right than the Sans-I/O layer. (See the keepalive sketch below.)
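
If the second theory is right, tightening the keepalive settings could at least make a silently dropped TCP connection fail fast instead of hanging - a sketch with a placeholder URL:

    import asyncio
    import websockets.asyncio.client

    async def main():
        # Tighter keepalive settings make a silently dropped TCP connection
        # fail within seconds instead of hanging indefinitely.
        async with websockets.asyncio.client.connect(
            "wss://example.com/ws",  # placeholder URL
            ping_interval=10,  # send a Ping every 10 seconds
            ping_timeout=5,    # fail if no Pong arrives within 5 seconds
        ) as connection:
            await connection.send("status update")

    asyncio.run(main())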

@aaugustin (Member)

Thank you for reporting the issue BTW.

@parmentelat (Contributor, Author) commented Dec 9, 2024

hiya there again

many thanks for the quick feedback :)

none of the 3 unlikely causes seems to apply to me; I pass send() a good old str object, I don't mess with the timeouts, and my chunks are pretty small, on the order of 1k chars.
between the two other suspects I could not say which is more likely in my case; it feels, however, like my issue might be somehow related to the server end being restarted in some way; I have no hard evidence of that, but it's possible
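
for what it's worth, if server restarts do turn out to be the trigger, I guess the documented reconnection pattern would be the thing for me to try - a rough sketch, with a placeholder URL:

    import asyncio
    from websockets.asyncio.client import connect
    from websockets.exceptions import ConnectionClosed

    async def monitor_loop():
        # connect() used as an async iterator reconnects automatically
        # (with exponential backoff) whenever the connection drops.
        async for connection in connect("wss://example.com/ws"):  # placeholder URL
            try:
                while True:
                    await connection.send("status update")  # push one result
                    await asyncio.sleep(10)
            except ConnectionClosed:
                continue  # reconnect and resume pushing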

I'm unsure how to go about helping further, insofar as I have already overrun my anticipated time budget on this topic several times over :)
the service based on legacy has now been running for 36 hours without a glitch, so I guess for now I'm going to stay put and leave things as they are

let me know if you make any progress on this issue; I will try to make it easy for myself to try out any future version down the line

thanks again for the insights, much appreciated!

@aaugustin (Member)

Thank you for the information you provided. I'm going to look into it. This will probably get ironed out long before the deprecation of the legacy implementation completes (5 years). I'm sorry for the rocky migration experience.
