RedisChannelLayer.receive() drops events due to asyncio.shield() in redis-py 4.5.4 #366
Comments
Thanks @carltongibson. I think this issue is different, but I'm not sure I understand #348. Reading that did make me wonder if Python 3.11.3 might help, since it sounds like it has an update related to |
@carltongibson as far as I can tell, #348 is resolved, because when I un- However, in this issue, the test I included still fails with Python 3.11.3 and redis-py 4.5.4, which makes me think that this issue is different from #348. (I'm still not sure I understand #348 though.) |
Hey @dcki. You could open a PR removing those xfails if you like... 🙂 -- it would be good if that has been resolved. (You'd also get the attention of the folks on that issue commenting there.) I need to have a play to be able to say more, but please do keep digging.
Unrelated to your last message about xfails: Looking at the last couple comments in this redis-py issue, apparently there is at least one person that claims to know what they're talking about, that says |
Even more confirmation about
That was a week ago, April 23rd, and that PR looks pretty active, so it seems like a fix for this issue here in One action that might be good to take now would be to just put in my test, above, with |
@dcki It would certainly be worth a PR, so that it's easy to play with. 👍 |
Since version 4 of channels_redis, the channel layer is not empty after the first read, so we read it again to empty it. Maybe it is related to django/channels_redis#348 or to django/channels_redis#366.
redis-py 4.5.5 was released today and both tests above pass when using that version of redis-py. 🎉 |
That's great @dcki. Were there any tests you still wanted to add to the test suite here? |
#377 added explicit testing against multiple redis-py versions:
Any of those should work. Please open a new issue if you hit problems with one of those versions. |
I'm attempting to upgrade a django project but running into some problems. An automated test that uses channels_redis and websockets is failing, and it seems to indicate a real bug. I think this bug is pretty serious and could result in a lot of dropped events.
I think channels_redis 4.1.0 and other versions are not compatible with the Redis client redis-py 4.5.4. I think the problem does not occur with redis 4.5.3. (However, I encounter a different problem with 4.5.3 as well...)
This issue report is not as concise and complete as I would like. (In particular I would like to be able to say more about what I observe with earlier redis client versions, particularly redis-py 4.5.3.) But I think it's important to get this conversation started soon, so I'm submitting it now anyway.
Reproducing this issue
After days of digging in code that I'm not familiar with, I've found a convincing explanation for this change in behavior and written a failing test in a fork of the channels_redis repo. (I run this test directly on MacOS 11.7 with Python 3.10.5 or 3.11.3) Here is a copy of that test:
If I run this test with redis-py 4.5.4 then it fails near the end waiting for receive(), unless I add a 5 second sleep before group_send() or comment out asyncio.shield() in redis.asyncio.client.Redis.execute_command() (just like the test in my project below).
To run this test, I do the following:
main git branch (as it was April 8th, ba6dfcd633be7d505f739cc3c294853222350e48) in the channels_redis git repo.
tests/test_core.py
python -m pip install -e .[tests]
pytest tests/test_core.py::test_groups
How I encountered this issue
The automated test in the django project I'm working on has a name like test_when_user1_clicks_button_in_browser_then_user2_can_immediately_see_update_in_second_browser, and basically does this:
RedisChannelLayer.group_add() once. When it disconnects, group_discard() is called once (passing the same group name and channel name that were passed to group_add()).
python3 manage.py test my_tests.MyTests.test_when_user1_clicks_button_in_browser_then_user2_can_immediately_see_update_in_second_browser. The test suite is set up to log in during suite setup and refresh before each test to have a clean page state. But this extra refresh is part of what causes the bug to repro.)
RedisChannelLayer.group_send().
This test passes with Python 3.10.5 on Debian 11 Bullseye (inside a Docker container running on MacOS 11.7 Big Sur with Docker Desktop Engine 20.10.23) with the following dependencies:
This test fails on step 7 with the following dependencies:
The event never makes it to the other window via the websocket:
Also, when the test fails:
redis.asyncio.client.Redis.execute_command().
CancelledError to be caught, logged by my own temporary code, and re-raised by my own temporary code. CancelledError is expected when a websocket disconnects, so that's not surprising.
asyncio.shield() in redis.asyncio.client.Redis.execute_command().
What changed
Here is what I think caused this change in behavior:
asyncio.shield() around the low level call to send the BZPOPMIN command to the Redis server.
channels_redis executes BZPOPMIN in RedisChannelLayer._brpop_with_clean().
BZPOPMIN: It blocks until it pops something or the specified time interval has elapsed. channels_redis specifies 5 seconds by default.
channels_redis assumes that it can abort a BZPOPMIN command using task.cancel(), but due to the asyncio.shield() that is now in the redis client, I think what happens is that task.cancel() cancels the code that handles a successful BZPOPMIN, and then the Redis server continues running the BZPOPMIN command. If after the task is cancelled the stray command returns a non-empty result, then that result gets dropped, and is never delivered.
Here is why I think that:
BZPOPMIN call, but it doesn't add a backup to redis until after BZPOPMIN returns, which means that in the case of a task that is canceled while BZPOPMIN is in progress, the event doesn't get backed up.
time.sleep(5) in my tests before group_send(), the test passes, which seems to indicate that a stray BZPOPMIN consumes the event when there is no sleep, and times out and does not consume the event when there is a sleep.
await asyncio.shield(...) with await ... in the redis client, the test passes.
channels_redis to import an old version of aioredis instead of redis.asyncio.client, and add shield() in aioredis, then the test fails.
BZPOPMIN), the test passes.
redis-cli monitor, then what I see seems to allow the possibility that BZPOPMIN and shield() is the problem, and doesn't suggest any other explanations as far as I've been able to tell.
How to fix: I don't know yet
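The mechanism suspected under "What changed" can be reproduced in miniature with plain asyncio and no Redis at all. The sketch below is only an illustration under stated assumptions: blocking_pop(), receive(), scenario() and POP_TIMEOUT are hypothetical stand-ins (an asyncio.Queue plays the role of the Redis sorted set, and a 0.2 second timeout plays the role of the 5 second BZPOPMIN timeout). It is not the code of redis-py or channels_redis.

```python
import asyncio

POP_TIMEOUT = 0.2  # hypothetical stand-in for the 5 second BZPOPMIN timeout


async def blocking_pop(queue, timeout):
    """Stand-in for BZPOPMIN: wait up to `timeout` seconds for one item."""
    try:
        return await asyncio.wait_for(queue.get(), timeout)
    except asyncio.TimeoutError:
        return None


async def receive(queue, use_shield):
    """Stand-in for a receive() whose blocking pop may be wrapped in shield()."""
    pop = blocking_pop(queue, POP_TIMEOUT)
    if use_shield:
        # Cancelling the caller does not cancel the shielded pop itself.
        return await asyncio.shield(pop)
    return await pop


async def scenario(use_shield, delay_before_send=0.0):
    queue = asyncio.Queue()

    # A first consumer starts receiving, then "disconnects" (task cancelled).
    first = asyncio.create_task(receive(queue, use_shield))
    await asyncio.sleep(0.01)  # give the pop time to start waiting
    first.cancel()
    try:
        await first
    except asyncio.CancelledError:
        pass

    # Optionally wait, like the sleep added before group_send() in the test.
    await asyncio.sleep(delay_before_send)

    # An event is sent, then a second consumer tries to receive it.
    queue.put_nowait("event")
    return await receive(queue, use_shield)


if __name__ == "__main__":
    print("shield, no delay:  ", asyncio.run(scenario(True)))
    print("no shield:         ", asyncio.run(scenario(False)))
    print("shield, long delay:", asyncio.run(scenario(True, POP_TIMEOUT + 0.1)))
```

With shield and no delay, the pop left behind by the cancelled consumer is still waiting when the event arrives, so it takes the event and the second consumer times out with None. Without shield, or with a delay longer than the pop timeout, the second consumer is expected to get the event, which matches the observations listed above.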
I'm not sure yet what the best solution would be to fix this. Some ideas:
shield(). Or maybe it should at least not run BZPOPMIN inside shield().
prefix in each new_channel() call, rather than the same prefix in every call. Maybe channels_redis should always require a unique prefix in every call to new_channel(). I assume this would reduce performance and increase computing resource usage.
Possible workaround
Passing a prefix to new_channel() almost looks like an option, but:
channels.generic.websocket.WebsocketConsumer, passing a prefix to new_channel() doesn't seem to be configurable.
new_channel() is called without arguments in channels.consumer.AsyncConsumer.__call__(). So __call__() could be redefined, in a subclass in this project, to pass a prefix to new_channel(). Or new_channel() could be redefined in a subclass of RedisChannelLayer (probably better because new_channel() is a lot simpler than __call__()). So it can be done, but there's room for improvement.
Other thoughts
I wonder what all this means for the channels_redis receive lock. If a stray BZPOPMIN is still running, and another task tries to get the receive lock, then I think it will get the receive lock and start a new BZPOPMIN while the old one is still running, which seems to be something that channels_redis is trying to avoid. But depending on what the fix is for dropped events due to stray BZPOPMIN, maybe this won't matter. For example, maybe it will be fine if a stray BZPOPMIN is still running for a couple seconds after a new BZPOPMIN starts.
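The new_channel() override mentioned under "Possible workaround" could look roughly like the sketch below. It is untested against a real Redis server and rests on assumptions: that RedisChannelLayer.new_channel() is a coroutine accepting a prefix keyword, and that channel names must stay short (the suffix is truncated to 8 hex characters for that reason). UniquePrefixMixin, StandInLayer and UniquePrefixLayer are hypothetical names; StandInLayer only imitates the assumed behaviour so that the example runs without channels_redis installed.

```python
import asyncio
import uuid


class UniquePrefixMixin:
    """Hypothetical mixin: every new_channel() call gets its own prefix."""

    async def new_channel(self, prefix="specific"):
        # A short random suffix makes the prefix unique per call while
        # keeping the resulting channel name short.
        unique_prefix = f"{prefix}{uuid.uuid4().hex[:8]}"
        return await super().new_channel(prefix=unique_prefix)


class StandInLayer:
    """Stand-in for RedisChannelLayer so this sketch runs without Redis.

    Assumption: the real method is async, takes `prefix`, and returns a
    name shaped like "<prefix>.<client id>!<random>".
    """

    def __init__(self):
        self.client_prefix = uuid.uuid4().hex

    async def new_channel(self, prefix="specific"):
        return f"{prefix}.{self.client_prefix}!{uuid.uuid4().hex}"


class UniquePrefixLayer(UniquePrefixMixin, StandInLayer):
    """In a real project the second base would be RedisChannelLayer."""


async def demo():
    layer = UniquePrefixLayer()
    return await layer.new_channel(), await layer.new_channel()


if __name__ == "__main__":
    first, second = asyncio.run(demo())
    print(first)
    print(second)
```

In a real project the mixin would be combined with RedisChannelLayer instead of StandInLayer, and the resulting class would be referenced as the channel layer backend in settings. Whether a per-call prefix actually avoids the dropped events, and what it costs in performance, would still need to be checked.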