Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

RedisChannelLayer.receive() drops events due to asyncio.shield() in redis-py 4.5.4 #366

Closed
dcki opened this issue May 1, 2023 · 10 comments

Comments

@dcki
Copy link

dcki commented May 1, 2023

I'm attempting to upgrade a django project but running into some problems. An automated test that uses channels_redis and websockets is failing, and it seems to indicate a real bug. I think this bug is pretty serious and could result in a lot of dropped events.

I think channels_redis 4.1.0 and other versions are not compatible with the Redis client redis-py 4.5.4. I think the problem does not occur with redis 4.5.3. (However, I encounter a different problem with 4.5.3 as well...)

This issue report is not as concise and complete as I would like. (In particular I would like to be able to say more about what I observe with earlier redis client versions, particularly redis-py 4.5.3.) But I think it's important to get this conversation started soon, so I'm submitting it now anyway.

Reproducing this issue

After days of digging in code that I'm not familiar with, I've found a convincing explanation for this change in behavior and written a failing test in a fork of the channels_redis repo. (I run this test directly on MacOS 11.7 with Python 3.10.5 or 3.11.3) Here is a copy of that test:

@pytest.mark.asyncio
async def test_groups(channel_layer):
    channel_layer = RedisChannelLayer(hosts=TEST_HOSTS)
    channel_name1 = await channel_layer.new_channel(prefix="specific")
    channel_name2 = await channel_layer.new_channel(prefix="specific")
    await channel_layer.group_add("test-group", channel_name1)
    with pytest.raises(asyncio.TimeoutError):
        async with async_timeout.timeout(1):
            await channel_layer.receive(channel_name1)
    await channel_layer.group_discard("test-group", channel_name1)
    await channel_layer.group_add("test-group", channel_name2)
    #await asyncio.sleep(5)  # see below
    await channel_layer.group_send("test-group", {"type": "message.1"})
    async with async_timeout.timeout(1):
        assert (await channel_layer.receive(channel_name2))["type"] == "message.1"
    await channel_layer.flush()

If I run this test with redis-py 4.5.4 then it fails near the end waiting for receive(), unless I add a 5 second sleep before group_send() or comment out asyncio.shield() in redis.asyncio.client.Redis.execute_command() (just like the test in my project below).

To run this test, I do the following:

  1. Check out the main git branch (as it was April 8th, ba6dfcd633be7d505f739cc3c294853222350e48) in the channels_redis git repo.
  2. Paste the test into tests/test_core.py
  3. (Set up a virtualenv and activate it.)
  4. python -m pip install -e .[tests]
  5. pytest tests/test_core.py::test_groups

How I encountered this issue

The automated test in the django project I'm working on has a name like test_when_user1_clicks_button_in_browser_then_user2_can_immediately_see_update_in_second_browser, and basically does this:

  1. Create a Google Chrome browser for one test user.
  2. Create a second Google Chrome browser for a second test user.
  3. Log in, in both browsers.
  4. Login redirects to a page that has javascript that opens websocket connections during or immediately after page load.
    • When this websocket connects, the backend calls RedisChannelLayer.group_add() once. When it disconnects, group_discard() is called once (passing the same group name and channel name that were passed to group_add()).
    • In practice, either because of an implicit wait somewhere or by chance, the websocket connection consistently opens in both browsers at this point in the test.
  5. Refresh both windows.
    • In practice, the websocket connection consistently disconnects and connects in both browsers.
    • (This refresh is a side effect of how the test suite is set up and isn't needed when running only one test in the test suite, for example with python3 manage.py test my_tests.MyTests. test_when_user1_clicks_button_in_browser_then_user2_can_immediately_see_update_in_second_browser. The test suite is setup to login during suite setup and refresh before each test to have a clean page state. But this extra refresh is part of what causes the bug to repro.)
  6. In one window, click a button that sends an XHR request to a django endpoint that calls RedisChannelLayer.group_send().
  7. In the other window, wait for the event to come from the websocket.

This test passes with Python 3.10.5 on Debian 11 Bullseye (inside a Docker container running on MacOS 11.7 Big Sur with Docker Desktop Engine 20.10.23) with the following dependencies:

  • channels-redis 3.4.1
  • aioredis 1.3.1
  • hiredis 2.2.2
  • django 3.2.18
  • channels 3.0.4
  • asgiref 3.3.4
  • twisted 22.10.0

This test fails on step 7 with the following dependencies: The event never makes it to the other window via the websocket:

  • channels-redis 4.1.0
  • redis 4.5.4
  • (No aioredis)
  • (No hiredis. No particular reason hiredis was removed. Maybe I'll add it again after this issue is resolved.)
  • django 4.1.8
  • channels 4.0.0
  • asgiref 3.6.0
  • twisted 22.10.0

Also, when the test fails:

  • There is no error at any level of the call stack, at least as deep as redis.asyncio.client.Redis.execute_command().
  • I've observed CancelledError to be caught, logged by my own temporary code, and re-raised by my own temporary code. CancelledError is expected when a websocket disconnects, so that's not surprising.
  • Doing any of the following makes the test pass:
    • Adding a 5 second sleep before step 6.
    • Disabling the redirect from login (in step 4), and instead loading a blank page after login.
    • Commenting out asyncio.shield() in redis.asyncio.client.Redis.execute_command().
    • (Installing the other package versions mentioned above, instead.)
  • The following do not make the test pass:
    • Increasing the wait in step 7 to 20 seconds.

What changed

Here is what I think caused this change in behavior:

  • Prior to 4.5.4, the redis client did not use asyncio.shield() around the low level call to send the BZPOPMIN command to the Redis server.
  • channels_redis executes BZPOPMIN in RedisChannelLayer._brpop_with_clean().
  • Refresher on BZPOPMIN: It blocks until it pops something or the specified time interval has elapsed. channels_redis specifies 5 seconds by default.
  • I think that channels_redis assumes that it can abort a BZPOPMIN command using task.cancel(), but due to the asyncio.shield() that is now in the redis client, I think what happens is that task.cancel() cancels the code that handles a successful BZPOPMIN, and then the Redis server continues running the BZPOPMIN command. If after the task is cancelled the stray command returns a non-empty result, then that result gets dropped, and is never delivered.

Here is why I think that:

  • There is backup logic in the code near the BZPOPMIN call, but it doesn't add a backup to redis until after BZPOPMIN returns, which means that in the case of a task that is canceled while BZPOPMIN is in progress, the event doesn't get backed up.
  • When I add time.sleep(5) in my tests before group_send(), the test passes, which seems to indicate that a stray BZPOPMIN consumes the event when there is no sleep, and times out and does not consume the event when there is a sleep.
  • When I replace await asyncio.shield(...) with await ... in the redis client, the test passes.
  • When I modify channels_redis to import an old version of aioredis instead of redis.asyncio.client, and add shield() in aioredis, then the test fails.
    • I have not attempted to repro this recently and I'm not fully confident it's true anymore.
  • As mentioned above, when all websocket clients that are created during the project test never disconnect (and thus never cancel during BZPOPMIN), the test passes.
  • When I add a lot of print statements to channels_redis, the redis client, etc. and watch that while running the tests, and watch the output of redis-cli monitor, then what I see seems to allow the possibility that BZPOPMIN and shield() is the problem, and doesn't suggest any other explanations as far as I've been able to tell.
    • (I said "print statements" but technically I'm appending to a log file.)

How to fix: I don't know yet

I'm not sure yet what the best solution would be to fix this. Some ideas:

  • It seems like perhaps the existing backup functionality could be extended.
  • I read in a channels spec somewhere that messages should be delivered no more than once. Assuming that's relevant here, then I suppose to satisfy that requirement any fix will have to guarantee that a backup can't possibly be put back into a queue while an old event is still being processed, which might happen if an old event takes a long time to process, or if events are popped from backups after only a short time.
  • Maybe the Redis client actually shouldn't use shield(). Or maybe it should at least not run BZPOPMIN inside shield().
  • I noticed that there are existing channels_redis tests that are similar to mine, that are passing a different prefix in each new_channel() call, rather than the same prefix in every call. Maybe channels_redis should always require a unique prefix in every call to new_channel(). I assume this would reduce performance and increase computing resource usage.

Possible workaround

Passing a prefix to new_channel() almost looks like an option, but:

  • for the project I'm working on, which uses channels.generic.websocket.WebsocketConsumer, passing a prefix to new_channel() doesn't seem to be configurable. new_channel() is called without arguments in channels.consumer.AsyncConsumer.__call__(). So __call__() could be redefined, in a subclass in this project, to pass a prefix to new_channel(). Or new_channel() could be redefined in a subclass of RedisChannelLayer (probably better because new_channel() is a lot simpler than __call__()). So it can be done, but there's room for improvement.
  • I'm not sure specifying prefix is the right thing to do for a subclass of WebsocketConsumer. I don't know what all the implications are of doing that. It seems like it would at least probably reduce performance and increase computing resource usage.

Other thoughts

I wonder what all this means for the channels_redis receive lock. If a stray BZPOPMIN is still running, and another task tries to get the receive lock, then I think it will get the receive lock and start a new BZPOPMIN while the old one is still running, which seems to be something that channels_redis is trying to avoid. But depending on what the fix is for dropped events due to stray BZPOPMIN, maybe this won't matter. For example, maybe it will be fine if a stray BZPOPMIN is still running for a couple seconds after a new BZPOPMIN starts.

@carltongibson
Copy link
Member

Hi @dcki — thanks for this. A lot to look at there.

Take a look at #348, and the linked discussions there. It seems related to that I think (at initial read through)

@dcki
Copy link
Author

dcki commented May 2, 2023

Thanks @carltongibson. I think this issue is different, but I'm not sure I understand #348.

Reading that did make me wonder if Python 3.11.3 might help, since it sounds like it has an update related to channels_redis, but the test in the code snippet above still passes and fails in the same ways on Python 3.11.3 (fails unless a 5 second sleep is added or a particular asyncio.shield() is commented out, as described above).

@dcki
Copy link
Author

dcki commented May 2, 2023

@carltongibson as far as I can tell, #348 is resolved, because when I un-xfail the test_receive_cancel and test_message_expiry__group_send__one_channel_expires_message tests, and run pytest tests/test_core.py with Python 3.11.3 (and redis-py 4.5.4), then all tests pass.

However, in this issue, the test I included still fails with Python 3.11.3 and redis-py 4.5.4, which makes me think that this issue is different from #348. (I'm still not sure I understand #348 though.)

@carltongibson
Copy link
Member

Hey @dcki. You could open a PR removing those xfails if you like... 🙂 -- it would be good it that has been resolved. (You'd also get the attention of the folks on that issue commenting there.)

I need to have a play to be able to say more, but please do keep digging.

@dcki
Copy link
Author

dcki commented May 2, 2023

Unrelated to your last message about xfails:

Looking at the last couple comments in this redis-py issue, apparently there is at least one person that claims to know what they're talking about, that says asyncio.shield() should be removed. So maybe this is all a redis-py issue. I'm not sure what channels_redis should do to handle that in the meantime...

@dcki
Copy link
Author

dcki commented May 2, 2023

Even more confirmation about asyncio.shield(), from contributor chayim who put it in:

Now getting back to this you're bang on the asyncio.shield change is absolutely a bandaid, in order to get things secure, and out. It's not ideal - but at the time, with what was known, I felt (perhaps incorrectly) that this was the right move forward.

redis/redis-py#2695 (comment)

That was a week ago, April 23rd, and that PR looks pretty active, so it seems like a fix for this issue here in channels_redis might be on the way.

One action that might be good to take now would be to just put in my test, above, with xfails. Assuming that test does use channels_redis in a way that is intended to be supported. I'd like to create a PR for that if that seems like a good idea.

@carltongibson
Copy link
Member

@dcki It would certainly be worth a PR, so that it's easy to play with. 👍

clavay added a commit to clavay/PyScada that referenced this issue May 3, 2023
since the version 4 of channel redis version, the channel layer is not
empty after the first read

we read it again to empty it

maybe it is related to
django/channels_redis#348
or to django/channels_redis#366
@dcki
Copy link
Author

dcki commented May 8, 2023

redis-py 4.5.5 was released today and both tests above pass when using that version of redis-py. 🎉

@carltongibson
Copy link
Member

That's great @dcki.

Were there any tests you still wanted to add to the test suite here?

clavay added a commit to pyscada/PyScada that referenced this issue May 17, 2023
since the version 4 of channel redis version, the channel layer is not
empty after the first read

we read it again to empty it

maybe it is related to
django/channels_redis#348
or to django/channels_redis#366
@carltongibson
Copy link
Member

#377 added explicit testing against multiple redis-py versions:

 redis46: redis>=4.6,<4.7
 redis50: redis>=5.0,<5.1
 redismain: https://github.com/redis/redis-py/archive/master.tar.gz

Any of those should work. Please open a new issue if you hit problems with one of those versions.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants