[FAL-2030] Updates kombu package to support multi-tenant redis authentication #28020
Thanks for the pull request, @pomegranited! I've created OSPR-5880 to keep track of it in JIRA, where we prioritize reviews. Please note that it may take us up to several weeks or months to complete a review and merge your PR. Feel free to add as much of the following information to the ticket as you can:
All technical communication about the code itself will be done via the GitHub pull request interface. As a reminder, our process documentation is here. Please let us know once your PR is ready for our review and all tests are green.
Failed python test looks unrelated, will re-run.
jenkins run python
This failed python test also looks unrelated to this change, will re-run:
jenkins run python
@pomegranited Thank you for your contribution. Please let me know once it is ready for edX review.
Constrains kombu to https://github.com/open-craft/kombu/tree/ggabor/FAL-2030-4.6.11 to allow us to use celery with multi-tenant redis. Ran `make update` to incorporate this constraint into the requirements.
which can be configured from the lms/studio environment
bfa5357 to 3c1a43e
👍 🎉 @pomegranited this looks good to me!
Your PR has finished running tests. There were no failures.
@natabene This task is ready for edX review. If engineering is ok with what we're trying to do here, then maybe a core committer could do the actual review, e.g. @bradenmacdonald or @Agrendalath?
@jmbowman Are you ok with @bradenmacdonald or @Agrendalath reviewing this?
@pomegranited We gave it a quick look and feel that edX needs a closer look, so let's wait for edX review on this PR.
@pomegranited looking at the changelog for Celery 5.0, I don't see any major changes that would cause issues. Is the reluctance to upgrade because of a fear of breaking edx.org without further testing, or something else? My ideal outcome would be that we upgrade celery and that OpenCraft can lead that charge with support from edX where needed. If we don't pursue that, adding the fork here would come with an expectation that open-craft maintains the fork and backports any security updates that get pushed upstream. Is that something you can commit to if we were to merge this?
Hi @feanil, you're totally right, upgrading celery would be the most straightforward way to pull in these changes, but we didn't have budget for the testing and other changes which may be required to perform that upgrade. I see that the comments around the
Hmm.. probably not, given the above budget constraints.
We actually did most of the work needed to upgrade to Celery 5 early this year while debugging an issue that turned out to be related to our settings. We held off because quite a few people seemed to be reporting problems with the early 5.x releases, but hopefully that's sorted out by now. There's more context in https://openedx.atlassian.net/browse/BOM-2164.
I'm currently reluctant to make this change if we don't have a clear owner for undoing it later. I'd prefer to see the proper upgrade go through instead, and I'm wondering if there is a way to spend energy on that instead of this. If OC can take a first stab at it, perhaps edX can help shepherd it through. @jmbowman what are your thoughts on that?
@pomegranited unfortunately, the changes introduced in this pull request are causing the celery workers to time out in the sandbox linked in the pull request's description.
The app server(s) provisioned initially failed by timing out during `TASK [demo : import demo course]`. However, the issue was masked by the `DEMO_ROLE_ENABLED: false` option added to the instance's configuration.
Provisioning timed out because the celery workers were attempting to import the demo course while stuck in a loop of timing out, terminating, and restarting.
The issue now shows up again when you check the extended heartbeat of the instance you linked: it gets stuck loading, and then forwards you to a server error page.
`/edx/bin/supervisorctl fg lms` logs:
[2021-09-13 17:25:35 +0000] [1213781] [INFO] GET /heartbeat
[2021-09-13 17:26:30 +0000] [813] [CRITICAL] WORKER TIMEOUT (pid:1213336)
[2021-09-13 17:26:30 +0000] [1213336] [INFO] Worker exiting (pid: 1213336)
2021-09-13 17:26:30,808 INFO 1213336 [newrelic.core.agent] [user None] [ip 54.227.245.241] agent.py:740 - New Relic Python Agent Shutdown
[2021-09-13 17:26:31 +0000] [813] [WARNING] Worker with pid 1213336 was terminated due to signal 9
`/edx/bin/supervisorctl fg edxapp_worker:lms_high_1` logs:
2021-09-13 17:37:08,807 ERROR 1213537 [celery.worker.consumer.consumer] [user None] [ip None] consumer.py:428 - consumer: Cannot connect to redis://fal-2030-master:**@redis-ocim-dev.opencraft.hosting:6379/0: Error 111 connecting to redis-ocim-dev.opencraft.hosting:6379. Connection refused..
I don't have much context on the changes to understand exactly what is causing the issue and resolve it, so would you please be able to debug the issue more closely?
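One quick way to narrow this down is to check plain TCP reachability of the Redis endpoint from the app server, since `Error 111` (connection refused) is raised before any authentication takes place. A minimal sketch, assuming Python 3 is available on the host; `can_connect` is a hypothetical helper written for illustration and is not part of edx-platform, kombu, or Ocim:

```python
import socket


def can_connect(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within `timeout` seconds."""
    try:
        # create_connection resolves the host and attempts a TCP handshake.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections (errno 111 on Linux), timeouts,
        # and DNS resolution failures.
        return False
```

For example, `can_connect("redis-ocim-dev.opencraft.hosting", 6379)` returning `False` would point to a network or service-level problem rather than to the credentials in the broker URL.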
@nizarmah These issues are more likely due to our Ocim setup than the changes in this PR -- Ocim prod doesn't use redis by default for new appservers (because this change isn't available in all the branches), but you have to add this setting in order to use rabbitmq (which is on our master watched fork config):
@feanil Understood.
Sure, I can create a task to do this next sprint, if you would accept the OSPR? We enable this |
@pomegranited: In place of this OSPR, we would prefer to just complete the required celery upgrade. In the past, our team had checked and the celery upgrade seemed to pass tests, but early on in the celery 5.x release people were reporting issues that made it seem unstable. Now that some time has passed since its initial release, next steps for the actual celery upgrade would be:
Are there any parts of 1-3 your team wants to take on to move this forward faster? We can clearly take care of 4. If all goes well, we are done. If not, we can discuss next steps. Regarding point 2, you had written:
I don't think we have any specific plans here, so if that covers any testing you'd see on a sandbox, maybe that is enough. I think we are willing to take on a little risk here and rollback if there are issues. Let me know what you think.
@robrap @jmbowman I'm glad to hear that edX has already done most of the upgrade work -- that makes it much more possible for us to finish it up. Thank you for laying out such a clear plan! The stability issues are the most worrisome, since it's difficult to load test on sandboxes, but we'll investigate and see what we can find. I'll make sure that your points are addressed in the OSPR, which should land sometime in the next couple of weeks. Closing this PR, thanks for taking the time to discuss! CC @feanil
@pomegranited Even though your pull request wasn’t merged, please take a moment to answer a two-question survey so we can improve your experience in the future.
## Description
Cherry-pick of https://github.com/edx/edx-platform/pull/28849
The release of Redis 6 introduced the ability to share this service among multiple accounts, securing access to specific keys using a username/password and ACLs. Celery uses kombu to provide Redis brokering for the message queues, but kombu was missing a couple of key enhancements to support multi-tenant Redis. This PR makes the broker transport options configurable, so that deployments can benefit from the above-mentioned updates.
## Supporting information
OpenCraft wants to use multi-tenant Redis for their shared hosting service, so we can stop maintaining RabbitMQ.
## Deadline
None
## Other information
The initial PR that partially contained these changes: https://github.com/edx/edx-platform/pull/28020
## Reviewers
- [ ] @pomegranited
- [ ] @Agrendalath
Description
The release of Redis 6 introduced the ability to share this service among multiple accounts, securing the access to specific keys using a username/password and ACLs.
Celery uses `kombu` to provide redis brokering for the message queues, but `kombu` was missing a couple of key enhancements to support multi-tenant redis.
This PR allows us to backport these changes without upgrading celery, since celery is constrained to <5.0. Upgrading celery requires many more code changes and testing, which are out of scope here.
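For illustration, a multi-tenant broker URL carries the per-tenant ACL username and password in the usual `redis://user:password@host:port/db` form. A minimal sketch, assuming the patched kombu picks the username up from the URL; the helper name, tenant name, password, and host below are hypothetical and not taken from this PR:

```python
from urllib.parse import quote


def redis_broker_url(username, password, host, port=6379, db=0):
    """Build a redis:// URL that carries Redis 6 ACL credentials."""
    # Percent-encode the credentials so characters such as '@' or ':'
    # cannot be mistaken for URL delimiters.
    user = quote(username, safe="")
    secret = quote(password, safe="")
    return f"redis://{user}:{secret}@{host}:{port}/{db}"


# Hypothetical tenant credentials, for illustration only.
BROKER_URL = redis_broker_url("tenant-a", "s3cr:et@", "redis.example.com")
```

Each tenant gets its own username, so the Redis ACLs can restrict that user to its own keys while all tenants share one Redis service.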
Supporting information
OpenCraft wants to use multi-tenant redis for their shared hosting service, so we can stop maintaining RabbitMQ.
Testing instructions
Sandbox: Provisioning
The extended heartbeat check for celery is sufficient to test that these changes don't disrupt functionality on the sandbox as configured below.
`celery` check passes.
Configuration extra settings:
Deadline
None
Other information
Reviewers