Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[FAL-2030] Updates kombu package to support multi-tenant redis authentication #28020

Closed
wants to merge 2 commits into from

Conversation

pomegranited
Copy link
Contributor

@pomegranited pomegranited commented Jun 24, 2021

Description

The release of Redis 6 introduced the ability to share this service among multiple accounts, securing the access to specific keys using a username/password and ACLs.

Celery uses kombu to provide redis brokering for the message queues, but kombu was missing a couple of key enhancements to support multi-tenant redis:

This PR allows us to backport these changes without upgrading celery, since celery is constrained to <5.0. Upgrading celery requires many more code changes and testing, which are out of scope here.

Supporting information

OpenCraft wants to use multi-tenant redis for their shared hosting service, so we can stop maintaining RabbitMQ.

Testing instructions

Sandbox: Provisioning

The extended heartbeat check for celery is sufficient to test that these changes don't disrupt functionality on the sandbox as configured below.

  1. Visit https://pr28020.sandbox.opencraft.hosting/heartbeat?extended
  2. Ensure that the celery check passes.

Configuration extra settings:

# Use redis instead of rabbitmq
EDXAPP_CELERY_BROKER_TRANSPORT: 'redis'
EDXAPP_CELERY_BROKER_HOSTNAME: 'redis-ocim-dev.opencraft.hosting:6379'
EDXAPP_CELERY_BROKER_USE_SSL: true
EDXAPP_CELERY_BROKER_VHOST: '/0'

# Using https://github.com/open-craft/kombu/gabor/FAL-2030 lets us specify a user+password for multi-tenant redis6
EDXAPP_CELERY_PASSWORD: '<REDACTED>'
EDXAPP_CELERY_USER: 'fal-2030-master'

# Using https://github.com/open-craft/edx-platform/jill/redis-kombu-auth lets us configure additional transport options.
# and https://github.com/open-craft/kombu/gabor/FAL-2030 lets us specify a global key prefix.
EDXAPP_CMS_ENV_EXTRA:
  CELERY_BROKER_TRANSPORT_OPTIONS:
    global_keyprefix: '_fal-2030-master_'

EDXAPP_LMS_ENV_EXTRA:
  CELERY_BROKER_TRANSPORT_OPTIONS:
    global_keyprefix: '_fal-2030-master_'

Deadline

None

Other information

  1. We assume edX would rather the backported kombu branch/tag exist on a fork that they own, so please feel free to create one and we can submit a PR against it to support this backport.
  2. Alternatively we can use this open-craft fork/tag: v4.6.11.1.

Reviewers

@openedx-webhooks
Copy link

openedx-webhooks commented Jun 24, 2021

Thanks for the pull request, @pomegranited! I've created OSPR-5880 to keep track of it in JIRA, where we prioritize reviews. Please note that it may take us up to several weeks or months to complete a review and merge your PR.

Feel free to add as much of the following information to the ticket as you can:

  • supporting documentation
  • Open edX discussion forum threads
  • timeline information ("this must be merged by XX date", and why that is)
  • partner information ("this is a course on edx.org")
  • any other information that can help Product understand the context for the PR

All technical communication about the code itself will be done via the GitHub pull request interface. As a reminder, our process documentation is here.

Please let us know once your PR is ready for our review and all tests are green.

@openedx-webhooks openedx-webhooks added needs triage open-source-contribution PR author is not from Axim or 2U labels Jun 24, 2021
@pomegranited pomegranited marked this pull request as draft June 24, 2021 02:44
@pomegranited
Copy link
Contributor Author

Failed python test looks unrelated, will re-run.

Run Tests / commonlib-unit / xmodule.xmodule.modulestore.tests.test_assetstore.TestMongoAssetMetadataStorage.test_get_all_assets_with_paging_1 (from pytest)
Failing for the past 1 build (Since Failed#31999 )
Took 0.5 sec.
Error Message

AssertionError: assert 'demo.swf' == 'weather_patterns.bmp'
  - weather_patterns.bmp
  + demo.swf

@pomegranited
Copy link
Contributor Author

jenkins run python

@pomegranited
Copy link
Contributor Author

pomegranited commented Jun 28, 2021

This failed python test also looks unrelated to this change, will re-run:

Run Tests / commonlib-unit / xmodule.xmodule.tests.test_capa_module.ProblemBlockTest.test_random_seed_with_reset_6_onreset (from pytest)
Error Message

AssertionError: Could not get a new seed from reset after 10 tries

@pomegranited
Copy link
Contributor Author

jenkins run python

@natabene
Copy link
Contributor

@pomegranited Thank you for your contribution. Please let me know once it is ready for edX review.

@openedx-webhooks openedx-webhooks added waiting on author PR author needs to resolve review requests, answer questions, fix tests, etc. and removed needs triage labels Jul 1, 2021
Constrains kombu to https://github.com/open-craft/kombu/tree/ggabor/FAL-2030-4.6.11
to allow us to use celery with multi-tenant redis.

Ran `make update` to incorporate this constraint into the requirements.
which can be configured from the lms/studio environment
@pomegranited pomegranited marked this pull request as ready for review August 28, 2021 08:25
@openedx-webhooks openedx-webhooks added needs triage and removed waiting on author PR author needs to resolve review requests, answer questions, fix tests, etc. labels Aug 28, 2021
@gabor-boros
Copy link
Contributor

👍 🎉 @pomegranited this looks good to me!

  • I tested what's in the test instructions
  • I read through the code
  • I checked for accessibility issues
  • Includes documentation
  • I made sure any change in configuration variables is reflected in the corresponding client's configuration-secure repository.

@edx-status-bot
Copy link

Your PR has finished running tests. There were no failures.

@pomegranited
Copy link
Contributor Author

@natabene This task is ready for edX review.

If engineering is ok with what we're trying to do here, then maybe a core commiter could do the actual review, e.g. @bradenmacdonald or @Agrendalath?

@natabene
Copy link
Contributor

natabene commented Sep 2, 2021

@jmbowman Are you ok with @bradenmacdonald or @Agrendalath reviewing this?

@pomegranited
Copy link
Contributor Author

Nudge @jmbowman CC @natabene :)

@natabene
Copy link
Contributor

natabene commented Sep 8, 2021

@pomegranited We gave it a quick look and feel that edX needs a closer look, so let's wait for edX review on this PR.

@feanil
Copy link
Contributor

feanil commented Sep 10, 2021

@pomegranited looking at the changelog for Celery 5.0, I don't see any major changes that would cause issues. Is the reluctance to upgrade because of a fear of breaking edx.org without further testing, or something else? My ideal outcome would be that we upgrade celery and that OpenCraft can lead that charge with support from edX where needed.

If we don't pursue that, adding the fork here would come with an expectation that open-craft is maintaining the fork and backporting any security updates that get pushed upstream, is that something you can commit to if we were to merge this?

@pomegranited
Copy link
Contributor Author

Hi @feanil, you're totally right, upgrading celery would be the most straightforward way to pull in these changes, but we didn't have budget for the testing and other changes which may be required to perform that upgrade. I see that the comments around the celery constraint have been updated recently -- do you know if this update is on edX's roadmap already?

If we don't pursue that, adding the fork here would come with an expectation that open-craft is maintaining the fork and backporting any security updates that get pushed upstream, is that something you can commit to if we were to merge this?

Hmm.. probably not, given the above budget constraints.

@jmbowman
Copy link
Contributor

We actually did most of the work needed to upgrade to Celery 5 early this year while debugging an issue that turned out to be related to our settings. We held off because quite a few people seemed to be reporting problems with the early 5.x releases, but hopefully that's sorted out by now. There's more context in https://openedx.atlassian.net/browse/BOM-2164 .

@feanil
Copy link
Contributor

feanil commented Sep 13, 2021

I'm currently reluctant to make this change if we don't have a clear owner for undoing it later. I'd prefer to see the proper upgrade go through instead and I'm wondering if there is a way to spend energy on that instead of this? If OC can take a first stab at it, perhaps edX can help shepard it through. @jmbowman what are your thoughts on that?

@feanil feanil requested a review from robrap September 13, 2021 15:25
Copy link
Contributor

@nizarmah nizarmah left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@pomegranited unfortunately, the changes introduced in this pull request are resulting in the timing out of the celery workers in the sandbox linked in the pull request's description.

The app server(s) provisioned initially failed by timing out during the TASK [demo : import demo course].

However, the issue was masked/hidden by the DEMO_ROLE_ENABLED: false option added to the instance's configuration.

The reason it was timing out is because the celery workers were attempting to import the demo course; however, the celery workers were stuck in a loop of timing out, terminating, and restarting.

The issue shows up again, now, when you check the extended heartbeat of the instance you linked. It gets stuck loading, and then forwards you to a server error page.

/edx/bin/supervisorctl fg lms logs.
[2021-09-13 17:25:35 +0000] [1213781] [INFO] GET /heartbeat
[2021-09-13 17:26:30 +0000] [813] [CRITICAL] WORKER TIMEOUT (pid:1213336)
[2021-09-13 17:26:30 +0000] [1213336] [INFO] Worker exiting (pid: 1213336)
2021-09-13 17:26:30,808 INFO 1213336 [newrelic.core.agent] [user None] [ip 54.227.245.241] agent.py:740 - New Relic Python Agent Shutdown
[2021-09-13 17:26:31 +0000] [813] [WARNING] Worker with pid 1213336 was terminated due to signal 9
/edx/bin/supervisorctl fg edxapp_worker:lms_high_1 logs.
2021-09-13 17:37:08,807 ERROR 1213537 [celery.worker.consumer.consumer] [user None] [ip None] consumer.py:428 - consumer: Cannot connect to redis://fal-2030-master:**@redis-ocim-dev.opencraft.hosting:6379/0: Error 111 connecting to redis-ocim-dev.opencraft.hosting:6379. Connection refused..

I don't have much context on the changes to understand exactly what is causing the issue and resolve it, so would you please be able to debug the issue more closely?

@pomegranited
Copy link
Contributor Author

@nizarmah
bq. unfortunately, the changes introduced in this pull request are resulting in the timing out of the celery workers in the sandbox linked in the pull request's description.

These issues are more likely due to our Ocim setup than the changes in this PR -- Ocim prod doesn't use redis by default for new appservers (because this change isn't available in all the branches), but you have to add this setting in order to use rabbitmq (which is on our master watched fork config):

# SE-3680 2020-11-25 SE-3680 uptream master using redis, not rabbitmq, but we still are
EDXAPP_CELERY_BROKER_TRANSPORT: 'amqp'

@pomegranited
Copy link
Contributor Author

I'm currently reluctant to make this change if we don't have a clear owner for undoing it later. I'd prefer to see the proper upgrade go through instead and I'm wondering if there is a way to spend energy on that instead of this?

@feanil Understood.

If OC can take a first stab at it, perhaps edX can help shepherd it through. @jmbowman what are your thoughts on that?

Sure, I can create a task to do this next sprint, if you would accept the OSPR?

We enable this check_celery function by default in our extended heartbeats, so that does a basic "do tasks work" check. Are there any other celery tasks that we need to test to make sure the upgrade worked?

@robrap
Copy link
Contributor

robrap commented Sep 14, 2021

@pomegranited: In place of this OSPR, we would prefer to just complete the required celery upgrade. In the past, our team had checked and the celery upgrade seemed to pass tests, but early on in the celery 5.x release people were reporting issues that made it seem unstable. Now that some time has passed since its initial release, next steps for the actual celery upgrade would be:

  1. Unpin and ensure that tests continue to pass.
  2. Whatever manual testing is readily available.
  3. Double check celery 5.x release notes and issues to see if it seems more stable.
  4. Deploy upgrade to edx.org and either it works, or we rollback with more info about missing tests and fixes required.

Is there any parts of 1-3 your team wants to take on to move this forward faster? We can clearly take care of 4. If all goes well, we are done. If not, we can discuss next steps.

Regarding point 2, you had written:

We enable this check_celery function by default in our extended heartbeats, so that does a basic "do tasks work" check. Are there any other celery tasks that we need to test to make sure the upgrade worked?

I don't think we have any specific plans here, so if that covers any testing you'd see on a sandbox, maybe that is enough. I think we are willing to take on a little risk here and rollback if there are issues.

Let me know what you think.

@pomegranited
Copy link
Contributor Author

@robrap @jmbowman I'm glad to hear that edX has already done most of the upgrade work -- that makes it much more possible for us to finish it up. Thank you for laying out such a clear plan!

The stability issues are the most worrisome, since it's difficult to load test on sandboxes, but we'll investigate and see what we can find.

I'll make sure that your points are addressed in the OSPR, which should land sometime in the next couple weeks.

Closing this PR, thanks for taking the time to discuss! CC @feanil

@openedx-webhooks
Copy link

@pomegranited Even though your pull request wasn’t merged, please take a moment to answer a two question survey so we can improve your experience in the future.

@pomegranited
Copy link
Contributor Author

Update @robrap @jmbowman CC @feanil : We don't have time during the next sprint to submit a PR to upgrade celery, so will aim to get this assigned in the sprint starting 5 October.

edx-community-bot referenced this pull request Oct 13, 2021
## Description

Cherry-pick of https://github.com/edx/edx-platform/pull/28849

The release of Redis 6 introduced the ability to share this service among multiple accounts, securing the access to specific keys using a username/password and ACLs. Celery uses kombu to provide Redis brokering for the message queues, but kombu was missing a couple of key enhancements to support multi-tenant redis.

This PR allows to configure the broker transport options to enjoy the benefits of the above-mentioned updates.

## Supporting information

OpenCraft wants to use multi-tenant Redis for their shared hosting service, so we can stop maintaining RabbitMQ.

## Deadline

None

## Other information

The initial PR that partially contained these changes: https://github.com/edx/edx-platform/pull/28020

## Reviewers

- [ ] @pomegranited 
- [ ] @Agrendalath
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
open-source-contribution PR author is not from Axim or 2U rejected
Projects
None yet
Development

Successfully merging this pull request may close these issues.

9 participants