Skip to content

Conversation

@rtibbles
Copy link
Member

@rtibbles rtibbles commented Oct 30, 2025

Summary

  • Add worker_soft_shutdown_timeout (reading from CELERY_WORKER_SOFT_SHUTDOWN_TIMEOUT env var, defaulting to 28s) to Celery config in settings.py
  • Add REMAP_SIGTERM=SIGQUIT to docker compose env vars to trigger soft shutdown

When K8s sends SIGTERM during pod termination, workers will now:

  1. Stop accepting new tasks
  2. Continue processing current task for up to 28 seconds
  3. Exit cleanly if task completes, or timeout after 28s
  4. Allow K8s 2s buffer before 30s grace period expires

❗ 🤖 Generated with Claude Code

References

Fixes #5000

Reviewer guidance

Does this match the spec outlined by @bjester in the issue above? I think it does - noting that we have already upgraded to a compatible version of celery https://github.com/learningequality/studio/blob/hotfixes/requirements.txt#L31

Lastly - for manual testing - what happens when a long running publish is interrupted by this? Does it get properly rerun? I think it should because the "change event" should still be marked as unapplied, and not errored, but would be good to check. It will be a bit of a bummer that it will start from the beginning again, but it's better than not happening at all!

Copy link
Member

@bjester bjester left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The value of CELERY_WORKER_SOFT_SHUTDOWN_TIMEOUT rightfully needs coordination with the k8s graceful shutdown, like was done with the unused infra code here, but without a coordinating change with docker's timeout, setting it to the already default value in docker-compose.yml feels unnecessary. Google says Docker's default timeout should be 10 seconds. I suggest either we remove it from the docker compose file, or coordinate it with docker's timeout.

Lastly, as mentioned on Slack, we don't really use the infra code that was changed here, so we need to work with infra to implement those changes in the right place. The changes here could stay, or be removed 🤷

@DXCanas DXCanas mentioned this pull request Dec 4, 2025
Implements soft shutdown feature from Celery 5.5.3 to prevent task
interruption during pod termination. Resolves #5000.

Changes:
- Add worker_soft_shutdown_timeout (28s) to Celery config in settings.py
- Add REMAP_SIGTERM=SIGQUIT to K8s shared env vars to trigger soft shutdown

When K8s sends SIGTERM during pod termination, workers will now:
1. Stop accepting new tasks
2. Continue processing current task for up to 28 seconds
3. Exit cleanly if task completes, or timeout after 28s
4. Allow K8s 2s buffer before 30s grace period expires

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <[email protected]>
@rtibbles rtibbles force-pushed the claude/issue-5000-hotfix-011CUcP3gvHvA6ojaMVJ1ppx branch from 4b51ed8 to b50bdb9 Compare December 16, 2025 21:52
Copy link
Member

@bjester bjester left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good. The env var REMAP_SIGTERM appears entirely specific to Celery so it's fine to define it for all services in the docker-compose.yml

@rtibbles rtibbles merged commit f8b957f into hotfixes Dec 18, 2025
19 checks passed
@rtibbles rtibbles deleted the claude/issue-5000-hotfix-011CUcP3gvHvA6ojaMVJ1ppx branch December 18, 2025 17:55
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants