Protect CI cache save against cancellation #168310
Pull request overview
Updates the CI workflow caching logic to ensure cache save operations still complete when a workflow run is cancelled, preventing “reserved but not uploaded” caches from breaking subsequent runs.
Changes:
- Split combined `actions/cache` usage into explicit `actions/cache/restore` and `actions/cache/save` steps for the base venv and uv wheel cache.
- Add `always()`-guarded cache save steps (with success gating where appropriate) so cache uploads can complete even during cancellation.
- Add step `id`s needed for outcome-based conditions (e.g., `install-os-deps`, `create-venv`, `cache-uv`).
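The split described above might look roughly like the following sketch for the base venv. The step ids `cache-venv` and `create-venv` appear in this PR; the `venv` path, the cache key, and the `run` command are placeholders assumed for illustration, and the restore step reuses the commit pin shown for `actions/cache/save`.

```yaml
# Sketch only: path, key, and the run command are placeholders,
# not the values used in the real workflow.
- name: Restore base Python virtual environment
  id: cache-venv
  uses: actions/cache/restore@668228422ae6a00e4ad889ee87cd7109ec5666a7 # v5.0.4
  with:
    path: venv
    key: example-venv-key  # hypothetical key

- name: Create Python virtual environment
  id: create-venv
  # Only build the venv when the restore step found no exact match.
  if: steps.cache-venv.outputs.cache-hit != 'true'
  run: python -m venv venv  # placeholder for the real setup commands

- name: Save base Python virtual environment
  # always() lets this step start even while the run is being cancelled;
  # the outcome check skips it unless the venv was fully created.
  if: always() && steps.create-venv.outcome == 'success'
  uses: actions/cache/save@668228422ae6a00e4ad889ee87cd7109ec5666a7 # v5.0.4
  with:
    path: venv
    key: example-venv-key  # same hypothetical key as the restore step
```

If `create-venv` is skipped because of a cache hit, its outcome is `skipped` rather than `success`, so the save step does not run in that case either.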
- name: Save uv wheel cache
  if: |
    (success() && steps.cache-venv.outputs.cache-hit != 'true')
    || (always()
    && steps.create-venv.outcome == 'success'
    && steps.cache-uv.outputs.cache-matched-key == '')
  uses: actions/cache/save@668228422ae6a00e4ad889ee87cd7109ec5666a7 # v5.0.4
  with:
    path: ${{ env.UV_CACHE_DIR }}
    key: >-
      ${{ runner.os }}-${{ runner.arch }}-${{ steps.python.outputs.python-version }}-${{
      steps.generate-uv-key.outputs.key }}
I'm wondering whether the described issue can occur for the uv cache at all, since its key includes the current datetime down to the second. I'd expect it not to be possible, because the current workflow is cancelled first, before a new one starts.
Not exactly, as you've guessed already.
What I've tried to achieve here is a minor optimization for the period right after a HA version bump, when no caches exist yet. If multiple PRs are merged in short succession, every one of them will try to rebuild the venv without a cache. That's just unnecessary work. The change here resolves this, as only the first "uv cache save" after a HA version bump will always run. In all other cases `steps.cache-uv.outputs.cache-matched-key` will be the restored cache key, and thus only the default condition `success() && steps.cache-venv.outputs.cache-hit != 'true'` will apply, just like it does now.
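To make the two branches of the condition concrete: a restore step whose key contains a timestamp never gets an exact hit, so whether any cache existed at all is only visible through `cache-matched-key`. Below is a minimal sketch of such a restore step, assuming a `restore-keys` prefix fallback. The path variable and the `generate-uv-key` output come from the diff above; the `uv-` prefix is hypothetical.

```yaml
- name: Restore uv wheel cache
  id: cache-uv
  uses: actions/cache/restore@668228422ae6a00e4ad889ee87cd7109ec5666a7 # v5.0.4
  with:
    path: ${{ env.UV_CACHE_DIR }}
    key: >-
      ${{ runner.os }}-${{ runner.arch }}-${{ steps.python.outputs.python-version }}-${{
      steps.generate-uv-key.outputs.key }}
    # Hypothetical prefix: assumes the generated key starts with "uv-".
    # On a prefix match, the most recently created matching cache is restored.
    restore-keys: |
      ${{ runner.os }}-${{ runner.arch }}-${{ steps.python.outputs.python-version }}-uv-
```

Under this assumption, `cache-matched-key` is empty only when not even a prefix match was found, which is the post-version-bump case that the `always()` branch targets.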
      ${{ runner.os }}-${{ runner.arch }}-${{ steps.python.outputs.python-version }}-${{
      steps.generate-uv-key.outputs.key }}
- name: Save base Python virtual environment
  if: always() && steps.create-venv.outcome == 'success'
Can we somehow detect if the workflow gets cancelled during the execution of the step?
Because in my opinion we should not upload the cache if the workflow was cancelled before the upload step was started.
Example:
The current workflow is executing "Dump pip freeze" when another PR with a dependency change is merged. Uploading the cache is then useless, as it is already invalid and the next workflow would create a new one anyway.
I'm not sure I fully understand what you want to achieve here. Sure, we can detect whether an earlier step was cancelled. I'm doing that here with `steps.create-venv.outcome == 'success'` to prevent cache saves if the job was cancelled before the venv was fully created. It is not possible to know whether the next workflow run was triggered by another dependency bump. It might just as well be another unrelated PR. So I'd suggest still saving the venv. At worst we lose 20s for the cache upload, and if the cache is indeed needed for the next run, it doesn't have to be recreated there.
@edenhaus Any update here?
Proposed change
If a workflow run is cancelled while a cache is being saved, it can happen that the cache key is marked as reserved while the actual cache upload fails. When the next run again tries to upload the cache, it silently fails with "Unable to reserve cache ... another job may be creating this cache.". Any subsequent jobs will then fail to restore the cache and the workflow fails.
E.g. this happened yesterday:
`cancel-in-progress` setting just at the moment when the venv cache would have been saved. https://github.com/home-assistant/core/actions/runs/24403629527/job/71280946800#step:28:1
This PR separates the cache restore and save steps and adds `always()` conditionals to the cache save steps. This will cause them to run even if the workflow itself is being cancelled. In the case of real errors, the run would force exit 5min after that, though this is unlikely to be necessary.
This could increase the time until the last job is cancelled by <1min, though the conditions are designed to only trigger if the venv was fully created in the first place. The delay is recouped on the subsequent workflow run as it doesn't need to regenerate the venv.
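For context, cancellations like the one above typically come from a workflow-level concurrency group. A minimal sketch, assuming a group keyed on workflow name and ref (the group expression in the real workflow may differ):

```yaml
concurrency:
  # Hypothetical group expression; runs that resolve to the same
  # group value are treated as belonging together.
  group: ${{ github.workflow }}-${{ github.ref }}
  # A newer run in the same group cancels the one still in progress.
  cancel-in-progress: true
```

Once a run is cancelled, steps whose `if` condition does not opt in via `always()` (or `cancelled()`) no longer start, which is why the save steps need the explicit guard.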
https://docs.github.com/en/actions/reference/workflows-and-actions/workflow-cancellation
https://docs.github.com/en/actions/reference/workflows-and-actions/contexts#steps-context