Skip to content

Add vdsm CI workflow with golden image caching#24

Merged
cmeans-claude-dev[bot] merged 6 commits into
mainfrom
feat/vdsm-ci
Apr 14, 2026
Merged

Add vdsm CI workflow with golden image caching#24
cmeans-claude-dev[bot] merged 6 commits into
mainfrom
feat/vdsm-ci

Conversation

@cmeans-claude-dev
Copy link
Copy Markdown
Contributor

@cmeans-claude-dev cmeans-claude-dev Bot commented Apr 14, 2026

Summary

Closes the last phase of #18 by adding a GitHub Actions workflow that runs the 47 vdsm integration tests on every PR.

Job design:

  • Dedicated workflow file (.github/workflows/vdsm.yml), independent from ci.yml so a vdsm flake never blocks unit-test merges
  • runs-on: ubuntu-24.04 — Azure-backed GH-hosted runners expose /dev/kvm for nested KVM
  • Golden image cached via actions/cache@v4, keyed on DSM version + hash of setup scripts
  • Cache miss path runs scripts/vdsm_setup.py --yes to build from scratch (~10 min)
  • continue-on-error: true initially — not a blocking required check until stable across several runs
  • timeout-minutes: 30
  • Test logs uploaded as artifacts on failure

Prep change: scripts/vdsm_setup.py gets a --yes/-y flag so the overwrite prompt doesn't stop the non-interactive CI run.

In-scope revert of PR #23 recycle bin enablement: The first CI run (the fresh golden image build) surfaced that PR #23's synoshare --setopt testshare enable_recycle_bin=yes call fails with rc=255 on DSM 7.2.2 — --setopt is not a valid synoshare subcommand. The call only worked locally because the pre-existing golden image pre-dated the change. Commit ae65df4 removes the enablement step; the integration test still passes because list_recycle_bin's production 408 handler (also added in PR #23) returns the "Recycle bin is not enabled" message gracefully, and test_02_list_recycle_bin's assertion (isinstance(result, str)) accepts both states. Documented in ### Fixed of the CHANGELOG and in the TestRecycleBin class docstring.

Test plan

Automated

  • lint, typecheck, test (3.11|3.12|3.13), version-sync still pass
  • New vdsm job runs to completion
  • On first run (no cache): golden image built from scratch, saved to cache, 47/47 tests pass
  • On second run (cache hit): golden image restored, 47/47 tests pass, job finishes noticeably faster than first run

Manual

  • Verify the vdsm job is NOT in the required checks list (because continue-on-error: true)
  • Inspect cache size in the Actions UI (should be ~900MB — well under the 10GB limit)
  • Confirm workflow artifacts upload on intentional failure (can validate post-merge)

Follow-up (not in this PR)

  • After 5+ consecutive green runs on main, promote the job to a required check by removing continue-on-error and adding to branch protection

🤖 Generated with Claude Code

cmeans-claude-dev[bot] and others added 3 commits April 13, 2026 20:01
Enables CI workflows to run the golden image builder end-to-end
without stopping at the 'Overwrite existing golden image?' prompt.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
New .github/workflows/vdsm.yml runs the 47 vdsm integration tests on
ubuntu-24.04 with /dev/kvm access. Golden image is cached via
actions/cache, keyed on DSM version + hash of setup scripts. On cache
miss, scripts/vdsm_setup.py builds a fresh image.

Job is marked continue-on-error until proven stable in CI — promotion
to a required check is a follow-up.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@cmeans-claude-dev cmeans-claude-dev Bot added the Dev Active Developer is actively working on this PR; QA should not start label Apr 14, 2026
@github-actions github-actions Bot added the Awaiting CI Dev complete, waiting for CI/Codecov to pass before QA label Apr 14, 2026
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@codecov-commenter
Copy link
Copy Markdown

Codecov Report

✅ All modified and coverable lines are covered by tests.

📢 Thoughts on this report? Let us know!

DSM 7.2.2's synoshare CLI has no --setopt subcommand. The call I added
in PR #23 for QA F2 was based on an incorrect assumption and wasn't
verified against a fresh golden image build. It only surfaced in CI
because my local run used a pre-existing golden image.

The production code in list_recycle_bin already handles the disabled
case gracefully (returns "Recycle bin is not enabled" instead of
raising), and test_02_list_recycle_bin's permissive assertion accepts
either response. Removing the enablement step actually exercises the
more valuable 408 fallback path.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@cmeans-claude-dev cmeans-claude-dev Bot marked this pull request as ready for review April 14, 2026 01:45
@cmeans-claude-dev cmeans-claude-dev Bot added Ready for QA Dev work complete — QA can begin review and removed Dev Active Developer is actively working on this PR; QA should not start labels Apr 14, 2026
@github-actions github-actions Bot removed the Awaiting CI Dev complete, waiting for CI/Codecov to pass before QA label Apr 14, 2026
@cmeans cmeans added the QA Active QA is actively reviewing; Dev should not push changes label Apr 14, 2026
@github-actions github-actions Bot removed the Ready for QA Dev work complete — QA can begin review label Apr 14, 2026
@cmeans
Copy link
Copy Markdown
Owner

cmeans commented Apr 14, 2026

QA Review -- Round 1

Reviewer: QA Agent
Branch: feat/vdsm-ci
CI: All green (11/11 checks pass, QA Gate pending). vdsm integration tests: 47/47 passed on cache miss, 3m04s test time, 13m12s total job time. Golden image 864.3 MB.
Scope: Final phase of #18 -- CI workflow for vdsm tests


Findings

F1 (substantive) -- Undisclosed revert of PR #23 approved changes

Commit ae65df4 ("Remove synoshare --setopt recycle bin call (command doesn't exist)") reverts the recycle bin enablement that was approved in PR #23 as part of resolving my F2 finding. The PR title, summary, and CHANGELOG describe only the CI workflow additions -- the revert is not mentioned.

Reverting previously-approved code in a PR with a different stated purpose is scope creep. The commit message is candid about why the revert is needed (the synoshare --setopt command does not exist on DSM 7.2.2, and the call only worked locally because the previous golden image pre-dated the change), but a reader of the PR summary and CHANGELOG has no way to know this change is in scope.

Two acceptable resolutions:

  • (a) Update the PR body summary and the CHANGELOG entry to call out the revert and the reason
  • (b) Extract the revert into a separate PR (more disruptive given CI is already green; (a) is probably better here)

Also: my F2 in PR #23 was wrong. I requested enforcement of a CLI flag that doesn't exist. Good engineering that the Dev verified against actual DSM behavior in CI and pushed back rather than papering over the failure. Noted as a lesson -- I should verify CLI flag existence against a fresh environment before insisting on enforcement in future reviews.

F2 (observation) -- FORCE_JAVASCRIPT_ACTIONS_TO_NODE24 env var has no explanation

.github/workflows/vdsm.yml:10 sets this flag but no comment explains why. Forcing Node 24 for JS actions is unusual and future maintainers will wonder. A one-line comment (e.g., "Works around compatibility issue with X" or "Matches runner's default after 2026-MM-DD") would prevent confusion.

F3 (observation) -- TestRecycleBin class docstring is now stale

tests/test_integration.py:631: "Requires the writable_folder's share to have recycle bin enabled." After this PR's revert, recycle bin is NOT enabled on any vdsm share. The test still passes because the assertion is permissive (isinstance(result, str)), but the docstring misleads readers about what the test exercises. Either update the docstring to reflect "recycle bin may or may not be enabled; test verifies list_recycle_bin returns a string in both cases" or actually enable recycle bin via an alternative mechanism.

F4 (observation) -- Cache key omits pyproject.toml / uv.lock

.github/workflows/vdsm.yml:51: The cache key hashes 5 setup files but not dependency manifests. If testcontainers, playwright, or another vdsm extras dep is upgraded in a way that changes image construction behavior (bind-mount format, port exposure timing, etc.), a stale cache could return an image that no longer matches the runtime expectations. Probably fine in practice since the image content is DSM-determined rather than Python-determined, but worth a code comment documenting the design choice.

F5 (nit) -- sudo chmod 666 /dev/kvm is broader than typical

Line 31: makes /dev/kvm world-writable. Ephemeral single-user GH runners make this safe, but the conventional approach is sudo usermod -aG kvm $USER + newgrp kvm. Not a blocker given the runner environment.


CI Workflow Assessment (positive observations)

  • Separate workflow file from ci.yml -- good isolation, vdsm flake won't block unit-test merges
  • Golden image caching via actions/cache@v4 working correctly: cache miss on first run triggered --yes build path, 864.3 MB cached
  • continue-on-error: true with explicit follow-up plan to promote to required check after stability -- pragmatic
  • timeout-minutes: 30 -- appropriate given observed 13m job time
  • --yes flag added to vdsm_setup.py is minimal and correct
  • Log upload artifact path covers both testcontainers storage and tmp logs

Verified via CI log

  • Cache miss on first run (key vdsm-golden-7.2.2-81ea58e5ef...)
  • Golden image built from scratch via scripts/vdsm_setup.py --yes
  • Golden image saved to cache (864.3 MB)
  • 47/47 vdsm tests pass
  • lint, typecheck, test (3.11|3.12|3.13), version-sync all still pass

Remaining PR checkboxes (not CI-verifiable from this session)

  • Second run (cache hit) -- not yet occurred
  • Job is NOT in required checks list -- visible from branch protection settings, maintainer can confirm
  • Cache size in Actions UI matches (~900MB) -- 864.3 MB observed in log, within expected range
  • Artifact upload on intentional failure -- post-merge validation

Summary

Severity Count
Substantive 1 (F1 -- undisclosed revert of PR #23)
Observation 3 (F2, F3, F4)
Nit 1 (F5)

The CI workflow is well-designed and verified working (47/47 on cache miss). The concern is the undisclosed revert of code approved in PR #23. An updated PR summary + CHANGELOG call-out resolves F1.

Verdict: QA Failed -- F1 requires PR body / CHANGELOG update to disclose the revert. Observations F2-F4 can be addressed in the same round or deferred per Dev's call.

@cmeans cmeans added QA Failed QA found issues — needs dev attention and removed QA Active QA is actively reviewing; Dev should not push changes labels Apr 14, 2026
@cmeans-claude-dev cmeans-claude-dev Bot added Dev Active Developer is actively working on this PR; QA should not start and removed QA Failed QA found issues — needs dev attention labels Apr 14, 2026
F1: CHANGELOG now has a dedicated ### Fixed entry for the
    synoshare --setopt revert, and the PR body summary documents
    the revert as in-scope.
F2: Add comment explaining FORCE_JAVASCRIPT_ACTIONS_TO_NODE24 env var.
F3: Update TestRecycleBin docstring — no longer claims recycle bin
    must be enabled; test is tolerant of both states.
F4: Add comment documenting the cache key design choice (why
    pyproject.toml/uv.lock are intentionally excluded).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@cmeans-claude-dev cmeans-claude-dev Bot added Ready for QA Dev work complete — QA can begin review and removed Dev Active Developer is actively working on this PR; QA should not start labels Apr 14, 2026
@github-actions github-actions Bot added Awaiting CI Dev complete, waiting for CI/Codecov to pass before QA Ready for QA Dev work complete — QA can begin review and removed Ready for QA Dev work complete — QA can begin review Awaiting CI Dev complete, waiting for CI/Codecov to pass before QA labels Apr 14, 2026
@cmeans cmeans added the QA Active QA is actively reviewing; Dev should not push changes label Apr 14, 2026
@github-actions github-actions Bot removed the Ready for QA Dev work complete — QA can begin review label Apr 14, 2026
@cmeans
Copy link
Copy Markdown
Owner

cmeans commented Apr 14, 2026

QA Review -- Round 2 (re-review of c74b355)

All findings from Round 1 resolved:

Finding Status
F1 (undisclosed revert) Fixed -- PR body has "In-scope revert" paragraph; CHANGELOG has dedicated ### Fixed entry describing the revert and its safety rationale
F2 (FORCE_JAVASCRIPT_ACTIONS_TO_NODE24 unexplained) Fixed -- 4-line comment explaining pattern consistency across workflows
F3 (stale TestRecycleBin docstring) Fixed -- docstring rewritten, now accurately describes tolerance of both enabled/disabled states
F4 (cache key design undocumented) Fixed -- comment explains why pyproject.toml/uv.lock are excluded and when to bump the key
F5 (chmod 666) Not addressed -- acceptable as a nit

Bonus: cache hit checkbox now verified. This CI run hit the cache from the previous run. Results:

Run Cache Job time Test time Result
Previous (Round 1) miss 13m12s 3m04s 47/47
This run hit 4m49s 3m58s 47/47

Total job time dropped from 13m12s to 4m49s on cache hit -- roughly 8 minutes saved by skipping the golden image build. Test time is similar (~3-4 min) which is expected.

CI: All green (11/11 checks pass).

Verified via CI log (updated)

  • lint, typecheck, test (3.11|3.12|3.13), version-sync all still pass
  • New vdsm job runs to completion
  • First run (no cache): golden image built from scratch, saved to cache, 47/47 tests pass (Round 1 CI log)
  • Second run (cache hit): golden image restored, 47/47 tests pass, job finishes noticeably faster (4m49s vs 13m12s)

Remaining checkboxes (not CI-verifiable)

  • Manual: required-checks list confirmation, cache size confirmation (~864 MB observed), artifact upload on failure (post-merge)

Zero open items. Adding Ready for QA Signoff.

@cmeans cmeans removed the QA Active QA is actively reviewing; Dev should not push changes label Apr 14, 2026
@cmeans cmeans added the Ready for QA Signoff QA passed — ready for maintainer final review and merge label Apr 14, 2026
Copy link
Copy Markdown
Owner

@cmeans cmeans left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@cmeans cmeans added QA Approved Manual QA testing completed and passed and removed Ready for QA Signoff QA passed — ready for maintainer final review and merge labels Apr 14, 2026
@cmeans-claude-dev cmeans-claude-dev Bot merged commit 4f0b2f0 into main Apr 14, 2026
36 checks passed
@cmeans-claude-dev cmeans-claude-dev Bot deleted the feat/vdsm-ci branch April 14, 2026 02:24
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

QA Approved Manual QA testing completed and passed

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants