[AMD] Pin cache-dit==1.3.0 in rocm.Dockerfile + AMD CI install script #24924

Merged: yctseng0211 merged 6 commits into main from bingxche/fix-amd-ci-cache-dit-upgrade-bd1b on May 11, 2026

@bingxche bingxche commented May 11, 2026

Motivation

Fix R10 in sglang-ci-bot daily report bingxche/sglang-ci-bot#67: every multimodal-gen-test-{1,2}-gpu-amd[-rocm720] job has been red for 2+ days (~4 jobs/run) with two paired symptoms that share one root cause:

  • 1-GPU (qwen_image_t2i_cache_dit_scm_config_diffusers_1gpu):
    RuntimeError: cache-dit>=1.2.0 is required for --cache-dit-config. Please upgrade cache-dit.
    
  • 2-GPU (wan2_1_t2v parametrization):
    AttributeError: ParallelismBackend.AUTO
    

Reproduced in the most recent scheduled run on main (run 25644554985 / job 75270857375, 2026-05-11 00:48 UTC) — first failing log line is verbatim the RuntimeError above.

Why now (2 days, not 4 months)

The hard cache-dit>=1.2.0 runtime check was added back in #16662 (2026-01-22), and python/pyproject.toml was bumped to 1.3.0 in #20361 (2026-03-17). AMD CI stayed green that whole time because no AMD test actually exercised --cache-dit-config. That changed when #19213 "[diffusion] CI: add cache-dit CI tests" landed on 2026-05-10 05:38 UTC (~49 h before this PR), which added the new qwen_image_t2i_cache_dit_scm_config_diffusers_1gpu case that does pass --cache-dit-config. From that moment on, the version skew between the AMD ROCm image (cache-dit==1.1.8) and python/pyproject.toml (==1.3.0) became a hard failure.

Root cause (verified against the tree)

The cache-dit==1.1.8 shipped in AMD CI images is not inherited from a base image — it is actively installed by the ROCm Dockerfile itself. The chain:

  1. docker/rocm.Dockerfile:279-281 does mv python/pyproject_other.toml python/pyproject.toml && pip install -e "python[srt_hip,diffusion_hip]".
  2. python/pyproject_other.toml:106 (the diffusion_hip extra) pinned cache-dit==1.1.8.
  3. So every newly built ROCm image gets 1.1.8 baked in, regardless of python/pyproject.toml (which pins 1.3.0).
  4. python/sglang/multimodal_gen/runtime/cache/cache_dit_integration.py:34 does from cache_dit.parallelism import ParallelismBackend, ParallelismConfig and uses ParallelismBackend.AUTO — both added in cache-dit 1.2.0+. Hence the runtime failure.
  5. scripts/ci/amd/amd_ci_install_dependency.sh:172 ran pip install cache-dit (no --upgrade, no version), which is a no-op against the preinstalled 1.1.8 — so the script could not paper over the toml mismatch either.
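
To make the failure in step 4 concrete, here is a minimal sketch of a minimum-version guard. The helper names (`parse_version`, `meets_minimum`, `check_cache_dit`) are hypothetical and are not the actual check in sglang; only the error text is copied from the CI log above. It assumes plain dotted numeric versions and would need a real version parser for anything else.

```python
from __future__ import annotations

import re


def parse_version(version: str) -> tuple[int, ...]:
    """Turn '1.3.0' into (1, 3, 0). Only plain dotted integers are supported."""
    if not re.fullmatch(r"\d+(\.\d+)*", version):
        raise ValueError(f"unsupported version string: {version!r}")
    return tuple(int(part) for part in version.split("."))


def meets_minimum(installed: str, minimum: str) -> bool:
    """Return True if `installed` is at least `minimum`."""
    a, b = parse_version(installed), parse_version(minimum)
    # Pad the shorter tuple with zeros so that '1.2' compares equal to '1.2.0'.
    width = max(len(a), len(b))
    a += (0,) * (width - len(a))
    b += (0,) * (width - len(b))
    return a >= b


def check_cache_dit(installed: str, minimum: str = "1.2.0") -> None:
    """Hypothetical guard: raise if the installed cache-dit is too old."""
    if not meets_minimum(installed, minimum):
        raise RuntimeError(
            f"cache-dit>={minimum} is required for --cache-dit-config. "
            "Please upgrade cache-dit."
        )


check_cache_dit("1.3.0")      # passes silently
# check_cache_dit("1.1.8")    # would raise RuntimeError, as seen on the AMD image
```

With the image's 1.1.8 the guard raises, while the 1.3.0 pinned in python/pyproject.toml passes.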

A previous revision of this PR added pip install cache-dit==1.3.0 directly to the Dockerfile, but code review by Cursor Automation correctly flagged that change as a no-op: the very next RUN block (line 281) re-installs diffusion_hip, which downgrades 1.3.0 back to 1.1.8. That commit has been reverted; this PR now fixes the source of the pin instead.
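
The ordering problem can be sketched with a toy model of successive `pip install` steps. This is an illustration only: pip's real resolver is far more involved, and `apply_install`, `final_version`, and the `latest` placeholder are hypothetical names introduced here. The model assumes that an exact `==` pin always wins (even when that means a downgrade) and that a bare requirement without `--upgrade` is already satisfied by whatever is installed.

```python
from __future__ import annotations

from typing import Iterable, Optional, Tuple

# One step is (pin, upgrade); pin=None models a bare `pip install cache-dit`.
Step = Tuple[Optional[str], bool]


def apply_install(
    installed: Optional[str], pin: Optional[str], upgrade: bool, latest: str
) -> str:
    """Return the version present after one modeled install step."""
    if pin is not None:
        # An exact pin replaces whatever is installed, including downgrades.
        return pin
    if installed is not None and not upgrade:
        # A bare requirement is already satisfied: pip leaves it alone.
        return installed
    # Nothing installed, or --upgrade without a pin: newest release is fetched.
    return latest


def final_version(
    steps: Iterable[Step], installed: Optional[str] = None, latest: str = "9.9.9"
) -> Optional[str]:
    """Apply the steps in order and return the version left at the end."""
    for pin, upgrade in steps:
        installed = apply_install(installed, pin, upgrade, latest)
    return installed


# Reverted Dockerfile edit: 1.3.0 first, then the diffusion_hip extra pins 1.1.8.
print(final_version([("1.3.0", False), ("1.1.8", False)]))  # 1.1.8
# Old install script: bare install on top of the preinstalled 1.1.8.
print(final_version([(None, False)], installed="1.1.8"))    # 1.1.8
# Bridge fix: pinned install with --upgrade.
print(final_version([("1.3.0", True)], installed="1.1.8"))  # 1.3.0
```

Only the third scenario ends on 1.3.0, which is why the pin had to move into the toml files rather than into an earlier Dockerfile line.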

Why NVIDIA was unaffected

NVIDIA CI installs the diffusion extra via pip install -e "python[dev,runai,tracing,diffusion]" (scripts/ci/cuda/ci_install_dependency.sh:227-232), which resolves against python/pyproject.toml (already at 1.3.0). The AMD path resolves against python/pyproject_other.toml, which had drifted — that's the bypass this PR closes.

Modifications

Three small surgical edits: two pin bumps in the toml files, plus the bridge fix in the install script preserved from the previous revision:

  1. python/pyproject_other.toml — bump diffusion_hip from cache-dit==1.1.8 to cache-dit==1.3.0. This is the actual source of truth that the ROCm Dockerfile resolves at image build time.
  2. 3rdparty/amd/wheel/sglang/pyproject.toml — bump diffusion_hip from cache-dit==1.1.8 to cache-dit==1.3.0, so the amd-sglang wheel stays in sync.
  3. scripts/ci/amd/amd_ci_install_dependency.sh — change pip install cache-dit to pip install --upgrade 'cache-dit==1.3.0'. This is a bridge fix: any AMD CI runner still pulling an image built before the toml bump lands will be force-upgraded at job start. Once new images are built (≤24 h via the nightly image release workflows), this line becomes a no-op (pip sees 1.3.0 already installed and exits).

Out of scope but worth noting: diffusion_musa in both python/pyproject_other.toml:131 and 3rdparty/amd/wheel/sglang/pyproject.toml:146 still pins cache-dit==1.1.8. The MUSA CI is a separate pipeline (pr-test-musa.yml) and not in this hotfix's blast radius. Tracked as a follow-up.
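
A drift check along these lines could catch the next mismatch before it reaches CI. This is a rough sketch, not part of this PR: `find_pins`, `stale_pins`, and `SAMPLE_FILES` are hypothetical, and the sample strings are abbreviated stand-ins for the real toml files, not their actual contents. It scans text with a regex instead of parsing TOML, so it would also match pins inside comments.

```python
from __future__ import annotations

import re

# Matches exact pins such as "cache-dit==1.3.0" and captures the version.
PIN_PATTERN = re.compile(r"cache-dit==([0-9][0-9A-Za-z.]*)")


def find_pins(text: str) -> list[str]:
    """Return every exact cache-dit version pinned in `text`, in order."""
    return PIN_PATTERN.findall(text)


def stale_pins(files: dict[str, str], expected: str) -> dict[str, list[str]]:
    """Map each file name to the pinned versions that differ from `expected`."""
    stale: dict[str, list[str]] = {}
    for name, text in files.items():
        wrong = [version for version in find_pins(text) if version != expected]
        if wrong:
            stale[name] = wrong
    return stale


# Abbreviated, made-up stand-ins for the state after this PR.
SAMPLE_FILES = {
    "python/pyproject.toml": 'diffusion = ["cache-dit==1.3.0"]',
    "python/pyproject_other.toml": (
        'diffusion_hip = ["cache-dit==1.3.0"]\n'
        'diffusion_musa = ["cache-dit==1.1.8"]\n'
    ),
}

print(stale_pins(SAMPLE_FILES, expected="1.3.0"))
# {'python/pyproject_other.toml': ['1.1.8']}
```

On the sample input the only leftover is the diffusion_musa pin, matching the follow-up noted above.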

5 commits, 5 net lines changed across 3 files (the previous revision's Dockerfile edit was reverted in commit 723fdd3f).

Validation criteria

https://github.com/sgl-project/sglang/actions/runs/25648379091
Test passed.

Accuracy Tests

N/A — install-only change, no model code touched.

Speed Tests and Profiling

N/A — install-only change.

Checklist

  • Surgical fix; no production code changed.
  • Version pin matches python/pyproject.toml (single source of truth) across pyproject_other.toml and the amd-sglang wheel pyproject.
  • Reverted the no-op Dockerfile edit after review feedback.
  • Bridge fix in install script keeps existing images green until next image rebuild.
  • Validated end-to-end (see Validation criteria above).

Follow-up (out of scope here)

A cleaner architectural fix would be to have scripts/ci/amd/amd_ci_install_dependency.sh install via pip install -e "python[diffusion]" (the way NVIDIA CI does), removing the hand-rolled cache-dit / accelerate / pytest / huggingface_hub lines entirely. That would make python/pyproject.toml the single source of truth for AMD too. Also, diffusion_musa should be bumped to 1.3.0 for symmetry. Tracked separately, not part of this hotfix.

Review and Merge Process

  1. Ping Merge Oncalls.
  2. CODEOWNERS approval (AMD CI / Docker).
  3. Trigger AMD CI (/rerun-failed-ci or /tag-and-rerun-ci).
  4. Merge once the listed jobs are green.

cursoragent and others added 3 commits May 11, 2026 02:49
The AMD CI base image ships cache-dit==1.1.8, but multimodal_gen requires
>=1.2.0 (uses cache_dit.parallelism.ParallelismBackend.AUTO, added in 1.2.0)
and pyproject.toml pins ==1.3.0. The previous 'pip install cache-dit' (no
--upgrade, no version) was a no-op against the preinstalled 1.1.8, which
caused every multimodal-gen-test-{1,2}-gpu-amd[-rocm720] job to fail with:

  RuntimeError: cache-dit>=1.2.0 is required for --cache-dit-config.
  AttributeError: ParallelismBackend.AUTO

Pin to 1.3.0 (matching python/pyproject.toml) and force --upgrade so pip
replaces the image's stale 1.1.8.

Co-authored-by: Bingxu Chen <Bingxu.Chen@amd.com>
Co-authored-by: Bingxu Chen <Bingxu.Chen@amd.com>
Match the install-script change so newly built ROCm images already ship
cache-dit==1.3.0, instead of relying on the CI install script to upgrade
the stale 1.1.8 from the base image at every job start.

Co-authored-by: Bingxu Chen <Bingxu.Chen@amd.com>

@github-actions github-actions Bot added the amd label May 11, 2026
@bingxche

@cursor review

@cursor cursor Bot left a comment

Thanks for fixing the AMD CI cache-dit issue — the amd_ci_install_dependency.sh change (pin + --upgrade) is correct and matches python/pyproject.toml.

However, the docker/rocm.Dockerfile change is most likely a no-op at image build time, because the very next RUN block (lines 264–284) renames python/pyproject_other.toml → python/pyproject.toml and runs pip install -e "python[srt_hip,diffusion_hip]". The diffusion_hip extra in python/pyproject_other.toml still pins cache-dit==1.1.8:

diffusion_hip = [
  "sglang[diffusion_common]",
  "peft>=0.18.0,<0.19.0", # Pin to <0.19.0 due to torchao incompatibility
  "st_attn==0.0.7",
  "vsa==0.0.4",
  "runai_model_streamer>=0.15.5",
  "cache-dit==1.1.8",
]

So pip will downgrade cache-dit from 1.3.0 back to 1.1.8 right after your new pip install cache-dit==1.3.0 line. That's almost certainly why the AMD CI base image ships 1.1.8 today, and why the install-script --upgrade workaround is needed in the first place.

If the intent is for ROCm images to ship cache-dit==1.3.0 (matching python/pyproject.toml), the real fix is to bump the pin in python/pyproject_other.toml (diffusion_hip, and ideally also diffusion_musa) to 1.3.0. The same stale pin also lives in 3rdparty/amd/wheel/sglang/pyproject.toml and should be updated for consistency. Once those are bumped, the Dockerfile line you added becomes redundant.

Recommendation:

  • Keep the amd_ci_install_dependency.sh change as-is (good defensive runtime upgrade).
  • Either (a) bump cache-dit==1.1.8 → 1.3.0 in python/pyproject_other.toml (diffusion_hip) and 3rdparty/amd/wheel/sglang/pyproject.toml (diffusion_hip) and drop the Dockerfile line, or (b) drop the Dockerfile change entirely since it's overwritten downstream.

Sent by Cursor Automation: Bingxu’s Automation

Comment thread docker/rocm.Dockerfile Outdated
  && pip install torchao==0.9.0 \
- && pip install pybind11
+ && pip install pybind11 \
+ && pip install cache-dit==1.3.0

This install is overwritten a few lines below. The next RUN block does:

rm -rf python/pyproject.toml && mv python/pyproject_other.toml python/pyproject.toml
...
python -m pip --no-cache-dir install -e "python[srt_hip,diffusion_hip]"

and diffusion_hip in python/pyproject_other.toml still pins cache-dit==1.1.8, so pip will downgrade cache-dit from 1.3.0 back to 1.1.8 during the sglang editable install. The net effect is that the built image still ships cache-dit==1.1.8, which is exactly why the amd_ci_install_dependency.sh upgrade is required at job start.

To make this line actually do something (or to render it unnecessary), bump cache-dit==1.1.8 → 1.3.0 in python/pyproject_other.toml (diffusion_hip extra, and likely also diffusion_musa) and in 3rdparty/amd/wheel/sglang/pyproject.toml. Otherwise, this line can be dropped from the PR.


  # Install cache-dit for qwen_image_t2i_cache_dit_enabled test (added in PR 16204)
- docker exec ci_sglang pip install --cache-dir=/sgl-data/pip-cache cache-dit || echo "cache-dit installation failed"
+ docker exec ci_sglang pip install --cache-dir=/sgl-data/pip-cache --upgrade 'cache-dit==1.3.0' || echo "cache-dit installation failed"

This change LGTM — pinning to 1.3.0 matches python/pyproject.toml, and --upgrade is needed because the AMD CI base image ships cache-dit==1.1.8 (since python/pyproject_other.toml and 3rdparty/amd/wheel/sglang/pyproject.toml still pin 1.1.8 for diffusion_hip). Once those pyproject pins are bumped, this --upgrade step will become a no-op but won't hurt to keep as a safety net.

cursoragent and others added 3 commits May 11, 2026 05:09
This revert reverses the previous Dockerfile change after review pointed
out it was a no-op: the next RUN block (rocm.Dockerfile:281) installs
'python[srt_hip,diffusion_hip]', and the diffusion_hip extra in
python/pyproject_other.toml pins cache-dit==1.1.8, so pip downgrades the
1.3.0 we just installed. The real fix is to bump the pin in the toml
files (next two commits).

Co-authored-by: Bingxu Chen <Bingxu.Chen@amd.com>
…_hip

This is the real source of the cache-dit==1.1.8 in AMD CI images: the
ROCm Dockerfile renames pyproject_other.toml -> pyproject.toml and
installs the diffusion_hip extra, which used to pin 1.1.8. Bumping to
1.3.0 aligns with python/pyproject.toml and fixes R10
(multimodal-gen-test-{1,2}-gpu-amd[-rocm720]) at the source.

Co-authored-by: Bingxu Chen <Bingxu.Chen@amd.com>
Same fix in the amd-sglang wheel pyproject so wheel-built images stay
in sync with python/pyproject.toml and python/pyproject_other.toml.

Co-authored-by: Bingxu Chen <Bingxu.Chen@amd.com>
@github-actions github-actions Bot added the dependencies Pull requests that update a dependency file label May 11, 2026
@bingxche bingxche marked this pull request as ready for review May 11, 2026 14:08
@yctseng0211 yctseng0211 merged commit aeb8fef into main May 11, 2026
70 of 74 checks passed
@yctseng0211 yctseng0211 deleted the bingxche/fix-amd-ci-cache-dit-upgrade-bd1b branch May 11, 2026 14:32