[AMD] Pin cache-dit==1.3.0 in rocm.Dockerfile + AMD CI install script #24924
The AMD CI base image ships cache-dit==1.1.8, but multimodal_gen requires
>=1.2.0 (uses cache_dit.parallelism.ParallelismBackend.AUTO, added in 1.2.0)
and pyproject.toml pins ==1.3.0. The previous 'pip install cache-dit' (no
--upgrade, no version) was a no-op against the preinstalled 1.1.8, which
caused every multimodal-gen-test-{1,2}-gpu-amd[-rocm720] job to fail with:
RuntimeError: cache-dit>=1.2.0 is required for --cache-dit-config.
AttributeError: ParallelismBackend.AUTO
Pin to 1.3.0 (matching python/pyproject.toml) and force --upgrade so pip
replaces the image's stale 1.1.8.
Co-authored-by: Bingxu Chen <Bingxu.Chen@amd.com>
Match the install-script change so newly built ROCm images already ship cache-dit==1.3.0, instead of relying on the CI install script to upgrade the stale 1.1.8 from the base image at every job start.
Co-authored-by: Bingxu Chen <Bingxu.Chen@amd.com>
@cursor review
Thanks for fixing the AMD CI cache-dit issue — the amd_ci_install_dependency.sh change (pin + --upgrade) is correct and matches python/pyproject.toml.
However, the docker/rocm.Dockerfile change is most likely a no-op at image build time, because the very next RUN block (lines 264–284) renames python/pyproject_other.toml → python/pyproject.toml and runs pip install -e "python[srt_hip,diffusion_hip]". The diffusion_hip extra in python/pyproject_other.toml still pins cache-dit==1.1.8:
diffusion_hip = [
"sglang[diffusion_common]",
"peft>=0.18.0,<0.19.0", # Pin to <0.19.0 due to torchao incompatibility
"st_attn==0.0.7",
"vsa==0.0.4",
"runai_model_streamer>=0.15.5",
"cache-dit==1.1.8",
]
So pip will downgrade cache-dit from 1.3.0 back to 1.1.8 right after your new pip install cache-dit==1.3.0 line. That's almost certainly why the AMD CI base image ships 1.1.8 today, and why the install-script --upgrade workaround is needed in the first place.
If the intent is for ROCm images to ship cache-dit==1.3.0 (matching python/pyproject.toml), the real fix is to bump the pin in python/pyproject_other.toml (diffusion_hip, and ideally also diffusion_musa) to 1.3.0. The same stale pin also lives in 3rdparty/amd/wheel/sglang/pyproject.toml and should be updated for consistency. Once those are bumped, the Dockerfile line you added becomes redundant.
Recommendation:
- Keep the amd_ci_install_dependency.sh change as-is (good defensive runtime upgrade).
- Either (a) bump cache-dit==1.1.8 → 1.3.0 in python/pyproject_other.toml (diffusion_hip) and 3rdparty/amd/wheel/sglang/pyproject.toml (diffusion_hip) and drop the Dockerfile line, or (b) drop the Dockerfile change entirely since it's overwritten downstream.
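The pin drift this review describes could be caught mechanically. The following is a minimal sketch, not part of the PR: the helper names `find_pins` and `find_drift` are hypothetical, and the sample file contents are illustrative stand-ins rather than the real pyproject files.

```python
import re

# Matches exact pins such as "cache-dit==1.1.8" inside a dependency list.
PIN_RE = re.compile(r'["\']cache-dit==([0-9][^"\']*)["\']')


def find_pins(toml_text: str) -> list[str]:
    """Return every exact cache-dit pin found in the given pyproject text."""
    return PIN_RE.findall(toml_text)


def find_drift(files: dict[str, str], expected: str) -> dict[str, list[str]]:
    """Map each file name to the pins in it that differ from `expected`."""
    drift = {}
    for name, text in files.items():
        stale = [pin for pin in find_pins(text) if pin != expected]
        if stale:
            drift[name] = stale
    return drift


# Illustrative contents only (assumption): these are not the real files.
sample_files = {
    "python/pyproject.toml": 'diffusion = ["cache-dit==1.3.0"]',
    "python/pyproject_other.toml": 'diffusion_hip = ["cache-dit==1.1.8"]',
}
drift = find_drift(sample_files, expected="1.3.0")
print(drift)
```

A text scan is used instead of a TOML parser so the sketch has no version or dependency requirements; a real check would more likely parse the files and compare the extras directly.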
Sent by Cursor Automation: Bingxu’s Automation
  && pip install torchao==0.9.0 \
- && pip install pybind11
+ && pip install pybind11 \
+ && pip install cache-dit==1.3.0
This install is overwritten a few lines below. The next RUN block does:
rm -rf python/pyproject.toml && mv python/pyproject_other.toml python/pyproject.toml
...
python -m pip --no-cache-dir install -e "python[srt_hip,diffusion_hip]"
and diffusion_hip in python/pyproject_other.toml still pins cache-dit==1.1.8, so pip will downgrade cache-dit from 1.3.0 back to 1.1.8 during the sglang editable install. The net effect is that the built image still ships cache-dit==1.1.8, which is exactly why the amd_ci_install_dependency.sh upgrade is required at job start.
To make this line actually do something (or to render it unnecessary), bump cache-dit==1.1.8 → 1.3.0 in python/pyproject_other.toml (diffusion_hip extra, and likely also diffusion_musa) and in 3rdparty/amd/wheel/sglang/pyproject.toml. Otherwise, this line can be dropped from the PR.
  # Install cache-dit for qwen_image_t2i_cache_dit_enabled test (added in PR 16204)
- docker exec ci_sglang pip install --cache-dir=/sgl-data/pip-cache cache-dit || echo "cache-dit installation failed"
+ docker exec ci_sglang pip install --cache-dir=/sgl-data/pip-cache --upgrade 'cache-dit==1.3.0' || echo "cache-dit installation failed"
This change LGTM — pinning to 1.3.0 matches python/pyproject.toml, and --upgrade is needed because the AMD CI base image ships cache-dit==1.1.8 (since python/pyproject_other.toml and 3rdparty/amd/wheel/sglang/pyproject.toml still pin 1.1.8 for diffusion_hip). Once those pyproject pins are bumped, this --upgrade step will become a no-op but won't hurt to keep as a safety net.
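The install behaviour discussed in these review comments can be summarised in a small decision function. This is a deliberately simplified model under stated assumptions, not pip's actual resolver: the function name `pip_would_change` and the `latest` parameter are hypothetical, and only a single top-level requirement with an optional exact pin is considered.

```python
from typing import Optional


def pip_would_change(installed: Optional[str],
                     pinned: Optional[str],
                     upgrade: bool,
                     latest: str) -> bool:
    """Simplified model: would `pip install` replace the installed version?"""
    if installed is None:
        return True                 # nothing installed yet, so pip installs
    if pinned is not None:
        return installed != pinned  # an exact pin is enforced, up or down
    if upgrade:
        return installed != latest  # unpinned with --upgrade: move to newest
    return False                    # unpinned, no --upgrade: already satisfied


# Old CI script: `pip install cache-dit` against the preinstalled 1.1.8.
print(pip_would_change("1.1.8", None, False, "1.3.0"))
# New CI script: `pip install --upgrade 'cache-dit==1.3.0'`.
print(pip_would_change("1.1.8", "1.3.0", True, "1.3.0"))
# Dockerfile case: the diffusion_hip extra pinning 1.1.8 after 1.3.0 was installed.
print(pip_would_change("1.3.0", "1.1.8", False, "1.3.0"))
```

In this model the three calls cover the no-op in the old script, the replacement done by the new script, and the downgrade triggered by the stale diffusion_hip pin.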
This reverts the previous Dockerfile change after review pointed out it was a no-op: the next RUN block (rocm.Dockerfile:281) installs 'python[srt_hip,diffusion_hip]', and the diffusion_hip extra in python/pyproject_other.toml pins cache-dit==1.1.8, so pip downgrades the 1.3.0 we just installed. The real fix is to bump the pin in the toml files (next two commits).
Co-authored-by: Bingxu Chen <Bingxu.Chen@amd.com>
…_hip
This is the real source of the cache-dit==1.1.8 in AMD CI images: the
ROCm Dockerfile renames pyproject_other.toml -> pyproject.toml and
installs the diffusion_hip extra, which used to pin 1.1.8. Bumping to
1.3.0 aligns with python/pyproject.toml and fixes R10
(multimodal-gen-test-{1,2}-gpu-amd[-rocm720]) at the source.
Co-authored-by: Bingxu Chen <Bingxu.Chen@amd.com>
Same fix in the amd-sglang wheel pyproject so wheel-built images stay in sync with python/pyproject.toml and python/pyproject_other.toml.
Co-authored-by: Bingxu Chen <Bingxu.Chen@amd.com>
Motivation
Fix R10 in sglang-ci-bot daily report bingxche/sglang-ci-bot#67: every multimodal-gen-test-{1,2}-gpu-amd[-rocm720] job has been red for 2+ days (~4 jobs/run) with two paired symptoms that share one root cause:
- qwen_image_t2i_cache_dit_scm_config_diffusers_1gpu: RuntimeError: cache-dit>=1.2.0 is required for --cache-dit-config.
- wan2_1_t2v parametrization: AttributeError: ParallelismBackend.AUTO
Reproduced in the most recent scheduled run on main (run 25644554985 / job 75270857375, 2026-05-11 00:48 UTC); the first failing log line is verbatim the RuntimeError above.
Why now (2 days, not 4 months)
The hard cache-dit>=1.2.0 runtime check was added back in #16662 (2026-01-22), and python/pyproject.toml was bumped to 1.3.0 in #20361 (2026-03-17). AMD CI stayed green that whole time because no AMD test actually exercised --cache-dit-config. That changed when #19213 "[diffusion] CI: add cache-dit CI tests" landed on 2026-05-10 05:38 UTC (~49 h before this PR), which added the new qwen_image_t2i_cache_dit_scm_config_diffusers_1gpu case that does pass --cache-dit-config. From that moment on, the version skew between the AMD ROCm image (cache-dit==1.1.8) and python/pyproject.toml (==1.3.0) became a hard failure.
Root cause (verified against the tree)
The cache-dit==1.1.8 shipped in AMD CI images is not inherited from a base image; it is actively installed by the ROCm Dockerfile itself. The chain:
- docker/rocm.Dockerfile:279-281 does mv python/pyproject_other.toml python/pyproject.toml && pip install -e "python[srt_hip,diffusion_hip]".
- python/pyproject_other.toml:106 (the diffusion_hip extra) pinned cache-dit==1.1.8.
- Because of the rename, the image build never resolves against the original python/pyproject.toml (which pins 1.3.0).
- python/sglang/multimodal_gen/runtime/cache/cache_dit_integration.py:34 does from cache_dit.parallelism import ParallelismBackend, ParallelismConfig and uses ParallelismBackend.AUTO, both added in cache-dit 1.2.0+. Hence the runtime failure.
- scripts/ci/amd/amd_ci_install_dependency.sh:172 ran pip install cache-dit (no --upgrade, no version), which is a no-op against the preinstalled 1.1.8, so the script could not paper over the toml mismatch either.
A previous revision of this PR added pip install cache-dit==1.3.0 directly to the Dockerfile, but code review by Cursor Automation correctly flagged that change as a no-op: the very next RUN block (line 281) re-installs diffusion_hip, which downgrades 1.3.0 back to 1.1.8. That commit has been reverted; this PR now fixes the source of the pin instead.
Why NVIDIA was unaffected
NVIDIA CI installs the diffusion extra via pip install -e "python[dev,runai,tracing,diffusion]" (scripts/ci/cuda/ci_install_dependency.sh:227-232), which resolves against python/pyproject.toml (already at 1.3.0). The AMD path resolves against python/pyproject_other.toml, which had drifted; that's the bypass this PR closes.
Modifications
Three small surgical edits, including a preserved bridge fix in the install script:
- python/pyproject_other.toml: bump diffusion_hip from cache-dit==1.1.8 to cache-dit==1.3.0. This is the actual source of truth that the ROCm Dockerfile resolves at image build time.
- 3rdparty/amd/wheel/sglang/pyproject.toml: bump diffusion_hip from cache-dit==1.1.8 to cache-dit==1.3.0, so the amd-sglang wheel stays in sync.
- scripts/ci/amd/amd_ci_install_dependency.sh: change pip install cache-dit to pip install --upgrade 'cache-dit==1.3.0'. This is a bridge fix: any AMD CI runner still pulling an image built before the toml bump lands will be force-upgraded at job start. Once new images are built (≤24 h via the nightly image release workflows), this line becomes a no-op (pip sees 1.3.0 already installed and exits).
Out of scope but worth noting: diffusion_musa in both python/pyproject_other.toml:131 and 3rdparty/amd/wheel/sglang/pyproject.toml:146 still pins cache-dit==1.1.8. The MUSA CI is a separate pipeline (pr-test-musa.yml) and not in this hotfix's blast radius. Tracked as a follow-up.
5 commits, 5 net lines changed across 3 files (the previous revision's Dockerfile edit was reverted in commit 723fdd3f).
Validation criteria
https://github.com/sgl-project/sglang/actions/runs/25648379091

Test passed.
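A quick local check of the same condition can be sketched with the standard library. This is an assumption-laden illustration, not the runtime check that sglang itself performs: the helper names are hypothetical, and version parsing only looks at the leading dotted release numbers (pre, post, and dev tags are ignored).

```python
import re
from importlib import metadata
from typing import Optional


def parse_release(version: str) -> tuple[int, ...]:
    """Parse the leading dotted release of a version string, e.g. '1.3.0'."""
    match = re.match(r"\d+(?:\.\d+)*", version)
    if match is None:
        raise ValueError(f"unrecognised version: {version!r}")
    return tuple(int(part) for part in match.group().split("."))


def meets_minimum(installed: str, minimum: str) -> bool:
    """Compare releases after zero-padding so '1.2' equals '1.2.0'."""
    a, b = parse_release(installed), parse_release(minimum)
    width = max(len(a), len(b))
    a += (0,) * (width - len(a))
    b += (0,) * (width - len(b))
    return a >= b


def installed_version(dist_name: str) -> Optional[str]:
    """Return the installed version of a distribution, or None if absent."""
    try:
        return metadata.version(dist_name)
    except metadata.PackageNotFoundError:
        return None


found = installed_version("cache-dit")
if found is None:
    print("cache-dit is not installed")
elif meets_minimum(found, "1.2.0"):
    print(f"cache-dit {found} satisfies >=1.2.0")
else:
    print(f"cache-dit {found} is too old; >=1.2.0 is required")
```

Run inside the CI container, a script like this would report whether the image's cache-dit clears the 1.2.0 floor before the multimodal tests start.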
Accuracy Tests
N/A — install-only change, no model code touched.
Speed Tests and Profiling
N/A — install-only change.
Checklist
python/pyproject.toml (single source of truth) across pyproject_other.toml and the amd-sglang wheel pyproject.
Follow-up (out of scope here)
A cleaner architectural fix would be to have scripts/ci/amd/amd_ci_install_dependency.sh install via pip install -e "python[diffusion]" (the way NVIDIA CI does), removing the hand-rolled cache-dit / accelerate / pytest / huggingface_hub lines entirely. That would make python/pyproject.toml the single source of truth for AMD too. Also, diffusion_musa should be bumped to 1.3.0 for symmetry. Tracked separately, not part of this hotfix.
Review and Merge Process
/rerun-failed-ci or /tag-and-rerun-ci).