[AMD] Pin cache-dit==1.3.0 in rocm.Dockerfile + AMD CI install script by bingxche · Pull Request #24924 · sgl-project/sglang

bingxche · 2026-05-11T03:05:17Z

Motivation

Fix R10 in sglang-ci-bot daily report bingxche/sglang-ci-bot#67: every multimodal-gen-test-{1,2}-gpu-amd[-rocm720] job has been red for 2+ days (~4 jobs/run) with two paired symptoms that share one root cause:

1-GPU (qwen_image_t2i_cache_dit_scm_config_diffusers_1gpu):

RuntimeError: cache-dit>=1.2.0 is required for --cache-dit-config. Please upgrade cache-dit.

2-GPU (wan2_1_t2v parametrization):

AttributeError: ParallelismBackend.AUTO

Reproduced in the most recent scheduled run on main (run 25644554985 / job 75270857375, 2026-05-11 00:48 UTC) — first failing log line is verbatim the RuntimeError above.

Why now (2 days, not 4 months)

The hard cache-dit>=1.2.0 runtime check was added back in #16662 (2026-01-22), and python/pyproject.toml was bumped to 1.3.0 in #20361 (2026-03-17). AMD CI stayed green that whole time because no AMD test actually exercised --cache-dit-config. That changed when #19213 "[diffusion] CI: add cache-dit CI tests" landed on 2026-05-10 05:38 UTC (~49 h before this PR), which added the new qwen_image_t2i_cache_dit_scm_config_diffusers_1gpu case that does pass --cache-dit-config. From that moment on, the version skew between the AMD ROCm image (cache-dit==1.1.8) and python/pyproject.toml (==1.3.0) became a hard failure.
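
For readers unfamiliar with this kind of guard, a minimal sketch of a minimum-version check follows. It is not the code from #16662: `parse_version` and `require_cache_dit` are hypothetical names, the version parsing is deliberately simplified, and only the error message is taken from the CI log above.

```python
import re

# Minimum version named in the error message quoted in the CI log.
MIN_CACHE_DIT = (1, 2, 0)


def parse_version(text: str) -> tuple:
    """Parse a plain 'X.Y.Z' release string into a 3-tuple of ints.

    Simplified on purpose: pre-release and local suffixes are ignored,
    and short versions such as '1.2' are padded with zeros.
    """
    match = re.match(r"(\d+(?:\.\d+)*)", text)
    if match is None:
        raise ValueError(f"unrecognized version string: {text!r}")
    parts = tuple(int(part) for part in match.group(1).split("."))
    return parts + (0,) * (3 - len(parts))


def require_cache_dit(installed: str) -> None:
    """Hypothetical guard: raise if the installed cache-dit is too old."""
    if parse_version(installed) < MIN_CACHE_DIT:
        raise RuntimeError(
            "cache-dit>=1.2.0 is required for --cache-dit-config. "
            "Please upgrade cache-dit."
        )


if __name__ == "__main__":
    require_cache_dit("1.3.0")  # passes silently
    try:
        require_cache_dit("1.1.8")  # the version baked into the ROCm image
    except RuntimeError as err:
        print(err)
```

A guard like this only fires when the guarded code path runs, which is consistent with the observation above that AMD CI stayed green until a test passed --cache-dit-config.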

Root cause (verified against the tree)

The cache-dit==1.1.8 shipped in AMD CI images is not inherited from a base image — it is actively installed by the ROCm Dockerfile itself. The chain:

docker/rocm.Dockerfile:279-281 does mv python/pyproject_other.toml python/pyproject.toml && pip install -e "python[srt_hip,diffusion_hip]".
python/pyproject_other.toml:106 (the diffusion_hip extra) pinned cache-dit==1.1.8.
So every newly built ROCm image gets 1.1.8 baked in, regardless of python/pyproject.toml (which pins 1.3.0).
python/sglang/multimodal_gen/runtime/cache/cache_dit_integration.py:34 does from cache_dit.parallelism import ParallelismBackend, ParallelismConfig and uses ParallelismBackend.AUTO — both added in cache-dit 1.2.0+. Hence the runtime failure.
scripts/ci/amd/amd_ci_install_dependency.sh:172 ran pip install cache-dit (no --upgrade, no version), which is a no-op against the preinstalled 1.1.8 — so the script could not paper over the toml mismatch either.

A previous revision of this PR added pip install cache-dit==1.3.0 directly to the Dockerfile, but code review by Cursor Automation correctly flagged that change as a no-op: the very next RUN block (line 281) re-installs diffusion_hip, which downgrades 1.3.0 back to 1.1.8. That commit has been reverted; this PR now fixes the source of the pin instead.
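
The install-order effect described above can be illustrated with a toy model. This is a deliberately simplified sketch of the outcome of `pip install` for a single package, not pip's resolver; `pip_install` and the `LATEST` placeholder are hypothetical names used only for illustration.

```python
from typing import Optional

LATEST = "latest"  # placeholder for "whatever the newest release is"


def pip_install(
    installed: Optional[str],
    pin: Optional[str] = None,
    upgrade: bool = False,
) -> str:
    """Toy model: return the version present after one `pip install` call.

    installed: version already in the environment, or None.
    pin: exact version requested with `==`, or None for a bare name.
    upgrade: whether --upgrade was passed.
    """
    if pin is not None:
        # An exact pin wins: pip replaces a newer or older install with it.
        return pin
    if installed is not None and not upgrade:
        # A bare requirement is already satisfied by any installed version.
        return installed
    return LATEST


if __name__ == "__main__":
    # Reverted Dockerfile edit: 1.3.0 is installed, then the diffusion_hip
    # extra (pinned to 1.1.8 at the time) is installed right after.
    image = pip_install(None, pin="1.3.0")
    image = pip_install(image, pin="1.1.8")
    print("image ships:", image)

    # Old install script: bare `pip install cache-dit` against that image.
    print("old script leaves:", pip_install(image))

    # Bridge fix: `pip install --upgrade 'cache-dit==1.3.0'`.
    print("new script leaves:", pip_install(image, pin="1.3.0", upgrade=True))
```

Under this model the last writer with an exact pin decides the final version, which is why the fix has to land in the toml that the Dockerfile resolves rather than in an earlier RUN step.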

Why NVIDIA was unaffected

NVIDIA CI installs the diffusion extra via pip install -e "python[dev,runai,tracing,diffusion]" (scripts/ci/cuda/ci_install_dependency.sh:227-232), which resolves against python/pyproject.toml (already at 1.3.0). The AMD path resolves against python/pyproject_other.toml, which had drifted — that's the bypass this PR closes.

Modifications

Three small surgical edits, the last of which is the bridge fix preserved in the install script:

python/pyproject_other.toml — bump diffusion_hip from cache-dit==1.1.8 to cache-dit==1.3.0. This is the actual source of truth that the ROCm Dockerfile resolves at image build time.
3rdparty/amd/wheel/sglang/pyproject.toml — bump diffusion_hip from cache-dit==1.1.8 to cache-dit==1.3.0, so the amd-sglang wheel stays in sync.
scripts/ci/amd/amd_ci_install_dependency.sh — change pip install cache-dit to pip install --upgrade 'cache-dit==1.3.0'. This is a bridge fix: any AMD CI runner still pulling an image built before the toml bump lands will be force-upgraded at job start. Once new images are built (≤24 h via the nightly image release workflows), this line becomes a no-op (pip sees 1.3.0 already installed and exits).

Out of scope but worth noting: diffusion_musa in both python/pyproject_other.toml:131 and 3rdparty/amd/wheel/sglang/pyproject.toml:146 still pins cache-dit==1.1.8. The MUSA CI is a separate pipeline (pr-test-musa.yml) and not in this hotfix's blast radius. Tracked as a follow-up.
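
A small check along the following lines could surface pins that lag behind the intended version. It is a sketch under stated assumptions, not part of this PR: `find_stale_pins` is a hypothetical helper, the extras are given as already-parsed dictionaries rather than read from the toml files, and the diffusion_musa entry is abbreviated to the one pin mentioned above.

```python
import re


def find_stale_pins(extras, package, expected):
    """Return {extra_name: pinned_version} for every extra whose exact
    (`==`) pin of `package` differs from `expected`."""
    pattern = re.compile(rf"^{re.escape(package)}==([^\s;,]+)")
    stale = {}
    for name, requirements in extras.items():
        for requirement in requirements:
            match = pattern.match(requirement.strip())
            if match and match.group(1) != expected:
                stale[name] = match.group(1)
    return stale


# Extras as they would look after this PR (assumption: only the cache-dit
# pin in diffusion_hip changes; diffusion_musa is abbreviated to one entry).
extras_after_pr = {
    "diffusion_hip": [
        "sglang[diffusion_common]",
        "peft>=0.18.0,<0.19.0",
        "st_attn==0.0.7",
        "vsa==0.0.4",
        "runai_model_streamer>=0.15.5",
        "cache-dit==1.3.0",
    ],
    "diffusion_musa": ["cache-dit==1.1.8"],
}

if __name__ == "__main__":
    print(find_stale_pins(extras_after_pr, "cache-dit", "1.3.0"))
```

Run against both python/pyproject_other.toml and the amd-sglang wheel pyproject, a check of this shape would report diffusion_musa as the remaining stale extra.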

5 commits, 5 net lines changed across 3 files (the previous revision's Dockerfile edit was reverted in commit 723fdd3f).

Validation criteria

https://github.com/sgl-project/sglang/actions/runs/25648379091

Test passed.

Accuracy Tests

N/A — install-only change, no model code touched.

Speed Tests and Profiling

N/A — install-only change.

Checklist

Surgical fix; no production code changed.
Version pin matches python/pyproject.toml (single source of truth) across pyproject_other.toml and the amd-sglang wheel pyproject.
Reverted the no-op Dockerfile edit after review feedback.
Bridge fix in install script keeps existing images green until next image rebuild.
Validated end-to-end (see Validation criteria above).

Follow-up (out of scope here)

A cleaner architectural fix would be to have scripts/ci/amd/amd_ci_install_dependency.sh install via pip install -e "python[diffusion]" (the way NVIDIA CI does), removing the hand-rolled cache-dit / accelerate / pytest / huggingface_hub lines entirely. That would make python/pyproject.toml the single source of truth for AMD too. Also, diffusion_musa should be bumped to 1.3.0 for symmetry. Tracked separately, not part of this hotfix.

Review and Merge Process

Ping Merge Oncalls.
CODEOWNERS approval (AMD CI / Docker).
Trigger AMD CI (/rerun-failed-ci or /tag-and-rerun-ci).
Merge once the listed jobs are green.

The AMD CI base image ships cache-dit==1.1.8, but multimodal_gen requires >=1.2.0 (uses cache_dit.parallelism.ParallelismBackend.AUTO, added in 1.2.0) and pyproject.toml pins ==1.3.0. The previous 'pip install cache-dit' (no --upgrade, no version) was a no-op against the preinstalled 1.1.8, which caused every multimodal-gen-test-{1,2}-gpu-amd[-rocm720] job to fail with:

RuntimeError: cache-dit>=1.2.0 is required for --cache-dit-config.
AttributeError: ParallelismBackend.AUTO

Pin to 1.3.0 (matching python/pyproject.toml) and force --upgrade so pip replaces the image's stale 1.1.8.

Co-authored-by: Bingxu Chen <Bingxu.Chen@amd.com>

Co-authored-by: Bingxu Chen <Bingxu.Chen@amd.com>

Match the install-script change so newly built ROCm images already ship cache-dit==1.3.0, instead of relying on the CI install script to upgrade the stale 1.1.8 from the base image at every job start. Co-authored-by: Bingxu Chen <Bingxu.Chen@amd.com>

gemini-code-assist · 2026-05-11T03:05:21Z

Warning

You have reached your daily quota limit. Please wait up to 24 hours and I will start processing your requests again!

bingxche · 2026-05-11T05:02:40Z

@cursor review

cursor · 2026-05-11T05:02:42Z

Skipping Bugbot: Bugbot could not find a matching SCM installation for this repository. Please reinstall the GitHub/GitLab installation and/or remove the installation from non-Bugbot accounts. Visit the Bugbot dashboard to update your settings.

cursor

Thanks for fixing the AMD CI cache-dit issue — the amd_ci_install_dependency.sh change (pin + --upgrade) is correct and matches python/pyproject.toml.

However, the docker/rocm.Dockerfile change is most likely a no-op at image build time, because the very next RUN block (lines 264–284) renames python/pyproject_other.toml → python/pyproject.toml and runs pip install -e "python[srt_hip,diffusion_hip]". The diffusion_hip extra in python/pyproject_other.toml still pins cache-dit==1.1.8:

diffusion_hip = [
  "sglang[diffusion_common]",
  "peft>=0.18.0,<0.19.0", # Pin to <0.19.0 due to torchao incompatibility
  "st_attn==0.0.7",
  "vsa==0.0.4",
  "runai_model_streamer>=0.15.5",
  "cache-dit==1.1.8",
]

So pip will downgrade cache-dit from 1.3.0 back to 1.1.8 right after your new pip install cache-dit==1.3.0 line. That's almost certainly why the AMD CI base image ships 1.1.8 today, and why the install-script --upgrade workaround is needed in the first place.

If the intent is for ROCm images to ship cache-dit==1.3.0 (matching python/pyproject.toml), the real fix is to bump the pin in python/pyproject_other.toml (diffusion_hip, and ideally also diffusion_musa) to 1.3.0. The same stale pin also lives in 3rdparty/amd/wheel/sglang/pyproject.toml and should be updated for consistency. Once those are bumped, the Dockerfile line you added becomes redundant.

Recommendation:

Keep the amd_ci_install_dependency.sh change as-is (good defensive runtime upgrade).
Either (a) bump cache-dit==1.1.8 → 1.3.0 in python/pyproject_other.toml (diffusion_hip) and 3rdparty/amd/wheel/sglang/pyproject.toml (diffusion_hip) and drop the Dockerfile line, or (b) drop the Dockerfile change entirely since it's overwritten downstream.

Sent by Cursor Automation: Bingxu’s Automation

cursor · 2026-05-11T05:05:10Z

    && pip install torchao==0.9.0 \
-    && pip install pybind11
+    && pip install pybind11 \
+    && pip install cache-dit==1.3.0


This install is overwritten a few lines below. The next RUN block does:

rm -rf python/pyproject.toml && mv python/pyproject_other.toml python/pyproject.toml ... python -m pip --no-cache-dir install -e "python[srt_hip,diffusion_hip]"

and diffusion_hip in python/pyproject_other.toml still pins cache-dit==1.1.8, so pip will downgrade cache-dit from 1.3.0 back to 1.1.8 during the sglang editable install. The net effect is that the built image still ships cache-dit==1.1.8, which is exactly why the amd_ci_install_dependency.sh upgrade is required at job start.

To make this line actually do something (or to render it unnecessary), bump cache-dit==1.1.8 → 1.3.0 in python/pyproject_other.toml (diffusion_hip extra, and likely also diffusion_musa) and in 3rdparty/amd/wheel/sglang/pyproject.toml. Otherwise, this line can be dropped from the PR.

cursor · 2026-05-11T05:05:10Z


  # Install cache-dit for qwen_image_t2i_cache_dit_enabled test (added in PR 16204)
-  docker exec ci_sglang pip install --cache-dir=/sgl-data/pip-cache cache-dit || echo "cache-dit installation failed"
+  docker exec ci_sglang pip install --cache-dir=/sgl-data/pip-cache --upgrade 'cache-dit==1.3.0' || echo "cache-dit installation failed"


This change LGTM — pinning to 1.3.0 matches python/pyproject.toml, and --upgrade is needed because the AMD CI base image ships cache-dit==1.1.8 (since python/pyproject_other.toml and 3rdparty/amd/wheel/sglang/pyproject.toml still pin 1.1.8 for diffusion_hip). Once those pyproject pins are bumped, this --upgrade step will become a no-op but won't hurt to keep as a safety net.

This revert reverses the previous Dockerfile change after review pointed out it was a no-op: the next RUN block (rocm.Dockerfile:281) installs 'python[srt_hip,diffusion_hip]', and the diffusion_hip extra in python/pyproject_other.toml pins cache-dit==1.1.8, so pip downgrades the 1.3.0 we just installed. The real fix is to bump the pin in the toml files (next two commits). Co-authored-by: Bingxu Chen <Bingxu.Chen@amd.com>

…_hip This is the real source of the cache-dit==1.1.8 in AMD CI images: the ROCm Dockerfile renames pyproject_other.toml -> pyproject.toml and installs the diffusion_hip extra, which used to pin 1.1.8. Bumping to 1.3.0 aligns with python/pyproject.toml and fixes R10 (multimodal-gen-test-{1,2}-gpu-amd[-rocm720]) at the source. Co-authored-by: Bingxu Chen <Bingxu.Chen@amd.com>

Same fix in the amd-sglang wheel pyproject so wheel-built images stay in sync with python/pyproject.toml and python/pyproject_other.toml. Co-authored-by: Bingxu Chen <Bingxu.Chen@amd.com>

cursoragent and others added 3 commits May 11, 2026 02:49

[AMD CI] trim verbose cache-dit install comments

fba6ac9

Co-authored-by: Bingxu Chen <Bingxu.Chen@amd.com>

github-actions Bot added the amd label May 11, 2026

cursor Bot reviewed May 11, 2026

cursoragent and others added 3 commits May 11, 2026 05:09

[AMD] bump cache-dit 1.1.8 -> 1.3.0 in amd-sglang wheel diffusion_hip

2774de0

github-actions Bot added the dependencies label May 11, 2026

bingxche marked this pull request as ready for review May 11, 2026 14:08

yctseng0211 approved these changes May 11, 2026

yctseng0211 merged commit aeb8fef into main May 11, 2026
70 of 74 checks passed

yctseng0211 deleted the bingxche/fix-amd-ci-cache-dit-upgrade-bd1b branch May 11, 2026 14:32

yctseng0211 merged 6 commits into main from bingxche/fix-amd-ci-cache-dit-upgrade-bd1b
