
ci: clean up stale-CUDA mooncake variant in install_extra_deps #23960

Merged
Kangyan-Zhou merged 2 commits into sgl-project:main from Kangyan-Zhou:fix_mooncake_stale_uninstall
Apr 29, 2026

@Kangyan-Zhou
Collaborator

Summary

  • Defensively uninstall the opposite-CUDA mooncake-transfer-engine variant before installing the right one in install_extra_deps, so CI runners self-heal after an earlier stale-base PR run has left the wrong variant installed.

Background

PR #23119 (6ecd6f84d, merged 2026-04-19 12:32 UTC) added the CUDA 13 conditional that picks mooncake-transfer-engine-cuda13 instead of mooncake-transfer-engine for CUDA 13 builds.

PRs whose base commit predates that merge still run the old ci_install_dependency.sh from their own checkout, which unconditionally installs mooncake-transfer-engine. Both packages own the same mooncake/ Python package directory, so the wrong variant lingers on the runner alongside the right one (and can take precedence depending on install order).

A fleet-wide audit on 2026-04-28 found the non-cuda13 variant on ~60 containers across all CI hosts. 17 of those were "post-cutoff" infections traced to 11 distinct stale PRs (#19582, #20177, #21388, #21543, #21674, #22289, #22921, #22972, #23013, #23053, #23139), each based on a commit older than the cutoff.

Fix

install_extra_deps already picks MOONCAKE_PKG based on CU_MAJOR. This change picks MOONCAKE_STALE_PKG as the opposite variant and runs pip uninstall on it (best-effort) right before the install step. Because the selection is symmetric, it also handles a hypothetical CUDA 13 → 12 rollback.
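
The symmetric selection described above could look roughly like the sketch below. It is not the actual contents of ci_install_dependency.sh: the function names, the overridable `PIP` variable, and the exact comparison against `13` are assumptions made for illustration. Only `MOONCAKE_PKG`, `MOONCAKE_STALE_PKG`, `CU_MAJOR`, and the two package names come from this PR.

```shell
#!/usr/bin/env bash

# Hypothetical helper: given a CUDA major version, set the variant to install
# (MOONCAKE_PKG) and the opposite variant to clean up (MOONCAKE_STALE_PKG).
select_mooncake_pkgs() {
    local cu_major="$1"
    if [ "$cu_major" = "13" ]; then   # assumption: the real condition may differ
        MOONCAKE_PKG="mooncake-transfer-engine-cuda13"
        MOONCAKE_STALE_PKG="mooncake-transfer-engine"
    else
        MOONCAKE_PKG="mooncake-transfer-engine"
        MOONCAKE_STALE_PKG="mooncake-transfer-engine-cuda13"
    fi
}

# Hypothetical helper: best-effort removal of the opposite variant.
# PIP is overridable (an assumption) so the logic can be dry-run without pip.
uninstall_stale_mooncake() {
    # "|| true" keeps a failed uninstall from failing the CI step.
    ${PIP:-python3 -m pip} uninstall -y "$MOONCAKE_STALE_PKG" || true
}

# Show what would be selected for the current CU_MAJOR (defaults to 12 here).
select_mooncake_pkgs "${CU_MAJOR:-12}"
echo "install=${MOONCAKE_PKG} stale=${MOONCAKE_STALE_PKG}"
```

Because both names are derived in the same branch, a rollback from CUDA 13 to 12 simply swaps which package is treated as stale.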

Test plan

  • Per-commit CI runs and installs cleanly with no mooncake variant lingering on the runner afterwards
  • Runners that later re-run stale-base PRs return to a clean state the next time they run any fresh-base PR
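
One way to spot-check the first item on a runner is sketched below. The function name `runner_is_clean` and the overridable `PIP` variable are hypothetical and not part of the PR. The sketch relies only on `pip show` exiting non-zero when a package is not installed.

```shell
#!/usr/bin/env bash

# Hypothetical check: succeed only if the wanted variant is installed
# and the opposite (stale) variant is absent.
runner_is_clean() {
    local want="$1" stale="$2"
    # The wanted variant must be present.
    ${PIP:-python3 -m pip} show "$want" >/dev/null 2>&1 || return 1
    # The stale variant must not be present.
    if ${PIP:-python3 -m pip} show "$stale" >/dev/null 2>&1; then
        return 1
    fi
    return 0
}

# Example invocation on a CUDA 13 runner (left commented so the sketch has no side effects):
# runner_is_clean mooncake-transfer-engine-cuda13 mooncake-transfer-engine && echo "clean"
```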

🤖 Generated with Claude Code

Stale-base PR runs (older than the 2026-04-19 CUDA 13 upgrade) install
mooncake-transfer-engine instead of mooncake-transfer-engine-cuda13.
Since both packages own the same mooncake/ Python package dir, the
infection lingered on every CI runner that had ever executed such a PR
(~60 containers across the fleet at audit time). Defensively uninstall
the opposite-CUDA variant before installing the correct one so the next
fresh-base run on each runner self-heals.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

@Kangyan-Zhou
Collaborator Author

/tag-and-rerun-ci

The previous revision uninstalled the stale opposite-CUDA mooncake
package but didn't reinstall the live one. Both packages own the same
bin/ scripts (mooncake_master, etc.) and Python files; pip uninstall
deletes those shared files, but the live variant's RECORD still claims
them so pip sees "already satisfied" on the next install and skips.
Result: missing /usr/local/bin/mooncake_master, breaking the hicache
storage mooncake test.

Detect the stale variant explicitly with `pip show`, and only when found
do uninstall + force-reinstall (--no-deps) so we restore shared files
without churning unrelated dependencies.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
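
A minimal sketch of the detect-then-repair flow this commit message describes, not the merged script itself. The function name `repair_mooncake_variant` and the overridable `PIP` variable are assumptions. The pip subcommands and flags (`show`, `uninstall -y`, `install --force-reinstall --no-deps`) are standard pip options.

```shell
#!/usr/bin/env bash

# Hypothetical helper: if the stale variant is installed, remove it and then
# force-reinstall the wanted variant so files both packages ship are restored.
repair_mooncake_variant() {
    local want="$1" stale="$2"
    # `pip show` exits non-zero when the package is not installed.
    if ${PIP:-python3 -m pip} show "$stale" >/dev/null 2>&1; then
        echo "Stale ${stale} found; uninstalling and restoring ${want}"
        # Best-effort: removing the stale package also deletes the shared files.
        ${PIP:-python3 -m pip} uninstall -y "$stale" || true
        # --force-reinstall rewrites the wanted package's files even though its
        # metadata says it is already installed; --no-deps leaves all other
        # packages untouched.
        ${PIP:-python3 -m pip} install --force-reinstall --no-deps "$want"
    fi
}

# Example invocation on a CUDA 13 runner (left commented so the sketch has no side effects):
# repair_mooncake_variant mooncake-transfer-engine-cuda13 mooncake-transfer-engine
```

Gating the reinstall on `pip show` keeps a clean runner on the fast path: nothing is uninstalled or reinstalled unless the stale variant is actually present.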
Kangyan-Zhou merged commit feec1ac into sgl-project:main on Apr 29, 2026
87 of 97 checks passed
vguduruTT pushed a commit to vguduruTT/sglang that referenced this pull request May 2, 2026
…roject#23960)

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>