ci: clean up stale-CUDA mooncake variant in install_extra_deps#23960
Merged
Kangyan-Zhou merged 2 commits into sgl-project:main on Apr 29, 2026
Conversation
Stale-base PR runs (older than the 2026-04-19 CUDA 13 upgrade) install mooncake-transfer-engine instead of mooncake-transfer-engine-cuda13. Since both packages own the same mooncake/ Python package dir, the infection lingered on every CI runner that had ever executed such a PR (~60 containers across the fleet at audit time). Defensively uninstall the opposite-CUDA variant before installing the correct one so the next fresh-base run on each runner self-heals. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
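The selection step behind this change can be sketched as follows. This is a minimal illustration, not the script's actual code: the names CU_MAJOR, MOONCAKE_PKG and MOONCAKE_STALE_PKG come from the PR description, while the function name, the `-ge 13` comparison, and the default of 13 are assumptions made for the example.

```shell
#!/usr/bin/env bash
# Sketch only: pick the mooncake wheel for this CUDA major version and,
# symmetrically, the opposite-CUDA wheel that should not remain installed.
# select_mooncake_pkgs is a hypothetical helper name.

select_mooncake_pkgs() {
    local cu_major="$1"
    if [ "${cu_major}" -ge 13 ]; then   # assumed threshold
        MOONCAKE_PKG="mooncake-transfer-engine-cuda13"
        MOONCAKE_STALE_PKG="mooncake-transfer-engine"
    else
        MOONCAKE_PKG="mooncake-transfer-engine"
        MOONCAKE_STALE_PKG="mooncake-transfer-engine-cuda13"
    fi
}

# Default of 13 is an assumption so the sketch runs without CU_MAJOR set.
select_mooncake_pkgs "${CU_MAJOR:-13}"
echo "install=${MOONCAKE_PKG} stale=${MOONCAKE_STALE_PKG}"
```

Because each branch sets both variables, the same code covers an upgrade and a rollback without a second conditional.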
Collaborator Author
/tag-and-rerun-ci
The previous revision uninstalled the stale opposite-CUDA mooncake package but didn't reinstall the live one. Both packages own the same bin/ scripts (mooncake_master, etc.) and Python files; pip uninstall deletes those shared files, but the live variant's RECORD still claims them so pip sees "already satisfied" on the next install and skips. Result: missing /usr/local/bin/mooncake_master, breaking the hicache storage mooncake test. Detect the stale variant explicitly with `pip show`, and only when found do uninstall + force-reinstall (--no-deps) so we restore shared files without churning unrelated dependencies. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
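A rough sketch of the detect-then-heal sequence this commit message describes is shown below. It is an illustration under stated assumptions rather than the merged code: `heal_mooncake` and the `PIP` override are hypothetical names, and only the standard pip subcommands and flags mentioned above (`pip show`, `pip uninstall`, `--force-reinstall`, `--no-deps`) are used.

```shell
#!/usr/bin/env bash
# Sketch only: heal a runner that has the stale opposite-CUDA mooncake wheel.
# PIP is a hypothetical override hook; by default it calls pip via python3.
PIP="${PIP:-python3 -m pip}"

heal_mooncake() {
    local live_pkg="$1" stale_pkg="$2"
    # `pip show` exits non-zero when the distribution is not installed,
    # so a clean runner skips the uninstall and reinstall entirely.
    if ${PIP} show "${stale_pkg}" >/dev/null 2>&1; then
        echo "stale ${stale_pkg} found; removing it and restoring ${live_pkg}"
        ${PIP} uninstall -y "${stale_pkg}" || true
        # The uninstall also deleted files shared with the live package.
        # --force-reinstall rewrites them; --no-deps leaves other packages alone.
        ${PIP} install --force-reinstall --no-deps "${live_pkg}"
    else
        echo "no stale ${stale_pkg}; nothing to heal"
    fi
}
```

A caller would run something like `heal_mooncake "$MOONCAKE_PKG" "$MOONCAKE_STALE_PKG"` ahead of the normal install step. Gating on `pip show` keeps the common case (no stale variant) free of any reinstall cost.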
vguduruTT pushed a commit to vguduruTT/sglang that referenced this pull request on May 2, 2026
…roject#23960) Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Summary
Uninstall the opposite-CUDA mooncake-transfer-engine variant before installing the right one in install_extra_deps, so CI runners self-heal from prior stale-base PR infections.
Background
PR #23119 (6ecd6f84d, merged 2026-04-19 12:32 UTC) added the CU 13 conditional that picks mooncake-transfer-engine-cuda13 instead of mooncake-transfer-engine for CUDA 13 builds.
PRs whose base commit predates that merge still run the old ci_install_dependency.sh from their own checkout, which unconditionally installs mooncake-transfer-engine. Both packages own the same mooncake/ Python package directory, so the wrong variant lingers on the runner alongside the right one (and can take precedence depending on install order).
A fleet-wide audit on 2026-04-28 found the non-cuda13 variant on ~60 containers across all CI hosts. 17 of those were "post-cutoff" infections traced to 11 distinct stale PRs (#19582, #20177, #21388, #21543, #21674, #22289, #22921, #22972, #23013, #23053, #23139), each based on a commit older than the cutoff.
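One way to see the shared ownership on a single runner is to look at the RECORD file each installed distribution keeps in its .dist-info directory. The sketch below is not how the audit was necessarily performed; `owners_of_mooncake` is a hypothetical helper, and it assumes the standard RECORD layout in which each line starts with an installed file path relative to site-packages.

```shell
#!/usr/bin/env bash
# Sketch only: list every installed distribution whose RECORD claims files
# under mooncake/. More than one result means two wheels own the same
# package directory.

owners_of_mooncake() {
    local site_dir="$1" record
    for record in "${site_dir}"/*.dist-info/RECORD; do
        [ -f "${record}" ] || continue   # glob matched nothing
        # RECORD is CSV (path,hash,size); match on the path prefix only.
        if grep -q '^mooncake/' "${record}"; then
            basename "$(dirname "${record}")"
        fi
    done
}
```

It could be pointed at the active environment with `owners_of_mooncake "$(python3 -c 'import sysconfig; print(sysconfig.get_paths()["purelib"])')"`. Two lines of output would indicate that both variants are installed and claim the same files.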
Fix
install_extra_deps already picks MOONCAKE_PKG based on CU_MAJOR. Pick MOONCAKE_STALE_PKG symmetrically and pip uninstall it (best-effort) right before the install step. This is symmetric, so it also handles a hypothetical CUDA 13 → 12 rollback.
Test plan
🤖 Generated with Claude Code