[CI] sgl-kernel: prune dangling images before each wheel build #23723
Merged
build.sh retags sgl-kernel-deps:cuda${CUDA_VERSION}-${PY_TAG}-${ARCH} on
every run, so each successful run leaves the previous image as a dangling
<none>:<none> entry (~16-23 GB). Over many CI runs this fills the
self-hosted runner's disk, causing failures like "no space left on device"
during `--load` (e.g. PR #23720).
Add a cleanup step before the build that runs `docker image prune -f
--filter until=1h` and drops orphan buildx builder volumes that aren't
the active sgl-kernel-builder. The local buildx cache at
~/.cache/sgl-kernel/buildx, ccache, the tagged sgl-kernel-deps image,
and the live builder volume are all preserved, so incremental rebuild
speed is unchanged. The `until=1h` filter keeps a sibling matrix cell
(cuda 12.9 vs 13.0) from racing on a freshly-orphaned image.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
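The cleanup described in this commit message can be sketched roughly as the shell below. This is a minimal sketch, not the workflow step from the PR: the function names `select_orphan_builder_volumes` and `free_docker_disk_space` are hypothetical, and the prefix match on `buildx_buildkit_sgl-kernel-builder` is an assumption about how the active builder's volume is recognized. Only `docker image prune -f --filter "until=1h"` is quoted from the PR itself.

```shell
set -eu

# Builder name that build.sh uses, per the PR description.
ACTIVE_BUILDER="sgl-kernel-builder"

# Hypothetical helper: read volume names on stdin and print only the buildx
# state volumes that do NOT belong to the active builder.
select_orphan_builder_volumes() {
  grep -E '^buildx_buildkit_' \
    | grep -Ev "^buildx_buildkit_${ACTIVE_BUILDER}" || true
}

# Hypothetical wrapper for the whole cleanup step.
free_docker_disk_space() {
  df -h /
  # Remove dangling (untagged) images created more than one hour ago.
  docker image prune -f --filter "until=1h"
  # Drop buildx state volumes left behind by other builders.
  docker volume ls -q | select_orphan_builder_volumes | while read -r vol; do
    # Removal fails for a volume that is still in use; skip it and continue.
    docker volume rm "$vol" || true
  done
  df -h /
}
```

Calling `free_docker_disk_space` before `build.sh` would match the ordering the commit describes. Keeping the volume filter in its own function makes it possible to check the matching logic against sample volume names without a Docker daemon.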
Address review feedback: be more conservative about pruning recently-orphaned images.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
vguduruTT
pushed a commit
to vguduruTT/sglang
that referenced
this pull request
May 2, 2026
…roject#23723) Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
LucQueen
pushed a commit
to LucQueen/sglang
that referenced
this pull request
May 12, 2026
…roject#23723) Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Summary
`sgl-kernel/build.sh` calls `docker buildx build --target deps --load -t sgl-kernel-deps:cuda${CUDA_VERSION}-${PY_TAG}-${ARCH}` on every run. Because the tag is reused, each successful build retags the new image and orphans the previous one as a dangling `<none>:<none>` entry (~16-23 GB each). Over many CI runs this fills the self-hosted runner's disk and breaks the build.

Example failure on PR #23720 (job `Build Wheel Arm (3.10, 12.9)`): https://github.com/sgl-project/sglang/actions/runs/24935867767/job/73021376404. The runner warns `Free space left: 79 MB`, then `--load` dies with `importing to docker: write /blobs/sha256/...: no space left on device`. Inspection of the runner found 68 dangling `sgl-kernel-deps` images consuming ~480 GB and 5 stale `buildx_buildkit_builder-<uuid>_state` volumes (~38 GB) from old `docker buildx create` invocations.

This PR adds a "Free Docker disk space" step before the build in both `sgl-kernel-build-wheels` and `sgl-kernel-build-wheels-arm`:
- `docker image prune -f --filter "until=1h"` removes only dangling images older than 1 hour. The 1-hour filter avoids racing with a sibling matrix cell (e.g. cuda 12.9 vs 13.0) that may have just orphaned an image moments earlier. The disk fill accumulates over days, not minutes, so a 1-hour cutoff still reclaims nearly all of the wasted space.
- Removes orphan `buildx_buildkit_*` volumes that don't belong to the active named builder `sgl-kernel-builder` (which `build.sh` always uses).

Explicitly not touched, so incremental rebuild speed is unchanged:
- `~/.cache/sgl-kernel/buildx/`: the `--cache-from` / `--cache-to type=local` directory used by `build.sh`.
- `~/.cache/sgl-kernel/ccache/`: host-mounted into the build container.
- The tagged `sgl-kernel-deps:*` image: `docker image prune -f` only removes untagged dangling images.
- The live `buildx_buildkit_sgl-kernel-builder*` volume: explicitly skipped by the regex.

Test plan
`df -h /` after pruning, and reports significant freed space on the first run.

🤖 Generated with Claude Code
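One way to make the "reports significant freed space" check concrete is to measure available space before and after the cleanup and print the difference. The sketch below is an illustration only: `avail_kb` and `report_freed_space` are hypothetical helpers that do not appear in the PR, and it assumes the Docker data lives on the root filesystem.

```shell
set -eu

# Hypothetical helper: print the available KiB on the filesystem holding "$1".
# `df -Pk` uses the POSIX portable format, where line 2, column 4 is "Available".
avail_kb() {
  df -Pk "$1" | awk 'NR == 2 { print $4 }'
}

# Hypothetical helper: run the given command and report how much space on /
# became available while it ran.
report_freed_space() {
  before=$(avail_kb /)
  "$@"
  after=$(avail_kb /)
  echo "freed $(( (after - before) / 1024 )) MiB on /"
}
```

For example, `report_freed_space docker image prune -f --filter "until=1h"` would print the space reclaimed by the prune alone. Other processes writing to the disk at the same time will skew the number, so it is best read as an estimate.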