[CI] sgl-kernel: prune dangling images before each wheel build#23723

Merged
Kangyan-Zhou merged 2 commits into main from ci-cleanup-build-wheel on Apr 25, 2026

@Kangyan-Zhou
Collaborator

Summary

sgl-kernel/build.sh calls docker buildx build --target deps --load -t sgl-kernel-deps:cuda${CUDA_VERSION}-${PY_TAG}-${ARCH} on every run. Because the tag is reused, each successful build moves the tag to the new image and orphans the previous one as a dangling <none>:<none> entry (~16-23 GB each). Over many CI runs this fills the self-hosted runner's disk and breaks the build.

Example failure on PR #23720 (job Build Wheel Arm (3.10, 12.9)): https://github.com/sgl-project/sglang/actions/runs/24935867767/job/73021376404 — runner warns Free space left: 79 MB, then --load dies with importing to docker: write /blobs/sha256/...: no space left on device. Inspection of the runner found 68 dangling sgl-kernel-deps images consuming ~480 GB and 5 stale buildx_buildkit_builder-<uuid>_state volumes (~38 GB) from old docker buildx create invocations.

This PR adds a "Free Docker disk space" step before the build in both sgl-kernel-build-wheels and sgl-kernel-build-wheels-arm:

  • docker image prune -f --filter "until=1h" removes only dangling images older than 1 hour. The 1-hour filter avoids racing with a sibling matrix cell (e.g. cuda 12.9 vs 13.0) that may have orphaned an image moments earlier. The disk fills up cumulatively over days, not minutes, so a one-hour cutoff still reclaims nearly all of the accumulated space.
  • A loop drops orphan buildx_buildkit_* volumes that don't belong to the active named builder sgl-kernel-builder (which build.sh always uses).
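The two cleanup actions above can be sketched roughly as follows. This is a minimal illustration, not the workflow step from this PR: the function names select_orphan_volumes and free_docker_disk_space are hypothetical, and the exact regex used in the real step is not shown in this description, so the patterns below are assumptions based on the volume names mentioned here.

```shell
# Hypothetical sketch of a "Free Docker disk space" step.
# ACTIVE_BUILDER is the builder name that build.sh is described as using.
ACTIVE_BUILDER="sgl-kernel-builder"

# Reads volume names on stdin (one per line) and prints only the buildx
# state volumes that do NOT belong to the active builder.
select_orphan_volumes() {
  grep -E '^buildx_buildkit_' \
    | grep -Ev "^buildx_buildkit_${ACTIVE_BUILDER}" \
    || true   # an empty result is not an error
}

free_docker_disk_space() {
  # Remove dangling (untagged) images created more than one hour ago.
  docker image prune -f --filter "until=1h"

  # Drop orphan builder volumes; skip any that Docker refuses to remove.
  docker volume ls -q | select_orphan_volumes | while read -r vol; do
    docker volume rm "$vol" || true
  done

  # Show the remaining free space on the root filesystem.
  df -h /
}
```

Calling free_docker_disk_space before the wheel build would run both actions. Keeping the name filter in its own function makes it possible to check the pattern against sample volume names without touching Docker.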

Explicitly not touched, so incremental rebuild speed is unchanged:

  • ~/.cache/sgl-kernel/buildx/ — the --cache-from/--cache-to type=local directory used by build.sh.
  • ~/.cache/sgl-kernel/ccache/ — host-mounted into the build container.
  • The actively-tagged sgl-kernel-deps:* image — docker image prune -f only removes untagged dangling images.
  • buildx_buildkit_sgl-kernel-builder* volume — explicitly skipped by the regex.

Test plan

  • Verify the new step runs on a self-hosted runner, prints df -h / after pruning, and reports significant freed space on the first run.
  • Verify the step on a clean runner (with no orphans yet) is a fast no-op.
  • Verify a follow-up CI run after this one builds incrementally as fast as before — i.e. the local buildx cache and ccache are still hit.
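For the first two checks, one way to quantify the freed space is to compare available disk space before and after the cleanup. The helpers below are a hypothetical sketch (avail_kib and report_freed are not part of this PR) and assume POSIX df output, where the fourth column of the second line is the available space in KiB.

```shell
# Hypothetical helper: available space on a mount point, in KiB.
avail_kib() {
  df -Pk "${1:-/}" | awk 'NR==2 {print $4}'
}

# Run a command and report how many KiB it freed on /.
# A clean runner with no orphans should report a value near zero.
report_freed() {
  local before after
  before=$(avail_kib /)
  "$@"
  after=$(avail_kib /)
  echo "freed_kib=$((after - before))"
}
```

For example, report_freed docker image prune -f --filter "until=1h" would print the change in available space caused by the prune alone.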

🤖 Generated with Claude Code

build.sh retags sgl-kernel-deps:cuda${CUDA_VERSION}-${PY_TAG}-${ARCH} on
every run, so each successful run leaves the previous image as a dangling
<none>:<none> entry (~16-23 GB). Over many CI runs this fills the
self-hosted runner's disk, causing failures like "no space left on device"
during `--load` (e.g. PR #23720).

Add a cleanup step before the build that runs `docker image prune -f
--filter until=1h` and drops orphan buildx builder volumes that aren't
the active sgl-kernel-builder. The local buildx cache at
~/.cache/sgl-kernel/buildx, ccache, the tagged sgl-kernel-deps image,
and the live builder volume are all preserved, so incremental rebuild
speed is unchanged. The `until=1h` filter keeps a sibling matrix cell
(cuda 12.9 vs 13.0) from racing on a freshly-orphaned image.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@gemini-code-assist
Contributor

Note

Gemini is unable to generate a review for this pull request due to the file types involved not being currently supported.

Address review feedback: be more conservative about pruning recently-
orphaned images.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@Kangyan-Zhou Kangyan-Zhou merged commit acaa356 into main Apr 25, 2026
41 of 43 checks passed
@Kangyan-Zhou Kangyan-Zhou deleted the ci-cleanup-build-wheel branch April 25, 2026 17:44
vguduruTT pushed a commit to vguduruTT/sglang that referenced this pull request May 2, 2026
…roject#23723)

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
LucQueen pushed a commit to LucQueen/sglang that referenced this pull request May 12, 2026
…roject#23723)

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>