[CI] Add PID namespace and ps auxf diagnostics to killall.py#21401
Conversation
When kill fails on a PID reported by nvidia-smi, log whether the PID is in a different PID namespace (indicating it belongs to another container on the same host). Also dump filtered ps auxf output to show what processes are actually running in the current container. This helps diagnose cases where GPU memory remains dirty but the reported PID cannot be killed (ProcessLookupError), which happens when multiple CI runner containers share GPUs via --gpus all with soft CUDA_VISIBLE_DEVICES isolation.
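A minimal sketch of what such a namespace check could look like, assuming standard Linux /proc semantics: the `/proc/<pid>/ns/pid` symlink target encodes the namespace inode (e.g. `pid:[4026531836]`), so two processes share a PID namespace iff the targets match. The function name and return convention here are illustrative, not necessarily the PR's exact `_check_pid_namespace` code:

```python
import os

def check_pid_namespace(pid: int) -> bool:
    """Return True if `pid` appears to share our PID namespace.

    Compares /proc/<pid>/ns/pid symlink targets; on Linux these look
    like 'pid:[4026531836]' and are equal iff the namespaces match.
    """
    try:
        ours = os.readlink("/proc/self/ns/pid")
        theirs = os.readlink(f"/proc/{pid}/ns/pid")
    except OSError:
        # PID is gone, or its /proc entry is unreadable from this
        # container; either way it is not visible/killable from here.
        return False
    return ours == theirs
```

Note that a PID reported by `nvidia-smi` on the host side may not even have a `/proc` entry inside the container, so the `OSError` branch (rather than a mismatch) is the common signal for "another container's process".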
Code Review
This pull request enhances the killall.py script by adding new diagnostic capabilities. It introduces _check_pid_namespace to verify process PID namespaces and _log_ps_diagnostic to capture filtered ps auxf output. These functions are integrated into _kill_pids to provide more detailed logging and troubleshooting information when os.kill operations fail. Feedback suggests improving error handling in _check_pid_namespace by catching OSError for broader robustness and enhancing _log_ps_diagnostic to better handle ps command failures by using check=True and logging stderr for improved debugging.
- Box only shows kill summary (success/fail per PID)
- Namespace check and ps auxf diagnostics print after the box
- Retry loop silently retries kills, collects unkillable PIDs
- Unkillable PID summary shown in box, details after
…U memory check

- Add _find_sglang_pids_by_name() that scans /proc/*/cmdline for SGLang process patterns (matching killall_sglang.sh's pgrep patterns), catching processes not visible to nvidia-smi (e.g. stuck before CUDA init)
- Kill name-matched processes before GPU PID kill in _ci_mode()
- Merge _check_gpu_memory() and _log_gpu_memory() into one function with a log= parameter to eliminate code duplication
…ty guard, always output ps diagnostic

- Record actual exception type (ProcessLookupError/PermissionError) for unkillable PIDs instead of assuming "different namespace"
- Add pid > 1 and self-pid safety guard to retry loop inline kill
- Always output ps auxf diagnostic on GPU dirty failure, not only when unkillable PIDs exist
Force-pushed from 8d77aa1 to a0075a6
Summary
- When killall.py fails to kill a PID reported by nvidia-smi, log whether the PID belongs to a different PID namespace (i.e. another container on the same host)
- Dump filtered ps auxf output on kill failure to show what processes are actually running in the current container
- Helps diagnose cases where GPU memory remains dirty but the reported PID cannot be killed (ProcessLookupError), which happens when runner containers share GPUs via --gpus all with soft CUDA_VISIBLE_DEVICES isolation

Test plan