
[CI] Add PID namespace and ps auxf diagnostics to killall.py #21401

Merged
hnyls2002 merged 9 commits into main from lsyin/add-aux-info on Mar 26, 2026


@hnyls2002
Collaborator

Summary

  • When killall.py fails to kill a PID reported by nvidia-smi, log whether the PID belongs to a different PID namespace (i.e. another container on the same host)
  • Dump filtered ps auxf output on kill failure to show what processes are actually running in the current container
  • Helps diagnose the recurring CI issue where GPU memory stays dirty but the reported PID cannot be killed (ProcessLookupError), which happens when runner containers share GPUs via --gpus all with soft CUDA_VISIBLE_DEVICES isolation
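
A minimal sketch of how such a namespace check could look, assuming a Linux /proc filesystem. The function name `describe_pid_namespace` and its return strings are hypothetical and are not taken from the PR; the sketch only compares the `ns/pid` links that the kernel exposes, and treats a missing /proc entry as what one would expect for a PID owned by another container.

```python
import os


def describe_pid_namespace(pid: int) -> str:
    """Describe how `pid` relates to the PID namespace of the current process."""
    if not os.path.isdir("/proc/self"):
        return "unknown (no /proc on this system)"
    proc_dir = f"/proc/{pid}"
    if not os.path.exists(proc_dir):
        # No /proc entry: the PID is not visible from where we are running.
        return "not visible in this PID namespace"
    try:
        own_ns = os.readlink("/proc/self/ns/pid")
        other_ns = os.readlink(f"{proc_dir}/ns/pid")
    except OSError as exc:
        # Covers PermissionError and the process exiting between the two calls.
        return f"unknown ({type(exc).__name__})"
    if own_ns == other_ns:
        return "same PID namespace"
    return "different PID namespace"
```

Returning a string rather than a bool keeps the "could not tell" case distinct from a definite answer, which matters when the result is only used for logging.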

Test plan

  • Verify on a CI run where GPU cleanup fails — check that namespace info and ps output appear in the log
  • Verify on a clean CI run — no diagnostic output should appear

When kill fails on a PID reported by nvidia-smi, log whether the PID
is in a different PID namespace (indicating it belongs to another
container on the same host). Also dump filtered ps auxf output to
show what processes are actually running in the current container.

This helps diagnose cases where GPU memory remains dirty but the
reported PID cannot be killed (ProcessLookupError), which happens
when multiple CI runner containers share GPUs via --gpus all with
soft CUDA_VISIBLE_DEVICES isolation.
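
The filtered ps dump described above might be sketched as follows. The keyword list and the names `filter_ps_lines` and `dump_filtered_ps` are assumptions for illustration; the PR text does not say which filter killall.py applies.

```python
import subprocess

# Hypothetical keywords; the actual filter used by killall.py is not shown in this thread.
KEYWORDS = ("python", "sglang", "nvidia")


def filter_ps_lines(ps_output: str, keywords=KEYWORDS) -> list[str]:
    """Keep the ps header line plus every row that mentions one of the keywords."""
    lines = ps_output.splitlines()
    if not lines:
        return []
    header, rows = lines[0], lines[1:]
    kept = [row for row in rows if any(k in row.lower() for k in keywords)]
    return [header] + kept


def dump_filtered_ps() -> list[str]:
    """Run `ps auxf` and return the filtered lines, or [] if ps cannot be run."""
    try:
        result = subprocess.run(
            ["ps", "auxf"], capture_output=True, text=True, timeout=10
        )
    except (OSError, subprocess.TimeoutExpired):
        return []
    return filter_ps_lines(result.stdout)
```

Keeping the filter separate from the subprocess call makes it possible to test the filtering on canned ps output without a live process table.
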
@gemini-code-assist
Contributor

Summary of Changes

Hello, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request significantly enhances the diagnostic capabilities of the killall.py script, particularly in Continuous Integration (CI) environments where GPU memory cleanup can be problematic. By introducing checks for PID namespaces and detailed process listing upon kill failures, the changes aim to provide clearer insights into why processes cannot be terminated, thereby streamlining the debugging of persistent CI issues related to GPU resource management.

Highlights

  • Enhanced PID Namespace Diagnostics: Implemented a new function, _check_pid_namespace, to determine if a process ID (PID) belongs to the current PID namespace when killall.py fails to terminate it. This helps diagnose issues where processes might be running in different containers.
  • Detailed Process Diagnostics: Added a _log_ps_diagnostic function to dump filtered ps auxf output when killall.py encounters a failure. This provides insight into what processes are actively running in the current container, aiding in debugging GPU memory cleanup problems.
  • Improved Error Logging in _kill_pids: Modified the _kill_pids function to call the new diagnostic functions (_check_pid_namespace and _log_ps_diagnostic) specifically when a ProcessLookupError or PermissionError occurs during a kill attempt. This ensures more comprehensive logging for failed process terminations.
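
The control flow in the last highlight can be sketched roughly as below. This is not the PR's `_kill_pids`; the name `kill_with_diagnostics`, the `on_failure` callback, and the injectable `kill` parameter are all hypothetical, introduced so that the sketch can be exercised without signalling real processes.

```python
import os
import signal


def kill_with_diagnostics(pids, on_failure, kill=os.kill):
    """Send SIGKILL to each PID; call on_failure(pid, exc) when it cannot be signalled."""
    results = {}
    for pid in pids:
        try:
            kill(pid, signal.SIGKILL)
            results[pid] = "killed"
        except (ProcessLookupError, PermissionError) as exc:
            # Record which error occurred, then hand off to the diagnostic hook
            # (for example a namespace check followed by a ps dump).
            results[pid] = type(exc).__name__
            on_failure(pid, exc)
    return results
```

On a clean run no exception is raised, so the hook is never called and no diagnostic output appears, which matches the second item of the test plan.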


Contributor

@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request enhances the killall.py script by adding new diagnostic capabilities. It introduces _check_pid_namespace to verify process PID namespaces and _log_ps_diagnostic to capture filtered ps auxf output. Both functions are called from _kill_pids to provide more detailed logging and troubleshooting information when os.kill fails. The review feedback makes two suggestions: catch OSError in _check_pid_namespace so that a wider range of failures is handled, and have _log_ps_diagnostic run ps with check=True and log stderr so that failures of the ps command itself are easier to debug.
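
The second suggestion could look something like the sketch below, assuming the diagnostic is run through `subprocess.run`. The helper name `run_diagnostic` and its `(ok, text)` return shape are hypothetical; only the `check=True` and stderr-logging ideas come from the review.

```python
import subprocess
import sys


def run_diagnostic(cmd, timeout=10):
    """Run a diagnostic command; return (ok, text), where text is stdout or an error note."""
    try:
        result = subprocess.run(
            cmd, capture_output=True, text=True, timeout=timeout, check=True
        )
    except subprocess.CalledProcessError as exc:
        # check=True turns a non-zero exit status into an exception that carries stderr.
        return False, f"exit {exc.returncode}: {exc.stderr.strip()}"
    except (OSError, subprocess.TimeoutExpired) as exc:
        # OSError covers a missing binary (FileNotFoundError) and permission problems.
        return False, f"{type(exc).__name__}: {exc}"
    return True, result.stdout


# Usage: the interpreter stands in for `ps auxf` so the example runs anywhere.
ok, text = run_diagnostic([sys.executable, "--version"])
print(ok, text.strip())
```

Catching `OSError` rather than only `FileNotFoundError` follows the same reasoning as the first suggestion: the diagnostic path should not be able to raise and mask the original kill failure.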

hnyls2002 and others added 7 commits March 25, 2026 03:49
- Box only shows kill summary (success/fail per PID)
- Namespace check and ps auxf diagnostics print after the box
- Retry loop silently retries kills, collects unkillable PIDs
- Unkillable PID summary shown in box, details after
…U memory check

- Add _find_sglang_pids_by_name() that scans /proc/*/cmdline for SGLang
  process patterns (matching killall_sglang.sh's pgrep patterns), catching
  processes not visible to nvidia-smi (e.g. stuck before CUDA init)
- Kill name-matched processes before GPU PID kill in _ci_mode()
- Merge _check_gpu_memory() and _log_gpu_memory() into one function with
  a log= parameter to eliminate code duplication
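
A rough sketch of the /proc/*/cmdline scan described in this commit, assuming Linux procfs. The patterns below are placeholders, since the pgrep patterns from killall_sglang.sh are not reproduced in this thread; `find_pids_by_cmdline` and its `proc_root` parameter are likewise hypothetical names, with `proc_root` added only so the scan can be pointed at a test directory.

```python
import os
import re

# Placeholder patterns; the real ones mirror killall_sglang.sh and are not shown here.
PATTERNS = [re.compile(r"sglang\.launch_server"), re.compile(r"sglang::")]


def find_pids_by_cmdline(patterns=PATTERNS, proc_root="/proc"):
    """Return sorted PIDs under proc_root whose command line matches any pattern."""
    matches = []
    try:
        entries = os.listdir(proc_root)
    except OSError:
        return matches  # no procfs available (for example, not Linux)
    for entry in entries:
        if not entry.isdigit():
            continue  # skip non-process entries such as "self" or "meminfo"
        pid = int(entry)
        if pid <= 1 or pid == os.getpid():
            continue  # never report init or the scanning process itself
        try:
            with open(os.path.join(proc_root, entry, "cmdline"), "rb") as f:
                raw = f.read()
        except OSError:
            continue  # process exited or is not readable
        # Arguments in /proc/<pid>/cmdline are separated by NUL bytes.
        cmdline = raw.replace(b"\0", b" ").decode("utf-8", "replace").strip()
        if any(p.search(cmdline) for p in patterns):
            matches.append(pid)
    return sorted(matches)
```

Because this relies on the command line rather than on nvidia-smi, it can also find a process that has not yet initialized CUDA, which is the case the commit message calls out.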
…ty guard, always output ps diagnostic

- Record actual exception type (ProcessLookupError/PermissionError) for unkillable PIDs instead of assuming "different namespace"
- Add pid > 1 and self-pid safety guard to retry loop inline kill
- Always output ps auxf diagnostic on GPU dirty failure, not only when unkillable PIDs exist
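
The safety guard and the exception-type bookkeeping from this commit might be combined along these lines. This is a sketch under assumptions: `is_safe_to_kill`, `retry_kill`, the `get_pids` callable, and the `attempts` and `delay` parameters are hypothetical and do not reflect the actual retry loop in killall.py.

```python
import os
import signal
import time


def is_safe_to_kill(pid, self_pid=None):
    """Refuse PID 0 and 1 (process group, init) and the caller's own PID."""
    self_pid = os.getpid() if self_pid is None else self_pid
    return pid > 1 and pid != self_pid


def retry_kill(get_pids, kill=os.kill, attempts=3, delay=0.0):
    """Retry kills; return {pid: exception type name} for PIDs that could not be signalled."""
    unkillable = {}
    for _ in range(attempts):
        pids = [p for p in get_pids() if is_safe_to_kill(p)]
        if not pids:
            break
        for pid in pids:
            try:
                kill(pid, signal.SIGKILL)
                unkillable.pop(pid, None)  # a later success clears an earlier failure
            except (ProcessLookupError, PermissionError) as exc:
                # Store the actual error instead of assuming "different namespace".
                unkillable[pid] = type(exc).__name__
        time.sleep(delay)
    return unkillable
```

Recording the exception name keeps the summary honest: a PermissionError points at ownership or capabilities, while a ProcessLookupError is the case consistent with a PID from another container.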
@hnyls2002 hnyls2002 force-pushed the lsyin/add-aux-info branch from 8d77aa1 to a0075a6 March 26, 2026 04:55
@hnyls2002 hnyls2002 merged commit 79db3be into main Mar 26, 2026
124 of 165 checks passed
@hnyls2002 hnyls2002 deleted the lsyin/add-aux-info branch March 26, 2026 06:57
satyamk7054 pushed a commit to satyamk7054/sglang that referenced this pull request Apr 3, 2026
JustinTong0323 pushed a commit to JustinTong0323/sglang that referenced this pull request Apr 7, 2026