[CI] Add PID namespace and ps auxf diagnostics to killall.py#21401
Conversation
When kill fails on a PID reported by nvidia-smi, log whether the PID is in a different PID namespace (indicating it belongs to another container on the same host). Also dump filtered ps auxf output to show what processes are actually running in the current container. This helps diagnose cases where GPU memory remains dirty but the reported PID cannot be killed (ProcessLookupError), which happens when multiple CI runner containers share GPUs via --gpus all with soft CUDA_VISIBLE_DEVICES isolation.
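A minimal sketch of what such a namespace check could look like, assuming standard Linux /proc semantics: the `/proc/<pid>/ns/pid` symlink target encodes the namespace inode (e.g. `pid:[4026531836]`), so two processes share a PID namespace iff the targets match. The function name and return convention here are illustrative, not necessarily the PR's exact `_check_pid_namespace` code:

```python
import os

def check_pid_namespace(pid: int) -> bool:
    """Return True if `pid` appears to share our PID namespace.

    Compares /proc/<pid>/ns/pid symlink targets; on Linux these look
    like 'pid:[4026531836]' and are equal iff the namespaces match.
    """
    try:
        ours = os.readlink("/proc/self/ns/pid")
        theirs = os.readlink(f"/proc/{pid}/ns/pid")
    except OSError:
        # PID is gone, or its /proc entry is unreadable from this
        # container; either way it is not visible/killable from here.
        return False
    return ours == theirs
```

Note that a PID reported by `nvidia-smi` on the host side may not even have a `/proc` entry inside the container, so the `OSError` branch (rather than a mismatch) is the common signal for "another container's process".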
Code Review
This pull request enhances the killall.py script by adding new diagnostic capabilities. It introduces _check_pid_namespace to verify process PID namespaces and _log_ps_diagnostic to capture filtered ps auxf output. These functions are integrated into _kill_pids to provide more detailed logging and troubleshooting information when os.kill operations fail. Feedback suggests improving error handling in _check_pid_namespace by catching OSError for broader robustness and enhancing _log_ps_diagnostic to better handle ps command failures by using check=True and logging stderr for improved debugging.
- Box only shows kill summary (success/fail per PID)
- Namespace check and ps auxf diagnostics print after the box
- Retry loop silently retries kills, collects unkillable PIDs
- Unkillable PID summary shown in box, details after
…U memory check

- Add _find_sglang_pids_by_name() that scans /proc/*/cmdline for SGLang process patterns (matching killall_sglang.sh's pgrep patterns), catching processes not visible to nvidia-smi (e.g. stuck before CUDA init)
- Kill name-matched processes before GPU PID kill in _ci_mode()
- Merge _check_gpu_memory() and _log_gpu_memory() into one function with a log= parameter to eliminate code duplication
…ty guard, always output ps diagnostic

- Record actual exception type (ProcessLookupError/PermissionError) for unkillable PIDs instead of assuming "different namespace"
- Add pid > 1 and self-pid safety guard to retry loop inline kill
- Always output ps auxf diagnostic on GPU dirty failure, not only when unkillable PIDs exist
Force-pushed from 8d77aa1 to a0075a6
Summary
- When killall.py fails to kill a PID reported by nvidia-smi, log whether the PID belongs to a different PID namespace (i.e. another container on the same host)
- Dump filtered ps auxf output on kill failure to show what processes are actually running in the current container
- Helps diagnose cases where GPU memory remains dirty but the reported PID cannot be killed (ProcessLookupError), which happens when runner containers share GPUs via --gpus all with soft CUDA_VISIBLE_DEVICES isolation

Test plan