feat: add MPI RUN prefix to trtllm launch scripts #3546
base: main
Conversation
👋 Hi zhjunqin! Thank you for contributing to ai-dynamo/dynamo.
Walkthrough
Adds an optional MPI_CMD environment variable (default empty) across TRT-LLM launch scripts and prepends it to python invocations, enabling MPI-wrapped execution of dynamo.trtllm workers without altering other control flow.
Sequence Diagram(s)
```mermaid
sequenceDiagram
    autonumber
    participant U as Operator
    participant L as Launch Script
    participant M as MPI (optional)
    participant P as dynamo.trtllm (python)
    U->>L: Run launch script
    alt MPI_CMD set
        L->>M: Invoke "$MPI_CMD python3 -m dynamo.trtllm ..."
        M->>P: Start module with provided args
    else MPI_CMD empty
        L->>P: Invoke "python3 -m dynamo.trtllm ..."
    end
    P-->>L: Worker lifecycle (encode/prefill/decode)
```
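In script terms, the change amounts to a prefix on the worker invocation. A minimal sketch of the pattern (the worker arguments and backgrounding are illustrative, not copied from the real scripts):

```bash
#!/bin/bash
# Optional MPI launcher prefix; empty by default so existing behavior is unchanged.
MPI_CMD=${MPI_CMD:-""}

# The arguments below are placeholders; each real launch script passes its own flags.
# e.g. MPI_CMD="mpirun -np 2" ./agg.sh
$MPI_CMD python3 -m dynamo.trtllm "$@" &
wait
```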
Estimated code review effort: 🎯 2 (Simple) | ⏱️ ~10 minutes
Pre-merge checks: ✅ Passed checks (3 passed)
Actionable comments posted: 0
♻️ Duplicate comments (1)
components/backends/trtllm/launch/gpt_oss_disagg.sh (1)
22-35: Same MPI env propagation caveat as noted in disagg_router.sh. If using OpenMPI across nodes, include `-x CUDA_VISIBLE_DEVICES` in MPI_CMD to ensure ranks inherit GPU masks.
Also applies to: 37-48
🧹 Nitpick comments (5)
components/backends/trtllm/launch/disagg_router.sh (1)
39-45: Ensure env vars propagate to MPI ranks (OpenMPI needs -x)
With `CUDA_VISIBLE_DEVICES=... $MPI_CMD python3 ...`, the inline env is set for the launcher. On OpenMPI across nodes, ranks may not inherit it unless you pass `-x CUDA_VISIBLE_DEVICES` (and any other inline vars) in MPI_CMD. Example: `MPI_CMD='mpirun -np 2 -x CUDA_VISIBLE_DEVICES'`
Action:
- Document this requirement in README or comments here, or ensure users include the proper `-x` flags when setting MPI_CMD. For Slurm `srun`, env usually propagates by default.
Would you like me to add a brief comment/example to these scripts?
Also applies to: 49-55
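As a concrete illustration of the suggestion (host file and rank count are hypothetical):

```bash
# -x forwards the launcher's CUDA_VISIBLE_DEVICES (set inline by the script) to each rank.
export MPI_CMD="mpirun -np 2 --hostfile hosts.txt -x CUDA_VISIBLE_DEVICES"
./components/backends/trtllm/launch/disagg_router.sh
```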
components/backends/trtllm/launch/agg.sh (1)
29-33: Optional: add strict mode for consistency
Consider `set -euo pipefail` (gpt_oss_disagg.sh already uses `set -e`) to fail fast on errors in all launch scripts.
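A possible header for the scripts if the suggestion is adopted (sketch only):

```bash
#!/bin/bash
# Exit on errors (-e), treat unset variables as errors (-u), and fail pipelines on any stage (-o pipefail).
set -euo pipefail

# With -u in effect, give MPI_CMD an explicit empty default before first use.
MPI_CMD=${MPI_CMD:-""}
```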
components/backends/trtllm/launch/epd_disagg.sh (1)
36-45: Propagate env to MPI ranks
Inline `CUDA_VISIBLE_DEVICES=... $MPI_CMD python3 ...` sets the var for the launcher; ranks may not inherit it on OpenMPI unless `-x CUDA_VISIBLE_DEVICES` is included in MPI_CMD. Recommend documenting this expectation for multi-node runs.
Also applies to: 48-56, 59-66
components/backends/trtllm/launch/agg_metrics.sh (1)
26-32: Propagate DYN_SYSTEM_* to MPI ranks
`DYN_SYSTEM_ENABLED=true DYN_SYSTEM_PORT=8081 $MPI_CMD ...` sets the env for the launcher. On OpenMPI, add `-x DYN_SYSTEM_ENABLED -x DYN_SYSTEM_PORT` to MPI_CMD if running across nodes so ranks see these.
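For instance (rank count is hypothetical):

```bash
# Forward the DYN_SYSTEM_* settings to every rank on OpenMPI.
export MPI_CMD="mpirun -np 2 -x DYN_SYSTEM_ENABLED -x DYN_SYSTEM_PORT"
./components/backends/trtllm/launch/agg_metrics.sh
```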
components/backends/trtllm/launch/disagg.sh (1)
33-40: Propagate CUDA_VISIBLE_DEVICES to MPI ranks
For OpenMPI multi-node runs, ensure `MPI_CMD` includes `-x CUDA_VISIBLE_DEVICES` so ranks inherit the GPU mask (or document this requirement).
Also applies to: 43-49
📜 Review details
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (7)
components/backends/trtllm/launch/agg.sh (2 hunks)
components/backends/trtllm/launch/agg_metrics.sh (2 hunks)
components/backends/trtllm/launch/agg_router.sh (2 hunks)
components/backends/trtllm/launch/disagg.sh (3 hunks)
components/backends/trtllm/launch/disagg_router.sh (3 hunks)
components/backends/trtllm/launch/epd_disagg.sh (4 hunks)
components/backends/trtllm/launch/gpt_oss_disagg.sh (3 hunks)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
- GitHub Check: Build and Test - dynamo
🔇 Additional comments (8)
components/backends/trtllm/launch/disagg_router.sh (1)
13-13: MPI_CMD default looks good
Optional MPI wrapper without changing defaults. LGTM.
components/backends/trtllm/launch/agg.sh (1)
12-12: MPI_CMD default looks good
Optional MPI wrapper without changing defaults. LGTM.
components/backends/trtllm/launch/agg_router.sh (2)
9-9: MPI_CMD default looks good
Optional MPI wrapper without changing defaults. LGTM.
26-30: LGTM
Worker launch correctly prefixed with MPI_CMD.
components/backends/trtllm/launch/gpt_oss_disagg.sh (1)
11-11: MPI_CMD default looks good
Optional MPI wrapper without changing defaults. LGTM.
components/backends/trtllm/launch/epd_disagg.sh (1)
19-19: MPI_CMD default looks good
Optional MPI wrapper without changing defaults. LGTM.
components/backends/trtllm/launch/agg_metrics.sh (1)
10-10: MPI_CMD default looks good
Optional MPI wrapper without changing defaults. LGTM.
components/backends/trtllm/launch/disagg.sh (1)
16-16: MPI_CMD default looks good
Optional MPI wrapper without changing defaults. LGTM.
Signed-off-by: Django Zhang <[email protected]>
ccdd71b to 7e0f828
Overview:
Add $MPI_CMD to all trtllm launch scripts.
Details:
To make it easier to run the trtllm launch scripts on a Slurm cluster, you only need to set the new environment variable MPI_CMD.
MPI_CMD defaults to empty, so the default behavior does not change.
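For example, on a Slurm cluster (node counts and flags are illustrative; srun propagates the environment to tasks by default):

```bash
# Wrap each worker in srun instead of launching python3 directly.
MPI_CMD="srun --nodes=2 --ntasks-per-node=1" ./components/backends/trtllm/launch/disagg.sh
```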
Where should the reviewer start?
N/A
Related Issues: (use one of the action keywords Closes / Fixes / Resolves / Relates to)
N/A