
Conversation


@zhjunqin zhjunqin commented Oct 10, 2025

Overview:

Add $MPI_CMD to all trtllm launch scripts.

Details:

This makes the trtllm launch scripts easier to run on a Slurm cluster: just set the new MPI_CMD environment variable.

MPI_CMD defaults to empty, so the default behavior is unchanged.
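
For example, on a Slurm cluster (an editor's sketch, not part of the PR itself; the srun/mpirun flags shown are standard, but the right values depend on the cluster and model):

```bash
# Default: MPI_CMD is empty, so workers start under plain python3 as before.
./components/backends/trtllm/launch/agg.sh

# Slurm: wrap each worker in srun without editing the script.
MPI_CMD="srun -n 1 --mpi=pmix" ./components/backends/trtllm/launch/agg.sh

# OpenMPI: the same idea with mpirun.
MPI_CMD="mpirun -np 1" ./components/backends/trtllm/launch/agg.sh
```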

Where should the reviewer start?

N/A

Related Issues: (use one of the action keywords Closes / Fixes / Resolves / Relates to)

N/A

Summary by CodeRabbit

  • New Features
    • Added optional MPI integration across TRT-LLM launch scripts (agg, agg_metrics, agg_router, disagg, disagg_router, epd_disagg, gpt_oss_disagg) via a new MPI_CMD environment variable.
    • When MPI_CMD is set, worker processes (e.g., encode/prefill/decode) run under the specified MPI command; leaving it empty preserves existing behavior.
    • Enables easier scaling with MPI without changing default launch workflows; the pattern is sketched below.
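
Each script follows essentially this shape (a minimal sketch, not the literal script contents; "$@" is a stand-in for each script's unchanged arguments):

```bash
export MPI_CMD=${MPI_CMD:-""}   # new: empty by default

# $MPI_CMD is expanded unquoted on purpose: a value like "mpirun -np 2"
# must word-split into a command plus its flags. When MPI_CMD is empty,
# the line degenerates to the original plain python3 invocation.
$MPI_CMD python3 -m dynamo.trtllm "$@" &
```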


copy-pr-bot bot commented Oct 10, 2025

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.


👋 Hi zhjunqin! Thank you for contributing to ai-dynamo/dynamo.

Just a reminder: the NVIDIA Test GitHub Validation CI runs an essential subset of the testing framework to catch errors quickly. Your PR reviewers may elect to test the changes comprehensively before approving them.

🚀

@github-actions github-actions bot added the external-contribution label (Pull request is from an external contributor) on Oct 10, 2025
@zhjunqin zhjunqin changed the title from "add MPI RUN prefix to trtllm launch scripts" to "feat: add MPI RUN prefix to trtllm launch scripts" on Oct 10, 2025
@github-actions github-actions bot added the feat label on Oct 10, 2025

coderabbitai bot commented Oct 10, 2025

Walkthrough

Adds an optional MPI_CMD environment variable (default empty) across TRT-LLM launch scripts and prepends it to python invocations, enabling MPI-wrapped execution of dynamo.trtllm workers without altering other control flow.

Changes

| Cohort / File(s) | Summary of Changes |
| --- | --- |
| Aggregated launch scripts — MPI wrapper: components/backends/trtllm/launch/agg.sh, .../agg_metrics.sh, .../agg_router.sh | Export MPI_CMD (default ""), and prefix worker launches with $MPI_CMD python3 -m dynamo.trtllm ... instead of plain python3. |
| Disaggregated launch scripts — MPI wrapper: components/backends/trtllm/launch/disagg.sh, .../disagg_router.sh, .../epd_disagg.sh, .../gpt_oss_disagg.sh | Export MPI_CMD (default ""), and prepend $MPI_CMD to encode/prefill/decode worker invocations, keeping other arguments unchanged. |

Sequence Diagram(s)

```mermaid
sequenceDiagram
  autonumber
  participant U as Operator
  participant L as Launch Script
  participant M as MPI (optional)
  participant P as dynamo.trtllm (python)

  U->>L: Run launch script
  alt MPI_CMD set
    L->>M: Invoke "$MPI_CMD python3 -m dynamo.trtllm ..."
    M->>P: Start module with provided args
  else MPI_CMD empty
    L->>P: Invoke "python3 -m dynamo.trtllm ..."
  end
  P-->>L: Worker lifecycle (encode/prefill/decode)
```

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~10 minutes

Poem

I twitch my whiskers, press “mpirun” with glee,
A hop, a skip—now workers launch in harmony.
If empty, I scamper the usual way,
If set, I bound in parallel play.
Carrots queued, ranks aligned—
Cluster chorus, neatly timed. 🥕✨

Pre-merge checks

✅ Passed checks (3 passed)
| Check name | Status | Explanation |
| --- | --- | --- |
| Title Check | ✅ Passed | The title succinctly and accurately captures the primary change of adding an MPI run prefix to all trtllm launch scripts, making it clear and specific without extraneous detail or vague language. |
| Description Check | ✅ Passed | The pull request description adheres to the repository's template by including the Overview, Details, Where should the reviewer start, and Related Issues sections, and it clearly explains the purpose and change even though some entries are minimal with "N/A" placeholders. |
| Docstring Coverage | ✅ Passed | Docstring coverage is 100.00%, which is sufficient; the required threshold is 80.00%. |

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.


Comment @coderabbitai help to get the list of available commands and usage tips.


@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 0

♻️ Duplicate comments (1)
components/backends/trtllm/launch/gpt_oss_disagg.sh (1)

22-35: Same MPI env propagation caveat as noted in disagg_router.sh

If using OpenMPI across nodes, include -x CUDA_VISIBLE_DEVICES in MPI_CMD to ensure ranks inherit GPU masks.

Also applies to: 37-48

🧹 Nitpick comments (5)
components/backends/trtllm/launch/disagg_router.sh (1)

39-45: Ensure env vars propagate to MPI ranks (OpenMPI needs -x)

With CUDA_VISIBLE_DEVICES=... $MPI_CMD python3 ..., the inline env is set for the launcher. On OpenMPI across nodes, ranks may not inherit unless you pass -x CUDA_VISIBLE_DEVICES (and any other inline vars) in MPI_CMD. Example:

  • MPI_CMD='mpirun -np 2 -x CUDA_VISIBLE_DEVICES'

Action:

  • Document this requirement in README or comments here, or ensure users include the proper -x flags when setting MPI_CMD. For Slurm srun, env usually propagates by default.

Would you like me to add a brief comment/example to these scripts?

Also applies to: 49-55
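
To make the two launcher flavors concrete (an editor's sketch; the mpirun/srun flags are standard, and the script path matches the PR):

```bash
# OpenMPI: inline env vars bind to mpirun itself, so forward them to ranks.
MPI_CMD='mpirun -np 2 -x CUDA_VISIBLE_DEVICES' \
  ./components/backends/trtllm/launch/disagg_router.sh

# Slurm: srun exports the submitting environment to its tasks by default.
MPI_CMD='srun -n 2' \
  ./components/backends/trtllm/launch/disagg_router.sh
```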

components/backends/trtllm/launch/agg.sh (1)

29-33: Optional: add strict mode for consistency

Consider set -euo pipefail (gpt_oss_disagg.sh already uses set -e) so all launch scripts fail fast on errors.
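
As an illustration, a strict-mode preamble consistent with that suggestion might look like this (an editor's sketch, not the PR's current code):

```bash
#!/usr/bin/env bash
# Strict mode: exit on any error (-e), treat unset variables as errors (-u),
# and make a pipeline fail if any stage fails (pipefail).
set -euo pipefail

# Safe under 'set -u' because of the explicit default.
export MPI_CMD=${MPI_CMD:-""}
```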

components/backends/trtllm/launch/epd_disagg.sh (1)

36-45: Propagate env to MPI ranks

Inline CUDA_VISIBLE_DEVICES=... $MPI_CMD python3 ... sets the var for the launcher; ranks may not inherit on OpenMPI unless -x CUDA_VISIBLE_DEVICES is included in MPI_CMD. Recommend documenting this expectation for multi-node runs.

Also applies to: 48-56, 59-66

components/backends/trtllm/launch/agg_metrics.sh (1)

26-32: Propagate DYN_SYSTEM_* to MPI ranks

DYN_SYSTEM_ENABLED=true DYN_SYSTEM_PORT=8081 $MPI_CMD ... sets env for the launcher. On OpenMPI, add -x DYN_SYSTEM_ENABLED -x DYN_SYSTEM_PORT to MPI_CMD if running across nodes so ranks see these.
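
For example (a sketch using standard mpirun -x forwarding; the script path matches the PR, and the rank count is illustrative):

```bash
# -x without =value forwards the variable's current value, which the
# script sets inline just before invoking $MPI_CMD.
MPI_CMD='mpirun -np 1 -x DYN_SYSTEM_ENABLED -x DYN_SYSTEM_PORT' \
  ./components/backends/trtllm/launch/agg_metrics.sh
```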

components/backends/trtllm/launch/disagg.sh (1)

33-40: Propagate CUDA_VISIBLE_DEVICES to MPI ranks

For OpenMPI multi-node runs, ensure MPI_CMD includes -x CUDA_VISIBLE_DEVICES so ranks inherit the GPU mask (or document this requirement).

Also applies to: 43-49

📜 Review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 0a2a820 and ccdd71b.

📒 Files selected for processing (7)
  • components/backends/trtllm/launch/agg.sh (2 hunks)
  • components/backends/trtllm/launch/agg_metrics.sh (2 hunks)
  • components/backends/trtllm/launch/agg_router.sh (2 hunks)
  • components/backends/trtllm/launch/disagg.sh (3 hunks)
  • components/backends/trtllm/launch/disagg_router.sh (3 hunks)
  • components/backends/trtllm/launch/epd_disagg.sh (4 hunks)
  • components/backends/trtllm/launch/gpt_oss_disagg.sh (3 hunks)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
  • GitHub Check: Build and Test - dynamo
🔇 Additional comments (8)
components/backends/trtllm/launch/disagg_router.sh (1)

13-13: MPI_CMD default looks good

Optional MPI wrapper without changing defaults. LGTM.

components/backends/trtllm/launch/agg.sh (1)

12-12: MPI_CMD default looks good

Optional MPI wrapper without changing defaults. LGTM.

components/backends/trtllm/launch/agg_router.sh (2)

9-9: MPI_CMD default looks good

Optional MPI wrapper without changing defaults. LGTM.


26-30: LGTM

Worker launch correctly prefixed with MPI_CMD.

components/backends/trtllm/launch/gpt_oss_disagg.sh (1)

11-11: MPI_CMD default looks good

Optional MPI wrapper without changing defaults. LGTM.

components/backends/trtllm/launch/epd_disagg.sh (1)

19-19: MPI_CMD default looks good

Optional MPI wrapper without changing defaults. LGTM.

components/backends/trtllm/launch/agg_metrics.sh (1)

10-10: MPI_CMD default looks good

Optional MPI wrapper without changing defaults. LGTM.

components/backends/trtllm/launch/disagg.sh (1)

16-16: MPI_CMD default looks good

Optional MPI wrapper without changing defaults. LGTM.

@zhjunqin zhjunqin force-pushed the add-trtllm-mpirun-prefix branch from ccdd71b to 7e0f828 on October 11, 2025 at 01:37

Labels

external-contribution (Pull request is from an external contributor), feat, size/M
