
Conversation


@zhjunqin zhjunqin commented Oct 10, 2025

Overview:

Add $MPI_CMD to all trtllm launch scripts.

Details:

This makes the trtllm launch scripts easier to run on a Slurm cluster: just set the new MPI_CMD environment variable.

MPI_CMD defaults to empty, so the default behavior is unchanged.
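
For example, on a Slurm cluster (an editor's sketch, not part of the PR itself; the srun/mpirun flags shown are standard, but the right values depend on the cluster and model):

```bash
# Default: MPI_CMD is empty, so workers start under plain python3 as before.
./components/backends/trtllm/launch/agg.sh

# Slurm: wrap each worker in srun without editing the script.
MPI_CMD="srun -n 1 --mpi=pmix" ./components/backends/trtllm/launch/agg.sh

# OpenMPI: the same idea with mpirun.
MPI_CMD="mpirun -np 1" ./components/backends/trtllm/launch/agg.sh
```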

Where should the reviewer start?

N/A

Related Issues: (use one of the action keywords Closes / Fixes / Resolves / Relates to)

N/A

Summary by CodeRabbit

  • New Features
    • Added optional MPI integration across TRT-LLM launch scripts (agg, agg_metrics, agg_router, disagg, disagg_router, epd_disagg, gpt_oss_disagg) via a new MPI_CMD environment variable.
    • When MPI_CMD is set, worker processes (e.g., encode/prefill/decode) run under the specified MPI command; leaving it empty preserves existing behavior.
    • Enables easier scaling with MPI without changing default launch workflows; the pattern is sketched below.
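
Each script follows essentially this shape (a minimal sketch, not the literal script contents; "$@" is a stand-in for each script's unchanged arguments):

```bash
export MPI_CMD=${MPI_CMD:-""}   # new: empty by default

# $MPI_CMD is expanded unquoted on purpose: a value like "mpirun -np 2"
# must word-split into a command plus its flags. When MPI_CMD is empty,
# the line degenerates to the original plain python3 invocation.
$MPI_CMD python3 -m dynamo.trtllm "$@" &
```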


copy-pr-bot bot commented Oct 10, 2025

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.


👋 Hi zhjunqin! Thank you for contributing to ai-dynamo/dynamo.

Just a reminder: the NVIDIA Test GitHub Validation CI runs an essential subset of the testing framework to catch errors quickly. Your PR reviewers may elect to test the changes comprehensively before approving them.

🚀

@github-actions github-actions bot added the external-contribution label (Pull request is from an external contributor) on Oct 10, 2025
@zhjunqin zhjunqin changed the title from "add MPI RUN prefix to trtllm launch scripts" to "feat: add MPI RUN prefix to trtllm launch scripts" on Oct 10, 2025
@github-actions github-actions bot added the feat label on Oct 10, 2025

coderabbitai bot commented Oct 10, 2025

Walkthrough

Adds an optional MPI_CMD environment variable (default empty) across TRT-LLM launch scripts and prepends it to python invocations, enabling MPI-wrapped execution of dynamo.trtllm workers without altering other control flow.

Changes

| Cohort / File(s) | Summary of Changes |
| --- | --- |
| Aggregated launch scripts — MPI wrapper: components/backends/trtllm/launch/agg.sh, .../agg_metrics.sh, .../agg_router.sh | Export MPI_CMD (default ""), and prefix worker launches with $MPI_CMD python3 -m dynamo.trtllm ... instead of plain python3. |
| Disaggregated launch scripts — MPI wrapper: components/backends/trtllm/launch/disagg.sh, .../disagg_router.sh, .../epd_disagg.sh, .../gpt_oss_disagg.sh | Export MPI_CMD (default ""), and prepend $MPI_CMD to encode/prefill/decode worker invocations, keeping other arguments unchanged. |

Sequence Diagram(s)

```mermaid
sequenceDiagram
  autonumber
  participant U as Operator
  participant L as Launch Script
  participant M as MPI (optional)
  participant P as dynamo.trtllm (python)

  U->>L: Run launch script
  alt MPI_CMD set
    L->>M: Invoke "$MPI_CMD python3 -m dynamo.trtllm ..."
    M->>P: Start module with provided args
  else MPI_CMD empty
    L->>P: Invoke "python3 -m dynamo.trtllm ..."
  end
  P-->>L: Worker lifecycle (encode/prefill/decode)
```

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~10 minutes

Poem

I twitch my whiskers, press “mpirun” with glee,
A hop, a skip—now workers launch in harmony.
If empty, I scamper the usual way,
If set, I bound in parallel play.
Carrots queued, ranks aligned—
Cluster chorus, neatly timed. 🥕✨

Pre-merge checks

✅ Passed checks (3 passed)
| Check name | Status | Explanation |
| --- | --- | --- |
| Title Check | ✅ Passed | The title succinctly and accurately captures the primary change of adding an MPI run prefix to all trtllm launch scripts, making it clear and specific without extraneous detail or vague language. |
| Description Check | ✅ Passed | The pull request description adheres to the repository's template by including the Overview, Details, Where should the reviewer start, and Related Issues sections, and it clearly explains the purpose and change even though some entries are minimal with "N/A" placeholders. |
| Docstring Coverage | ✅ Passed | Docstring coverage is 100.00%, which is sufficient; the required threshold is 80.00%. |

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.


Comment @coderabbitai help to get the list of available commands and usage tips.


@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 0

♻️ Duplicate comments (1)
components/backends/trtllm/launch/gpt_oss_disagg.sh (1)

22-35: Same MPI env propagation caveat as noted in disagg_router.sh

If using OpenMPI across nodes, include -x CUDA_VISIBLE_DEVICES in MPI_CMD to ensure ranks inherit GPU masks.

Also applies to: 37-48

🧹 Nitpick comments (5)
components/backends/trtllm/launch/disagg_router.sh (1)

39-45: Ensure env vars propagate to MPI ranks (OpenMPI needs -x)

With CUDA_VISIBLE_DEVICES=... $MPI_CMD python3 ..., the inline env is set for the launcher. On OpenMPI across nodes, ranks may not inherit unless you pass -x CUDA_VISIBLE_DEVICES (and any other inline vars) in MPI_CMD. Example:

  • MPI_CMD='mpirun -np 2 -x CUDA_VISIBLE_DEVICES'

Action:

  • Document this requirement in README or comments here, or ensure users include the proper -x flags when setting MPI_CMD. For Slurm srun, env usually propagates by default.

Would you like me to add a brief comment/example to these scripts?

Also applies to: 49-55
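
To make the two launcher flavors concrete (an editor's sketch; the mpirun/srun flags are standard, and the script path matches the PR):

```bash
# OpenMPI: inline env vars bind to mpirun itself, so forward them to ranks.
MPI_CMD='mpirun -np 2 -x CUDA_VISIBLE_DEVICES' \
  ./components/backends/trtllm/launch/disagg_router.sh

# Slurm: srun exports the submitting environment to its tasks by default.
MPI_CMD='srun -n 2' \
  ./components/backends/trtllm/launch/disagg_router.sh
```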

components/backends/trtllm/launch/agg.sh (1)

29-33: Optional: add strict mode for consistency

Consider set -euo pipefail (gpt_oss_disagg.sh already uses set -e) so all launch scripts fail fast on errors.
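
As an illustration, a strict-mode preamble consistent with that suggestion might look like this (an editor's sketch, not the PR's current code):

```bash
#!/usr/bin/env bash
# Strict mode: exit on any error (-e), treat unset variables as errors (-u),
# and make a pipeline fail if any stage fails (pipefail).
set -euo pipefail

# Safe under 'set -u' because of the explicit default.
export MPI_CMD=${MPI_CMD:-""}
```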

components/backends/trtllm/launch/epd_disagg.sh (1)

36-45: Propagate env to MPI ranks

Inline CUDA_VISIBLE_DEVICES=... $MPI_CMD python3 ... sets the var for the launcher; ranks may not inherit on OpenMPI unless -x CUDA_VISIBLE_DEVICES is included in MPI_CMD. Recommend documenting this expectation for multi-node runs.

Also applies to: 48-56, 59-66

components/backends/trtllm/launch/agg_metrics.sh (1)

26-32: Propagate DYN_SYSTEM_* to MPI ranks

DYN_SYSTEM_ENABLED=true DYN_SYSTEM_PORT=8081 $MPI_CMD ... sets env for the launcher. On OpenMPI, add -x DYN_SYSTEM_ENABLED -x DYN_SYSTEM_PORT to MPI_CMD if running across nodes so ranks see these.
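
For example (a sketch using standard mpirun -x forwarding; the script path matches the PR, and the rank count is illustrative):

```bash
# -x without =value forwards the variable's current value, which the
# script sets inline just before invoking $MPI_CMD.
MPI_CMD='mpirun -np 1 -x DYN_SYSTEM_ENABLED -x DYN_SYSTEM_PORT' \
  ./components/backends/trtllm/launch/agg_metrics.sh
```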

components/backends/trtllm/launch/disagg.sh (1)

33-40: Propagate CUDA_VISIBLE_DEVICES to MPI ranks

For OpenMPI multi-node runs, ensure MPI_CMD includes -x CUDA_VISIBLE_DEVICES so ranks inherit the GPU mask (or document this requirement).

Also applies to: 43-49

📜 Review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 0a2a820 and ccdd71b.

📒 Files selected for processing (7)
  • components/backends/trtllm/launch/agg.sh (2 hunks)
  • components/backends/trtllm/launch/agg_metrics.sh (2 hunks)
  • components/backends/trtllm/launch/agg_router.sh (2 hunks)
  • components/backends/trtllm/launch/disagg.sh (3 hunks)
  • components/backends/trtllm/launch/disagg_router.sh (3 hunks)
  • components/backends/trtllm/launch/epd_disagg.sh (4 hunks)
  • components/backends/trtllm/launch/gpt_oss_disagg.sh (3 hunks)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
  • GitHub Check: Build and Test - dynamo
🔇 Additional comments (8)
components/backends/trtllm/launch/disagg_router.sh (1)

13-13: MPI_CMD default looks good

Optional MPI wrapper without changing defaults. LGTM.

components/backends/trtllm/launch/agg.sh (1)

12-12: MPI_CMD default looks good

Optional MPI wrapper without changing defaults. LGTM.

components/backends/trtllm/launch/agg_router.sh (2)

9-9: MPI_CMD default looks good

Optional MPI wrapper without changing defaults. LGTM.


26-30: LGTM

Worker launch correctly prefixed with MPI_CMD.

components/backends/trtllm/launch/gpt_oss_disagg.sh (1)

11-11: MPI_CMD default looks good

Optional MPI wrapper without changing defaults. LGTM.

components/backends/trtllm/launch/epd_disagg.sh (1)

19-19: MPI_CMD default looks good

Optional MPI wrapper without changing defaults. LGTM.

components/backends/trtllm/launch/agg_metrics.sh (1)

10-10: MPI_CMD default looks good

Optional MPI wrapper without changing defaults. LGTM.

components/backends/trtllm/launch/disagg.sh (1)

16-16: MPI_CMD default looks good

Optional MPI wrapper without changing defaults. LGTM.

@zhjunqin zhjunqin force-pushed the add-trtllm-mpirun-prefix branch from ccdd71b to 7e0f828 on October 11, 2025 at 01:37

Labels

external-contribution (Pull request is from an external contributor), feat, size/M
