
Conversation


@yuki-97 yuki-97 commented Sep 5, 2025

What does this PR do?

  1. Support DP inside vLLM to unblock EP, since vLLM's EP = DP * TP.
    1. vLLM's DP has extra overhead, so this PR only uses it when really needed (when EP > 1 and EP != TP); in all other cases DP is still controlled outside vLLM.
  2. Rename the param enable_expert_parallel to expert_parallel_size.
    1. expert_parallel_size defaults to 1, so existing behavior is unaffected.

How DP on the generation side works after this PR:

Assume we have 256 GPUs and run with TP = 8, EP = 64; then:
global_DP_SIZE = total_GPUs / model_parallel_size = 256 / 8 = 32
vLLM_DP_SIZE = EP / TP = 64 / 8 = 8 for each EP group
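The sizing above can be sketched as a small standalone function (variable names here are illustrative, not the actual NeMo RL code):

```python
def dp_sizes(total_gpus: int, tp: int, pp: int, ep: int):
    """Compute global DP and vLLM-internal DP as described in the PR text."""
    model_parallel_size = tp * pp
    assert total_gpus % model_parallel_size == 0, "world size must be a multiple of TP*PP"
    global_dp = total_gpus // model_parallel_size
    # vLLM's EP = DP * TP, so DP inside vLLM is only needed when EP > TP.
    assert ep % tp == 0, "EP must be a multiple of TP"
    vllm_dp = ep // tp
    return global_dp, vllm_dp

# The PR's example: 256 GPUs, TP=8, PP=1, EP=64 -> (32, 8)
print(dp_sizes(256, 8, 1, 64))
```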

Issues

Related #908.

Test Result

green: main branch baseline, TP32
blue: feature branch, using the same setting as baseline
red/pink: feature branch, TP8EP64

Convergence

Time cost
The EP setting is not fine-tuned, so generation time is slower than the baseline.

Summary by CodeRabbit

  • New Features

    • Switched expert-parallel setting from a boolean to expert_parallel_size (integer); added explicit sizing for TP/PP/EP/DP and automatic DP environment setup when applicable.
    • Per-worker address/port discovery injected into worker environments for coordinated launches.
  • Reliability

    • Improved cluster startup with retries, resource validation, timeouts, clearer errors, and runtime consistency checks.
  • Refactor

    • Example configs updated to use expert_parallel_size.
  • Tests

    • Test configs updated to the new expert_parallel_size setting.

terrykong previously approved these changes Sep 10, 2025

@terrykong terrykong left a comment


lgtm. @parthchadha to review

Signed-off-by: Yuki Huang <[email protected]>

fix type

Signed-off-by: Yuki Huang <[email protected]>

fix rank

Signed-off-by: Yuki Huang <[email protected]>

fix local rank

Signed-off-by: Yuki Huang <[email protected]>

coderabbitai bot commented Sep 16, 2025

Walkthrough

Configs replace boolean enable_expert_parallel with integer expert_parallel_size. vLLM generation and worker code introduce explicit TP/PP/EP/DP sizing, checks, device-UUID IPC distribution, and DP env wiring. Virtual cluster adds per-(pg,bundle) address/port retrieval with retries/timeouts; worker groups inject address/port lists into worker envs.

Changes

Cohort / File(s) Summary
Examples: config key change
examples/configs/.../*.yaml, examples/configs/recipes/llm/*, examples/configs/recipes/vlm/*, examples/configs/evals/eval.yaml
Replace enable_expert_parallel: false with expert_parallel_size: 1 in vLLM/dtensor generation config blocks across example YAMLs.
Virtual cluster addressing
nemo_rl/distributed/virtual_cluster.py
Add get_available_address_and_port(pg_idx,bundle_idx); make get_master_address_and_port() delegate to (0,0); add retries/backoff, resource checks, 180s readiness timeout, improved errors and doc updates.
Worker env provisioning
nemo_rl/distributed/worker_groups.py
Discover per-(pg,bundle) available addresses/ports and inject AVAILABLE_ADDR_LIST and AVAILABLE_PORT_LIST into each worker's environment when creating workers.
vLLM config typing
nemo_rl/models/generation/vllm/config.py
Rename VllmSpecificArgs.enable_expert_parallel: bool → expert_parallel_size: int.
vLLM generation orchestration
nemo_rl/models/generation/vllm/vllm_generation.py
Introduce explicit tp_size/pp_size/ep_size/model_parallel_size/dp_size/vllm_dp_size, add divisibility/consistency assertions and PP/EP constraints, set NCCL/CUMEM/VLLM DP env vars, add device UUID reporting and per-device IPC-handle distribution, and pass env vars to worker groups.
vLLM worker behavior
nemo_rl/models/generation/vllm/vllm_worker.py
Add public expert_parallel_size (from config); derive enable_expert_parallel = expert_parallel_size > 1; when expert_parallel_size > tensor_parallel_size, configure vLLM DP env vars (DP rank, local rank, DP master IP/port) using AVAILABLE_* lists.
Unit tests: config updates
tests/unit/.../*.py
Update test vLLM configs to use expert_parallel_size: 1 instead of enable_expert_parallel: False.
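The worker behavior summarized above can be sketched as follows. The environment-variable names match the summary, but the helper, its signature, and the DP-rank formula are illustrative assumptions, not the actual vllm_worker.py code:

```python
import os

def configure_vllm_dp_env(expert_parallel_size: int, tensor_parallel_size: int,
                          global_rank: int, addr_list: list, port_list: list) -> bool:
    """Illustrative sketch: derive EP flag and wire vLLM DP env vars."""
    # EP is enabled iff the configured size exceeds 1.
    enable_expert_parallel = expert_parallel_size > 1
    if expert_parallel_size > tensor_parallel_size:
        # vLLM's EP = DP * TP, so DP inside vLLM is needed here.
        vllm_dp_size = expert_parallel_size // tensor_parallel_size
        # Hypothetical rank derivation; the real mapping lives in vllm_worker.py.
        dp_rank = (global_rank // tensor_parallel_size) % vllm_dp_size
        os.environ["VLLM_DP_RANK"] = str(dp_rank)
        os.environ["VLLM_DP_RANK_LOCAL"] = str(dp_rank)  # simplification: single node
        # Master address/port come from the AVAILABLE_* lists injected by worker_groups.
        os.environ["VLLM_DP_MASTER_IP"] = addr_list[0]
        os.environ["VLLM_DP_MASTER_PORT"] = str(port_list[0])
    return enable_expert_parallel
```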

Sequence Diagram(s)

sequenceDiagram
  autonumber
  participant Orchestrator
  participant VirtualCluster
  participant WorkerGroups
  participant Worker as "vLLM Worker"

  Orchestrator->>VirtualCluster: init/create placement groups (retries/backoff, timeout)
  VirtualCluster-->>Orchestrator: placement groups ready + (ip,port) per bundle

  Orchestrator->>WorkerGroups: request workers (bundle_indices_list)
  loop per (pg_idx,bundle_idx)
    WorkerGroups->>VirtualCluster: get_available_address_and_port(pg_idx,bundle_idx)
    VirtualCluster-->>WorkerGroups: (ip, port)
  end
  WorkerGroups->>Worker: spawn with env { AVAILABLE_ADDR_LIST, AVAILABLE_PORT_LIST, ... }

  Worker->>Worker: read cfg.expert_parallel_size and tensor_parallel_size
  alt expert_parallel_size > tensor_parallel_size
    Worker->>Worker: compute DP sizing, set VLLM_DP_RANK / VLLM_DP_RANK_LOCAL
    Worker->>Worker: set VLLM_DP_MASTER_IP and VLLM_DP_MASTER_PORT from AVAILABLE_* lists
  end
  Worker-->>Orchestrator: worker ready
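The address/port discovery with retries shown in the diagram follows a generic retry-with-backoff pattern. A minimal standalone sketch (the real get_available_address_and_port resolves a per-bundle (ip, port) via Ray inside virtual_cluster.py; this version just binds an ephemeral local port):

```python
import socket
import time

def get_free_port_with_retries(max_attempts: int = 5, backoff_s: float = 0.1):
    """Retry-with-backoff sketch of the readiness pattern in the diagram."""
    last_err = None
    for attempt in range(max_attempts):
        try:
            with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
                s.bind(("", 0))  # the OS picks a free ephemeral port
                # Loopback for portability; the real method reports the node IP.
                return "127.0.0.1", s.getsockname()[1]
        except OSError as err:
            last_err = err
            time.sleep(backoff_s * (2 ** attempt))  # exponential backoff
    raise RuntimeError(f"no free port after {max_attempts} attempts") from last_err
```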

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

Suggested reviewers

  • yaoyu-33
  • terrykong

Poem

A rabbit taps keys with a parallel grin,
Flags turned to sizes — let the experts begin.
Ports and ranks arranged in tidy rows,
UUIDs whisper where each IPC goes.
Hoppity hops, clusters hum and spin. 🥕

Pre-merge checks and finishing touches

✅ Passed checks (3 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title Check ✅ Passed The title concisely and accurately summarizes the primary change — adding support for data-parallelism (DP) inside vLLM to enable expert parallelism (EP) — and aligns with the PR objectives and the described code changes (DP/EP handling, config key rename, worker/env updates). It is short, specific, and developer-focused, so it is appropriate for the changeset.
Docstring Coverage ✅ Passed Docstring coverage is 100.00% which is sufficient. The required threshold is 80.00%.


@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 8

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
examples/configs/recipes/llm/grpo-deepscaler-1.5b-24K.yaml (1)

41-49: No 'enable_expert_parallel' config key found; fix vllm config typing

  • Sweep: no configs/tests contain an "enable_expert_parallel" config key — it's only a runtime variable in nemo_rl/models/generation/vllm/vllm_worker.py (set from expert_parallel_size and passed to vLLM).
  • Action (critical): nemo_rl/models/generation/vllm/config.py currently declares "expert_parallel_size: bool" (line ~23) but configs/tests use integers. Change that to "expert_parallel_size: int" and run validation/tests.
🧹 Nitpick comments (22)
examples/configs/recipes/llm/grpo-gspo-deepscaler-1.5b-8K.yaml (1)

105-109: Consistent migration to expert_parallel_size.

EP=1 with TP=1 is a no-op; OK. Add the EP%TP note for clarity.

-      expert_parallel_size: 1
+      # EP > 1 requires EP % TP == 0
+      expert_parallel_size: 1
examples/configs/recipes/llm/grpo-qwen2.5-7b-instruct-4n8g-megatron.yaml (1)

151-154: Migration LGTM; EP disabled by default.

Here TP=4; keep a comment to avoid misconfig when enabling EP later.

-      expert_parallel_size: 1
+      # If you enable EP, set expert_parallel_size to a multiple of TP (=4 here)
+      expert_parallel_size: 1
examples/configs/recipes/llm/grpo-llama3.1-8b-instruct-2n8g-fsdp2tp1-noncolocated.yaml (1)

99-101: Rename is correct; matches PR semantics (EP=1 => vLLM DP off).

Optional: add the EP%TP constraint comment.

-      expert_parallel_size: 1
+      # EP > 1 requires EP % TP == 0
+      expert_parallel_size: 1
examples/configs/recipes/llm/grpo-qwen2.5-32b-32n8g-fsdp2tp8sp-actckpt.v3.yaml (1)

99-101: Good: default EP off; TP=4 here.

Add a guardrail comment to prevent invalid EP later.

-      expert_parallel_size: 1
+      # If enabling EP, choose a multiple of TP (=4)
+      expert_parallel_size: 1
examples/configs/recipes/llm/grpo-llama3.1-8b-instruct-4n8g-fsdp2tp1-long.v3.yaml (1)

99-101: Change is consistent; EP off by default.

Same optional comment suggestion.

-      expert_parallel_size: 1
+      # EP > 1 requires EP % TP == 0
+      expert_parallel_size: 1
examples/configs/recipes/llm/grpo-math-qwen3-30ba3b-megatron-tp4-32k.yaml (1)

133-138: Rename aligns with code; maintainers note for EP/TP constraint recommended.

TP=4 in this config; comment will help future edits.

-      expert_parallel_size: 1
+      # To enable EP, set to a multiple of TP (=4). Otherwise keep at 1.
+      expert_parallel_size: 1
examples/configs/recipes/llm/grpo-deepscaler-1.5b-8K.yaml (1)

104-108: LGTM; EP=1 keeps previous behavior.

Add the EP%TP reminder to reduce misconfiguration.

-      expert_parallel_size: 1
+      # EP > 1 requires EP % TP == 0
+      expert_parallel_size: 1
examples/configs/evals/eval.yaml (1)

24-24: EP size switch ACK; default safe for single-GPU eval.

Value 1 keeps EP disabled. Consider adding a comment that EP>1 on eval requires EP % TP == 0 and enough GPUs.

examples/configs/recipes/llm/grpo-gemma3-27b-it-8n8g-fsdp2tp8-actckpt-long.yaml (1)

99-99: Migration is correct; keep EP constraints visible to users.

Optional: add a YAML comment noting EP > 1 requires EP % TP == 0 and may enable vLLM-internal DP = EP/TP.

examples/configs/recipes/llm/grpo-llama3.2-1b-instruct-1n8g-megatron.yaml (1)

129-129: Good rename; default preserves behavior.

No functional change. If you plan to exercise the new DP-in-vLLM path in CI, consider adding a variant with EP > 1 and TP = 1 for coverage.

examples/configs/recipes/llm/grpo-qwen2.5-math-1.5b-instruct-1n8g-fsdp2tp1.v3.yaml (1)

99-99: Rename LGTM; add minimal guardrail note (optional).

A short inline comment like “1 disables EP; >1 requires EP % TP == 0” could prevent misconfigurations.

examples/configs/recipes/llm/grpo-qwen2.5-7b-instruct-4n8g-fsdp2tp4sp.v3.yaml (1)

99-99: Config update is correct.

Given TP=4, if someone sets EP>1 later, remind EP must be a multiple of 4; DP_in_vLLM = EP/4. Optional comment.

examples/configs/recipes/llm/grpo-deepscaler-1.5b-24K.yaml (1)

41-49: Rename to expert_parallel_size looks good; default preserves behavior.

Setting expert_parallel_size: 1 is backward‑compatible. When users tune EP > 1, please document that EP must be divisible by TP (EP % TP == 0) so vLLM DP = EP/TP is integral.

examples/configs/recipes/llm/grpo-llama3.2-1b-instruct-1n8g-fsdp2tp1.v3.yaml (1)

95-102: Good default for EP.

expert_parallel_size: 1 preserves previous behavior; note in docs that EP > 1 requires EP % TP == 0.

tests/unit/environments/test_code_environment.py (1)

51-62: Use boolean for enforce_eager (avoid string).

The current value is a string "False"; prefer boolean False to match runtime expectations.

Apply this diff:

         "gpu_memory_utilization": 0.6,
-        "enforce_eager": "False",
+        "enforce_eager": False,
tests/unit/models/generation/test_vllm_generation.py (2)

53-65: EP key migration OK; also fix enforce_eager type.

Key rename is correct. Please switch enforce_eager from string to boolean for consistency.

         "load_format": "auto",
-        "enforce_eager": "False",
+        "enforce_eager": False,

53-65: Add a minimal EP>1 test to cover new DP-in-vLLM path.

Parametrize a case with TP=1, EP=2 to validate layout checks and env wiring.

# Outside this block: add a new test
def test_vllm_ep_gt_1_layout(cluster, tokenizer):
    cfg = deepcopy(basic_vllm_test_config)
    cfg = configure_generation_config(cfg, tokenizer, is_eval=True)
    cfg["vllm_cfg"]["tensor_parallel_size"] = 1
    cfg["vllm_cfg"]["expert_parallel_size"] = 2  # triggers vLLM-internal DP
    v = VllmGeneration(cluster, cfg)
    v.finish_generation()  # init path exercises EP layout
    v.shutdown()
tests/unit/environments/test_retriever.py (1)

51-61: Set enforce_eager to boolean False.

Avoid quoting booleans in config dict.

         "gpu_memory_utilization": 0.6,
-        "enforce_eager": "False",
+        "enforce_eager": False,
nemo_rl/models/generation/vllm/vllm_generation.py (1)

154-163: Make vLLM DP env setting robust; optional NCCL resiliency

  • Gate VLLM_DP_SIZE on computed vllm_dp_size > 1 (covers future changes).
  • Optional: set NCCL_ASYNC_ERROR_HANDLING=1 for safer comms.

Apply this diff:

         if not self.cfg["colocated"]["enabled"]:
             env_vars["NCCL_CUMEM_ENABLE"] = "1"
-        # We should use vLLM DP if ep_size > tp_size since EP_SIZE = DP_SIZE * TP_SIZE in vLLM.
-        # See details in https://github.com/vllm-project/vllm/blob/main/examples/offline_inference/data_parallel.py
-        if self.ep_size > self.tp_size:
-            env_vars["VLLM_DP_SIZE"] = str(self.vllm_dp_size)
+        # Use vLLM internal DP when DP_SIZE > 1 (EP = DP * TP).
+        if self.vllm_dp_size > 1:
+            env_vars["VLLM_DP_SIZE"] = str(self.vllm_dp_size)
+        # Optional resiliency:
+        # env_vars.setdefault("NCCL_ASYNC_ERROR_HANDLING", "1")
nemo_rl/distributed/virtual_cluster.py (3)

354-372: Validate pg_idx and bundle_idx before indexing

Out-of-range indices will surface as IndexError. Fail fast with a clear message.

Apply this diff:

-        placement_groups = self.get_placement_groups()
-        if len(placement_groups) == 1:
-            pg = placement_groups[0]
-        else:
-            pg = placement_groups[pg_idx]
+        placement_groups = self.get_placement_groups()
+        if not placement_groups:
+            raise RuntimeError("No placement groups available")
+        if len(placement_groups) == 1:
+            pg = placement_groups[0]
+        else:
+            if not (0 <= pg_idx < len(placement_groups)):
+                raise ValueError(
+                    f"pg_idx out of range: {pg_idx} (available={len(placement_groups)})"
+                )
+            pg = placement_groups[pg_idx]
+        if not (0 <= bundle_idx < pg.bundle_count):
+            raise ValueError(
+                f"bundle_idx out of range: {bundle_idx} (bundle_count={pg.bundle_count})"
+            )

375-383: Confirm Ray behavior with num_cpus=0 + PG scheduling across Ray versions

Setting num_cpus=0 is intentional to avoid contention, but older Ray versions had quirks with PG enforcement on zero‑resource tasks. If you target older clusters, consider reserving a fractional CPU (e.g., 0.01) on the bundle.

Would you like a small compatibility matrix check script for your supported Ray versions?


386-388: Minor: tighten error message or use a dedicated exception

Message is fine; if you want to appease linters, shorten or use a custom exception type.

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 52f1d6a and 14b45e1.

📒 Files selected for processing (32)
  • examples/configs/evals/eval.yaml (1 hunks)
  • examples/configs/grpo_math_1B.yaml (1 hunks)
  • examples/configs/grpo_sliding_puzzle.yaml (1 hunks)
  • examples/configs/recipes/llm/grpo-deepscaler-1.5b-24K.yaml (1 hunks)
  • examples/configs/recipes/llm/grpo-deepscaler-1.5b-8K.yaml (1 hunks)
  • examples/configs/recipes/llm/grpo-gemma3-1b-it-1n8g-fsdp2tp1.yaml (1 hunks)
  • examples/configs/recipes/llm/grpo-gemma3-27b-it-8n8g-fsdp2tp8-actckpt-long.yaml (1 hunks)
  • examples/configs/recipes/llm/grpo-gspo-deepscaler-1.5b-8K.yaml (1 hunks)
  • examples/configs/recipes/llm/grpo-llama3.1-8b-instruct-1n8g-megatron-fp8.yaml (1 hunks)
  • examples/configs/recipes/llm/grpo-llama3.1-8b-instruct-2n8g-fsdp2tp1-noncolocated.yaml (1 hunks)
  • examples/configs/recipes/llm/grpo-llama3.1-8b-instruct-4n8g-fsdp2tp1-long.v3.yaml (1 hunks)
  • examples/configs/recipes/llm/grpo-llama3.2-1b-instruct-1n8g-fsdp2tp1.v3.yaml (1 hunks)
  • examples/configs/recipes/llm/grpo-llama3.2-1b-instruct-1n8g-megatron.yaml (1 hunks)
  • examples/configs/recipes/llm/grpo-math-qwen3-30ba3b-megatron-tp4-32k.yaml (1 hunks)
  • examples/configs/recipes/llm/grpo-qwen2.5-32b-32n8g-fsdp2tp8sp-actckpt-long.v3.yaml (1 hunks)
  • examples/configs/recipes/llm/grpo-qwen2.5-32b-32n8g-fsdp2tp8sp-actckpt.v3.yaml (1 hunks)
  • examples/configs/recipes/llm/grpo-qwen2.5-7b-instruct-4n8g-fsdp2tp4sp.v3.yaml (1 hunks)
  • examples/configs/recipes/llm/grpo-qwen2.5-7b-instruct-4n8g-megatron.yaml (1 hunks)
  • examples/configs/recipes/llm/grpo-qwen2.5-math-1.5b-instruct-1n8g-fsdp2tp1.v3.yaml (1 hunks)
  • examples/configs/recipes/vlm/vlm_grpo-qwen2.5-vl-3b-instruct-clevr-1n2g-dtensor2tp1.v1.yaml (1 hunks)
  • examples/configs/recipes/vlm/vlm_grpo-smolvlm2-2.2b-instruct-clevr-1n2g-dtensor2tp1.v1.yaml (1 hunks)
  • examples/configs/vlm_grpo_3B.yaml (1 hunks)
  • nemo_rl/distributed/virtual_cluster.py (2 hunks)
  • nemo_rl/distributed/worker_groups.py (2 hunks)
  • nemo_rl/models/generation/vllm/config.py (1 hunks)
  • nemo_rl/models/generation/vllm/vllm_generation.py (4 hunks)
  • nemo_rl/models/generation/vllm/vllm_worker.py (2 hunks)
  • tests/unit/environments/test_code_environment.py (1 hunks)
  • tests/unit/environments/test_retriever.py (1 hunks)
  • tests/unit/experience/test_rollouts.py (1 hunks)
  • tests/unit/models/generation/test_vllm_generation.py (1 hunks)
  • tests/unit/models/generation/test_vllm_large_model.py (1 hunks)
🧰 Additional context used
🧬 Code graph analysis (2)
nemo_rl/distributed/worker_groups.py (1)
nemo_rl/distributed/virtual_cluster.py (1)
  • get_available_address_and_port (354-388)
nemo_rl/models/generation/vllm/vllm_generation.py (3)
nemo_rl/distributed/worker_groups.py (1)
  • dp_size (590-592)
tests/unit/models/generation/test_vllm_generation.py (1)
  • cluster (211-222)
nemo_rl/distributed/virtual_cluster.py (1)
  • world_size (348-349)
🪛 Ruff (0.12.2)
nemo_rl/distributed/worker_groups.py

463-463: Loop control variable group_idx not used within loop body

(B007)


464-464: Loop control variable local_rank not used within loop body

(B007)

nemo_rl/models/generation/vllm/vllm_worker.py

347-347: Use of possibly insecure function; consider using ast.literal_eval

(S307)


348-348: Use of possibly insecure function; consider using ast.literal_eval

(S307)

nemo_rl/distributed/virtual_cluster.py

386-388: Avoid specifying long messages outside the exception class

(TRY003)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
  • GitHub Check: Lint check
🔇 Additional comments (11)
examples/configs/grpo_sliding_puzzle.yaml (1)

44-46: Rename to expert_parallel_size looks good; preserves default off (1).

File: examples/configs/grpo_sliding_puzzle.yaml (lines 44–46)

Add a short inline note that when setting expert_parallel_size > 1 it must be a multiple of tensor_parallel_size.

-      expert_parallel_size: 1
+      # Set >1 to enable EP. Constraint: expert_parallel_size % tensor_parallel_size == 0
+      expert_parallel_size: 1

Verification not completed: the ripgrep run returned "No files were searched". Re-run from the repo root to confirm there are no lingering enable_expert_parallel keys and that an alias/deprecation path exists. Suggested command:

rg -n --hidden -S --no-ignore 'enable_expert_parallel|expert_parallel_size' -g '!**/site-packages/**'
examples/configs/recipes/llm/grpo-gemma3-1b-it-1n8g-fsdp2tp1.yaml (1)

98-98: Rename LGTM.

Defaults are sane; nothing further.

examples/configs/recipes/llm/grpo-llama3.1-8b-instruct-1n8g-megatron-fp8.yaml (1)

128-128: Migration OK; FP8 generation unaffected.

Nothing blocking. If you intend to demo EP>1 here, ensure kernel/layout support matches FP8 path.

examples/configs/recipes/llm/grpo-qwen2.5-32b-32n8g-fsdp2tp8sp-actckpt-long.v3.yaml (1)

99-99: expert_parallel_size default preserved — add YAML hint & confirm TP mismatch

  • vllm_worker sets enable_expert_parallel = expert_parallel_size > 1 and passes it through (nemo_rl/models/generation/vllm/vllm_worker.py:138, 386) — expert_parallel_size: 1 preserves current behavior.
  • Add a one-line YAML comment next to examples/configs/recipes/llm/grpo-qwen2.5-32b-32n8g-fsdp2tp8sp-actckpt-long.v3.yaml (around line 99) stating: "If >1, expert_parallel_size must be a multiple of generation.vllm_cfg.tensor_parallel_size; vLLM internal DP = EP / TP."
  • Quick rg run did not surface tensor_parallel_size entries under examples/configs — confirm whether the dtensor_cfg.tensor_parallel_size=8 vs generation.vllm_cfg.tensor_parallel_size=4 (train vs gen) mismatch is intentional; if not, align or document.
examples/configs/recipes/vlm/vlm_grpo-qwen2.5-vl-3b-instruct-clevr-1n2g-dtensor2tp1.v1.yaml (1)

110-117: EP size key migration OK.

expert_parallel_size: 1 aligns with new runtime logic and keeps EP disabled by default.

examples/configs/vlm_grpo_3B.yaml (1)

111-118: Consistent EP config.

Good: expert_parallel_size replaces the old boolean and stays at 1.

examples/configs/recipes/vlm/vlm_grpo-smolvlm2-2.2b-instruct-clevr-1n2g-dtensor2tp1.v1.yaml (1)

109-116: Looks good.

EP key rename applied consistently with value 1.

examples/configs/grpo_math_1B.yaml (1)

181-182: YAML update is consistent with the new EP semantics.

Comment is correct (EP must be a multiple of TP).

tests/unit/experience/test_rollouts.py (1)

201-202: Config rename to expert_parallel_size is correct here.

Keeps default EP=1 for tests.

nemo_rl/models/generation/vllm/vllm_worker.py (1)

137-139: Deriving enable_expert_parallel from expert_parallel_size is correct.

This aligns with the new config contract (EP enabled iff size > 1).

Confirm all config sources now pass an int for expert_parallel_size (no strings).
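One way to enforce the int-only contract the comment asks to confirm is a small coercion helper at config-load time. This is a hypothetical sketch, not code from the PR:

```python
def coerce_ep_size(value) -> int:
    """Hypothetical helper: accept int (or numeric string), reject everything else."""
    if isinstance(value, bool):  # bool is an int subclass; old configs used booleans
        raise TypeError("expert_parallel_size must be an int, got bool "
                        "(leftover enable_expert_parallel?)")
    if isinstance(value, int):
        return value
    if isinstance(value, str) and value.isdigit():
        return int(value)
    raise TypeError(f"expert_parallel_size must be an int, got {type(value).__name__}")
```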

nemo_rl/distributed/virtual_cluster.py (1)

390-397: Delegation wrapper looks good

Backwards compatibility via delegation to (pg_idx=0, bundle_idx=0) is clean.

parthchadha previously approved these changes Sep 16, 2025

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 0

♻️ Duplicate comments (3)
nemo_rl/models/generation/vllm/vllm_generation.py (3)

58-69: Compute sizes as ints, raise instead of assert, and defer vLLM-DP calc.

Casts avoid accidental str math; explicit exceptions survive -O; compute vllm_dp_size only after EP/TP validation.

-        self.tp_size = self.cfg["vllm_cfg"]["tensor_parallel_size"]
-        self.pp_size = self.cfg["vllm_cfg"]["pipeline_parallel_size"]
-        self.ep_size = self.cfg["vllm_cfg"]["expert_parallel_size"]
-        self.model_parallel_size = self.tp_size * self.pp_size
-
-        assert cluster.world_size() % self.model_parallel_size == 0, (
-            "World size must be a multiple of model parallel size. "
-            f"Got world size {cluster.world_size()} and model parallel size (TP * PP) {self.model_parallel_size}."
-        )
-        self.dp_size = cluster.world_size() // self.model_parallel_size
-        self.vllm_dp_size = self.ep_size // self.tp_size
+        vllm_cfg = self.cfg["vllm_cfg"]
+        self.tp_size = int(vllm_cfg["tensor_parallel_size"])
+        self.pp_size = int(vllm_cfg["pipeline_parallel_size"])
+        self.ep_size = int(vllm_cfg["expert_parallel_size"])
+        self.model_parallel_size = self.tp_size * self.pp_size
+
+        total_mp = self.model_parallel_size
+        if cluster.world_size() % total_mp != 0:
+            raise ValueError(
+                f"world_size={cluster.world_size()} is not divisible by TP*PP={self.tp_size}*{self.pp_size}={total_mp}"
+            )
+        self.dp_size = cluster.world_size() // total_mp

70-88: Replace assert-based config checks; validate EP vs TP before computing vLLM-DP.

Asserts can be optimized out; make the checks deterministic and only then set vllm_dp_size.

-        if self.pp_size > 1:
-            assert self.cfg["vllm_cfg"]["async_engine"], (
-                "When pipeline_parallel_size > 1, async_engine must be set to True in the vLLM configuration. "
-                "You can enable it by adding `policy.generation.vllm_cfg.async_engine=true` to your command."
-            )
+        if self.pp_size > 1 and not vllm_cfg["async_engine"]:
+            raise ValueError(
+                "When pipeline_parallel_size > 1, set vllm_cfg.async_engine=true."
+            )
 
-        if self.ep_size > 1:
-            assert self.ep_size % self.tp_size == 0, (
-                "When EP > 1, EP must be a multiple of TP since vLLM's EP = DP * TP. "
-                "Please update your configuration to set expert_parallel_size to a multiple of tensor_parallel_size."
-            )
-            if self.ep_size != self.tp_size:
-                # vLLM's EP = DP * TP, so here we need to use DP inside vLLM.
-                assert not self.cfg["vllm_cfg"]["async_engine"], (
-                    "vLLM async_engine has some issues when using DP inside vLLM. "
-                    "Please update your configuration to set `policy.generation.vllm_cfg.async_engine=false`. "
-                    "See https://github.com/NVIDIA-NeMo/RL/issues/1101 for more details."
-                )
+        if self.ep_size > 1:
+            if self.ep_size % self.tp_size != 0:
+                raise ValueError(
+                    "When EP > 1, expert_parallel_size must be a multiple of tensor_parallel_size (EP = DP * TP in vLLM)."
+                )
+            if self.ep_size != self.tp_size and vllm_cfg["async_engine"]:
+                raise ValueError(
+                    "async_engine is not supported when using DP inside vLLM (EP != TP). Set vllm_cfg.async_engine=false. "
+                    "See https://github.com/NVIDIA-NeMo/RL/issues/1101 for more details."
+                )
+        # vLLM DP size computed after EP/TP validation
+        self.vllm_dp_size = max(1, self.ep_size // self.tp_size)

199-201: Don’t use assert for DP-size contract; raise a deterministic error.

Prevents silent bypass under python -O.

-        assert self.dp_size == self.worker_group.dp_size, (
-            f"Data parallel size mismatch. Expected {self.dp_size}, got {self.worker_group.dp_size}"
-        )
+        if self.dp_size != self.worker_group.dp_size:
+            raise RuntimeError(
+                f"Data-parallel size mismatch: expected {self.dp_size}, got {self.worker_group.dp_size}"
+            )
🧹 Nitpick comments (1)
nemo_rl/models/generation/vllm/vllm_generation.py (1)

93-114: Fix non-interpolated constants in error messages.

These lines reference placeholders without f-strings; users will see braces literally.

-                    "does not return logprobs after top_k filtering. Values >= {TOP_K_THRESHOLD} are allowed "
+                    f"does not return logprobs after top_k filtering. Values >= {TOP_K_THRESHOLD} are allowed "
...
-                    "does not return logprobs after top_p filtering. Values >= {TOP_P_THRESHOLD} are allowed "
+                    f"does not return logprobs after top_p filtering. Values >= {TOP_P_THRESHOLD} are allowed "
📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 0a869c8 and 43a6808.

📒 Files selected for processing (1)
  • nemo_rl/models/generation/vllm/vllm_generation.py (4 hunks)
🧰 Additional context used
🧬 Code graph analysis (1)
nemo_rl/models/generation/vllm/vllm_generation.py (3)
tests/unit/models/generation/test_vllm_generation.py (1)
  • cluster (211-222)
nemo_rl/distributed/virtual_cluster.py (1)
  • world_size (348-349)
nemo_rl/distributed/worker_groups.py (1)
  • dp_size (590-592)
🔇 Additional comments (2)
nemo_rl/models/generation/vllm/vllm_generation.py (2)

127-132: NamedSharding reshape looks good given the divisibility guard.

With the explicit divisibility check above, this reshape is safe.


160-169: Env wiring for VLLM_DP_SIZE is correct and scoped.

Setting VLLM_DP_SIZE only when EP > TP matches the DP-inside-vLLM contract.

@terrykong terrykong linked an issue Sep 16, 2025 that may be closed by this pull request
@terrykong terrykong added this pull request to the merge queue Sep 16, 2025
@github-merge-queue github-merge-queue bot removed this pull request from the merge queue due to no response for status checks Sep 17, 2025
@terrykong terrykong added this pull request to the merge queue Sep 17, 2025
@chtruong814 chtruong814 removed this pull request from the merge queue due to the queue being cleared Sep 18, 2025
@chtruong814 chtruong814 added this pull request to the merge queue Sep 18, 2025
@chtruong814 chtruong814 removed this pull request from the merge queue due to the queue being cleared Sep 18, 2025
@chtruong814 chtruong814 added this pull request to the merge queue Sep 18, 2025
@github-merge-queue github-merge-queue bot removed this pull request from the merge queue due to failed status checks Sep 18, 2025
@terrykong

This commit passed as part of this merge queue run: https://github.com/NVIDIA-NeMo/RL/actions/runs/17828559136. Will merge this one manually.

@chtruong814 chtruong814 merged commit 594e700 into main Sep 18, 2025
26 checks passed
@chtruong814 chtruong814 deleted the yukih/vllm-ep-size branch September 18, 2025 17:42
PrinsYin pushed a commit to PrinsYin/RL that referenced this pull request Nov 30, 2025

Development

Successfully merging this pull request may close these issues.

Enable vLLM PP and EP for DSV3

5 participants