feat: refit refactoring with zmq and overlapping #1267
Conversation
📝 Walkthrough

Switches colocated refit from grouped CUDA-IPC handles to ZMQ-based streaming. Introduces ZMQ sockets on policy (REQ) and generation workers (REP), streams packed weights in aligned buffers, and awaits update futures. Refactors interfaces and vLLM paths accordingly; retains NCCL broadcast for non-colocated mode. Adds memory querying utilities and the zmq dependency.
Sequence Diagram(s)

```mermaid
sequenceDiagram
    autonumber
    participant Trainer as Trainer (GRPO)
    participant Policy as Policy Worker(s) [REQ]
    participant Gen as Generation Worker(s) [REP]
    note over Trainer,Gen: Colocated refit via ZMQ streaming
    Trainer->>Policy: stream_weights_via_ipc_zmq(buffer_size_bytes)
    Trainer->>Gen: update_weights_via_ipc_zmq()
    par Stream batches
        loop For each packed buffer
            Policy->>Gen: ZMQ SEND(buffer bytes, metadata)
            Gen-->>Policy: ZMQ RECV(ack 0)
            Gen->>Gen: Rebuild tensors, load weights, sync stream
        end
        Policy->>Gen: ZMQ SEND("complete")
        Gen-->>Policy: ZMQ RECV(ack "done")
    end
    Gen-->>Trainer: futures resolved (success flags)
    Policy-->>Trainer: futures resolved
    Trainer->>Trainer: Validate all successes
    alt Any failure
        Trainer-->>Trainer: Raise refit update error
    end
```

```mermaid
sequenceDiagram
    participant Trainer as Trainer (GRPO)
    participant Policy as Policy
    participant Gen as Generation
    note over Trainer,Gen: Non-colocated path unchanged
    Trainer->>Policy: broadcast_weights_for_collective()
    Policy-->>Gen: NCCL broadcast
    Gen-->>Trainer: success
```
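For illustration, the per-buffer handshake in the first diagram can be sketched with pyzmq over an in-process transport. The endpoint name and message framing here are hypothetical; the real protocol lives in the policy and generation workers:

```python
import threading
import zmq

ADDR = "inproc://refit-demo"  # hypothetical endpoint; real code uses TCP/IPC addresses
ctx = zmq.Context.instance()
bound = threading.Event()
received = []  # (metadata, payload) pairs seen by the generation side

def generation_worker():
    """REP side: ack each packed buffer, then ack the completion message."""
    sock = ctx.socket(zmq.REP)
    sock.bind(ADDR)
    bound.set()
    while True:
        frames = sock.recv_multipart()
        if frames[0] == b"complete":
            sock.send(b"done")          # final handshake
            break
        meta, payload = frames          # e.g. tensor metadata + raw packed bytes
        received.append((meta, payload))
        sock.send(b"\x00")              # ack: buffer consumed, sender may reuse it
    sock.close()

def policy_worker(buffers):
    """REQ side: stream each buffer and wait for the ack before reusing it."""
    sock = ctx.socket(zmq.REQ)
    sock.connect(ADDR)
    for meta, payload in buffers:
        sock.send_multipart([meta, payload])
        assert sock.recv() == b"\x00"   # gen worker finished copying this buffer
    sock.send_multipart([b"complete"])
    assert sock.recv() == b"done"
    sock.close()

t = threading.Thread(target=generation_worker)
t.start()
bound.wait()
policy_worker([(b"layer.0.weight", b"\x01" * 16), (b"layer.1.weight", b"\x02" * 16)])
t.join()
```

The REQ/REP pairing gives the strict send/ack alternation shown in the diagram, which is what lets the sender know a buffer is free to refill.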
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~60–90 minutes
Actionable comments posted: 5
🧹 Nitpick comments (8)
nemo_rl/models/generation/interfaces.py (1)
236-238: Keep the IPC update contract abstract
`GenerationInterface` previously forced subclasses to implement the IPC update hook. Dropping `@abstractmethod` allows concrete classes to instantiate without overriding, only to fail later when the method is invoked. Please mark `update_weights_via_ipc_zmq` as abstract (as before) to enforce the contract at construction time.

```diff
-    def update_weights_via_ipc_zmq(self) -> list[ray.ObjectRef]:
+    @abstractmethod
+    def update_weights_via_ipc_zmq(self) -> list[ray.ObjectRef]:
```
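The construction-time enforcement the reviewer asks for can be demonstrated in isolation. This `GenerationInterface` is a simplified stand-in, assuming the real interface derives from `abc.ABC`:

```python
from abc import ABC, abstractmethod

class GenerationInterface(ABC):
    """Simplified sketch of the interface under review."""

    @abstractmethod
    def update_weights_via_ipc_zmq(self):
        """Subclasses must implement the ZMQ-based weight update."""

class ForgetfulBackend(GenerationInterface):
    pass  # missing the override

try:
    ForgetfulBackend()  # abstract methods are enforced at construction time
    constructed = True
except TypeError:
    constructed = False
```

Without `@abstractmethod`, `ForgetfulBackend()` would construct fine and only fail when the method is first called, which is exactly the late failure the review warns about.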
nemo_rl/models/generation/vllm/vllm_worker_async.py (1)

753-763: Refresh the docstring to match the new signature

The method no longer accepts `ipc_handles`, but the docstring still documents that parameter. Please align the docstring with the current API to avoid confusing downstream readers.

```diff
-    async def update_weights_via_ipc_zmq_async(
-        self,
-    ) -> bool:
-        """Async version of update_weights_from_ipc_handles.
-
-        Args:
-            ipc_handles (dict): Dictionary mapping device UUIDs (str) to parameter IPC handles.
+    async def update_weights_via_ipc_zmq_async(self) -> bool:
+        """Async counterpart of ``update_weights_via_ipc_zmq`` on the sync worker.

         Returns:
             bool: True if weights were successfully updated, False otherwise.
         """
```

nemo_rl/models/policy/dtensor_policy_worker.py (2)
1718-1724: Return type mismatch: cast NVML free memory to int

`nvml.get_free_memory_bytes` returns float; the method is annotated to return int.

```diff
-        return get_free_memory_bytes(device_idx)
+        return int(get_free_memory_bytes(device_idx))
```
1726-1751: Make streamed tensors contiguous and validate buffer size.
- Add an assert to catch unintended zero-sized buffers.
- Ensure tensors are contiguous before packing to avoid view/copy pitfalls.
```diff
     def stream_weights_via_ipc_zmq(self, buffer_size_bytes: int = 0) -> None:
         self.maybe_init_zmq()
@@
         def dtensor_params_generator():
             """Generator that yields (name, tensor) pairs, converting DTensors to local tensors."""
             for name, tensor in self.model.state_dict().items():
                 if isinstance(tensor, DTensor):
                     # Convert DTensor to full tensor for streaming
                     full_tensor = tensor.full_tensor()
-                    # Convert to target dtype
-                    yield name, full_tensor.to(self.dtype, non_blocking=True)
+                    # Convert to target dtype and ensure contiguity
+                    yield name, full_tensor.to(self.dtype, non_blocking=True).contiguous()
                 else:
-                    # Convert to target dtype
-                    yield name, tensor.to(self.dtype, non_blocking=True)
+                    # Convert to target dtype and ensure contiguity
+                    yield name, tensor.to(self.dtype, non_blocking=True).contiguous()
@@
+        assert buffer_size_bytes > 0, "buffer_size_bytes must be > 0"
         stream_weights_via_ipc_zmq_impl(
             params_generator=dtensor_params_generator(),
             buffer_size_bytes=buffer_size_bytes,
             zmq_socket=self.zmq_socket,
             rank=self.rank,
             worker_name=str(self),
         )
```
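The contiguity pitfall the reviewer flags can be illustrated with NumPy (used here only for brevity; the same issue applies to torch views, where `.contiguous()` plays the role of `np.ascontiguousarray`):

```python
import numpy as np

w = np.arange(6, dtype=np.float32).reshape(2, 3)
view = w.T  # a strided view, like a transposed/sliced model parameter

assert not view.flags["C_CONTIGUOUS"]  # raw memory order != logical order

# A naive zero-copy pack (copying nbytes starting at the parent's data
# pointer) would serialize the parent layout, not the transposed one:
raw = np.frombuffer(w.tobytes(), dtype=np.float32)   # what a raw memcpy sees
logical = view.ravel()                               # what the receiver expects
assert not np.array_equal(raw, logical)

packed = np.ascontiguousarray(view)  # analogous to tensor.contiguous()
assert np.array_equal(np.frombuffer(packed.tobytes(), dtype=np.float32), logical)
```

This is why packing a non-contiguous tensor into a flat byte buffer without first making it contiguous silently corrupts the reconstructed weights on the receiver.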
nemo_rl/models/policy/lm_policy.py (1)

659-665: Cast free memory to int for the annotated return type

```diff
-        free_memory_bytes = min(ray.get(future) for future in futures)
-        return free_memory_bytes
+        free_memory_bytes = min(ray.get(future) for future in futures)
+        return int(free_memory_bytes)
```

nemo_rl/models/generation/vllm/vllm_generation.py (1)
751-779: Docstring and return type drift; clarify no-arg API and futures return

Update to reflect ZMQ streaming and that this returns futures, not bool.

```diff
-    def update_weights_via_ipc_zmq(self) -> list[ray.ObjectRef]:
-        """Update weights of the policy using IPC handles, considering tensor parallelism.
-
-        For tp > 1, only the leader in each tensor parallel tied worker group will update weights.
-
-        Args:
-            ipc_handles (dict): Dictionary mapping device UUIDs (str) to parameter IPC handles.
-
-        Returns:
-            bool: True if weights were successfully updated, False otherwise.
-        """
+    def update_weights_via_ipc_zmq(self) -> list[ray.ObjectRef]:
+        """Kick off ZMQ-based IPC weight updates on workers.
+
+        Returns ray futures; caller should await them.
+        """
@@
         # Use run_all_workers_single_data since no data needs to be passed
         futures = self.worker_group.run_all_workers_single_data(
             method_name,
             run_rank_0_only_axes=["tensor_parallel", "pipeline_parallel"],
         )
-        # this function should co-work with lm_policy, so we should wait for all futures to complete outside
+        # Wait for completion outside to coordinate with the policy side
         return futures
```

nemo_rl/models/generation/vllm/vllm_worker.py (2)
701-739: Align the NVTX tag and error message with the new API name; keep the collective_rpc call

```diff
-    @wrap_with_nvtx_name("vllm_genertion_worker/update_weights_from_ipc_handles")
+    @wrap_with_nvtx_name("vllm_generation_worker/update_weights_via_ipc_zmq")
     def update_weights_via_ipc_zmq(self) -> bool:
@@
         if self.cfg["vllm_cfg"]["async_engine"]:
             raise RuntimeError(
-                "update_weights_from_ipc_handles cannot be used with async_engine=True. Use update_weights_from_ipc_handles_async instead."
+                "update_weights_via_ipc_zmq cannot be used with async_engine=True. Use update_weights_via_ipc_zmq_async instead."
             )
@@
-        result_or_coro = self.llm.collective_rpc(
-            "update_weights_via_ipc_zmq",
-            args=tuple(),
-        )
+        result_or_coro = self.llm.collective_rpc("update_weights_via_ipc_zmq", args=tuple())
```
727-733: Propagate worker failure context or raise

The current print-and-return-False hides errors. Consider raising RuntimeError to fail fast in orchestration paths.
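A fail-fast variant might look like the following sketch (`DemoWorker` and its `_collective_rpc` hook are stand-ins for illustration, not the real vLLM worker API):

```python
class DemoWorker:
    """Toy stand-in for the vLLM worker; `_collective_rpc` is a hypothetical hook."""

    def __init__(self, results):
        self._results = results  # per-engine-rank success flags

    def _collective_rpc(self, method_name):
        return self._results

    def update_weights_via_ipc_zmq(self):
        # Fail fast with context instead of print-and-return-False, so
        # orchestration code surfaces the broken refit immediately.
        results = self._collective_rpc("update_weights_via_ipc_zmq")
        if not all(results):
            raise RuntimeError(f"some engine ranks reported failure: {results}")
        return True

assert DemoWorker([True, True]).update_weights_via_ipc_zmq()
try:
    DemoWorker([True, False]).update_weights_via_ipc_zmq()
    failed = False
except RuntimeError:
    failed = True
```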
📜 Review details
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
⛔ Files ignored due to path filters (1)
`uv.lock` is excluded by `!**/*.lock`
📒 Files selected for processing (12)
- nemo_rl/algorithms/grpo.py (1 hunks)
- nemo_rl/models/generation/interfaces.py (1 hunks)
- nemo_rl/models/generation/vllm/vllm_backend.py (2 hunks)
- nemo_rl/models/generation/vllm/vllm_generation.py (2 hunks)
- nemo_rl/models/generation/vllm/vllm_worker.py (2 hunks)
- nemo_rl/models/generation/vllm/vllm_worker_async.py (2 hunks)
- nemo_rl/models/policy/dtensor_policy_worker.py (2 hunks)
- nemo_rl/models/policy/interfaces.py (1 hunks)
- nemo_rl/models/policy/lm_policy.py (1 hunks)
- nemo_rl/models/policy/megatron_policy_worker.py (5 hunks)
- nemo_rl/models/policy/utils.py (2 hunks)
- pyproject.toml (1 hunks)
🧰 Additional context used
📓 Path-based instructions (2)
**/*.py
📄 CodeRabbit inference engine (CODING_GUIDELINES.md)
**/*.py: Follow the Google Python Style Guide for all Python code
Target Python 3.12+ for all Python code in NeMo-RL
Indent Python code with 4 spaces; do not use tabs
Python filenames should be snake_case (e.g., some_file.py)
Class names should be PascalCase
Function and method names should be snake_case
Local variable names should be snake_case; if starting with a number, prefix with k (e.g., k_99th_percentile)
Global variables should be UPPER_SNAKE_CASE and prefixed with G_ (e.g., G_MY_GLOBAL)
Constants should be UPPER_SNAKE_CASE
Avoid shadowing variables declared in an outer scope
Initialize all externally visible members of a class in the constructor
For public interfaces used outside a file, prefer docstrings over comments
Use comments mainly for code within a function or interfaces local to a file
Commented-out code must include a nearby comment explaining usage and why it is commented out; otherwise remove before merging
Use Google-style docstrings for classes and functions (Sphinx-parseable)
Avoid using reflection when functionality can be easily achieved without it
Limit except clauses to the smallest specific set of exceptions possible
For duck-typing via try/except, keep the try body minimal and use else for main logic
Add the NVIDIA copyright header (with current year) at the top of all Python files, excluding tests/ and test-only scripts
Files:
- nemo_rl/models/generation/interfaces.py
- nemo_rl/models/generation/vllm/vllm_worker_async.py
- nemo_rl/models/generation/vllm/vllm_generation.py
- nemo_rl/algorithms/grpo.py
- nemo_rl/models/generation/vllm/vllm_worker.py
- nemo_rl/models/policy/lm_policy.py
- nemo_rl/models/generation/vllm/vllm_backend.py
- nemo_rl/models/policy/utils.py
- nemo_rl/models/policy/dtensor_policy_worker.py
- nemo_rl/models/policy/megatron_policy_worker.py
- nemo_rl/models/policy/interfaces.py
nemo_rl/**/*.py
📄 CodeRabbit inference engine (CODING_GUIDELINES.md)
nemo_rl/**/*.py: Do not set non-None configuration defaults in code; YAML is the single source of truth for defaults
Access required config attributes directly (e.g., policy_cfg["precision"]) and assume presence; do not introduce hidden defaults
Express configuration optionality via TypedDict using typing.NotRequired
When adding a new config key to a TypedDict subclass, document the key’s purpose, valid values/types, and recommended default in code
For any class or function decorated with @ray.remote, add '# pragma: no cover' on the class/def line (and on remote functions)
Files:
- nemo_rl/models/generation/interfaces.py
- nemo_rl/models/generation/vllm/vllm_worker_async.py
- nemo_rl/models/generation/vllm/vllm_generation.py
- nemo_rl/algorithms/grpo.py
- nemo_rl/models/generation/vllm/vllm_worker.py
- nemo_rl/models/policy/lm_policy.py
- nemo_rl/models/generation/vllm/vllm_backend.py
- nemo_rl/models/policy/utils.py
- nemo_rl/models/policy/dtensor_policy_worker.py
- nemo_rl/models/policy/megatron_policy_worker.py
- nemo_rl/models/policy/interfaces.py
🧬 Code graph analysis (9)
nemo_rl/models/generation/interfaces.py (3)
- nemo_rl/models/generation/vllm/vllm_backend.py (1): update_weights_via_ipc_zmq (84-161)
- nemo_rl/models/generation/vllm/vllm_generation.py (1): update_weights_via_ipc_zmq (751-779)
- nemo_rl/models/generation/vllm/vllm_worker.py (1): update_weights_via_ipc_zmq (702-738)

nemo_rl/models/generation/vllm/vllm_generation.py (4)
- nemo_rl/models/generation/interfaces.py (1): update_weights_via_ipc_zmq (236-238)
- nemo_rl/models/generation/vllm/vllm_backend.py (1): update_weights_via_ipc_zmq (84-161)
- nemo_rl/models/generation/vllm/vllm_worker.py (1): update_weights_via_ipc_zmq (702-738)
- nemo_rl/distributed/worker_groups.py (1): run_all_workers_single_data (728-772)

nemo_rl/algorithms/grpo.py (9)
- nemo_rl/models/policy/dtensor_policy_worker.py (2): get_free_memory_bytes (1718-1723), stream_weights_via_ipc_zmq (1727-1751)
- nemo_rl/models/policy/lm_policy.py (2): get_free_memory_bytes (659-664), stream_weights_via_ipc_zmq (666-671)
- nemo_rl/models/policy/megatron_policy_worker.py (2): get_free_memory_bytes (1625-1630), stream_weights_via_ipc_zmq (1634-1653)
- nemo_rl/utils/nvml.py (1): get_free_memory_bytes (80-90)
- nemo_rl/models/policy/interfaces.py (1): stream_weights_via_ipc_zmq (161-164)
- nemo_rl/models/generation/interfaces.py (1): update_weights_via_ipc_zmq (236-238)
- nemo_rl/models/generation/vllm/vllm_backend.py (1): update_weights_via_ipc_zmq (84-161)
- nemo_rl/models/generation/vllm/vllm_generation.py (1): update_weights_via_ipc_zmq (751-779)
- nemo_rl/models/generation/vllm/vllm_worker.py (1): update_weights_via_ipc_zmq (702-738)

nemo_rl/models/generation/vllm/vllm_worker.py (3)
- nemo_rl/models/generation/interfaces.py (1): update_weights_via_ipc_zmq (236-238)
- nemo_rl/models/generation/vllm/vllm_backend.py (1): update_weights_via_ipc_zmq (84-161)
- nemo_rl/models/generation/vllm/vllm_generation.py (1): update_weights_via_ipc_zmq (751-779)

nemo_rl/models/policy/lm_policy.py (4)
- nemo_rl/models/policy/dtensor_policy_worker.py (2): get_free_memory_bytes (1718-1723), stream_weights_via_ipc_zmq (1727-1751)
- nemo_rl/models/policy/megatron_policy_worker.py (2): get_free_memory_bytes (1625-1630), stream_weights_via_ipc_zmq (1634-1653)
- nemo_rl/distributed/worker_groups.py (1): run_all_workers_single_data (728-772)
- nemo_rl/models/policy/interfaces.py (1): stream_weights_via_ipc_zmq (161-164)

nemo_rl/models/generation/vllm/vllm_backend.py (8)
- nemo_rl/models/policy/dtensor_policy_worker.py (4): get_zmq_address (1699-1701), report_device_id (1686-1697), maybe_init_zmq (1703-1708), prepare_refit_info (1711-1716)
- nemo_rl/models/policy/megatron_policy_worker.py (4): get_zmq_address (1545-1547), report_device_id (1532-1543), maybe_init_zmq (1549-1554), prepare_refit_info (1558-1572)
- nemo_rl/models/generation/vllm/vllm_worker.py (3): report_device_id (681-695), prepare_refit_info (697-699), update_weights_via_ipc_zmq (702-738)
- nemo_rl/models/generation/interfaces.py (2): prepare_refit_info (232-234), update_weights_via_ipc_zmq (236-238)
- nemo_rl/models/generation/vllm/vllm_generation.py (2): prepare_refit_info (732-749), update_weights_via_ipc_zmq (751-779)
- nemo_rl/utils/nsys.py (1): wrap_with_nvtx_name (82-94)
- nemo_rl/models/policy/utils.py (1): calculate_aligned_size (256-266)
- nemo_rl/models/generation/fp8.py (2): is_fp8_model (202-213), load_weights (289-316)

nemo_rl/models/policy/dtensor_policy_worker.py (7)
- nemo_rl/models/generation/vllm/vllm_backend.py (4): get_zmq_address (57-59), report_device_id (52-55), maybe_init_zmq (61-72), prepare_refit_info (74-81)
- nemo_rl/models/policy/megatron_policy_worker.py (6): get_zmq_address (1545-1547), report_device_id (1532-1543), maybe_init_zmq (1549-1554), prepare_refit_info (1558-1572), get_free_memory_bytes (1625-1630), stream_weights_via_ipc_zmq (1634-1653)
- nemo_rl/models/policy/interfaces.py (2): prepare_refit_info (157-158), stream_weights_via_ipc_zmq (161-164)
- nemo_rl/models/policy/lm_policy.py (3): prepare_refit_info (648-657), get_free_memory_bytes (659-664), stream_weights_via_ipc_zmq (666-671)
- nemo_rl/utils/nvml.py (1): get_free_memory_bytes (80-90)
- nemo_rl/utils/nsys.py (1): wrap_with_nvtx_name (82-94)
- nemo_rl/models/policy/utils.py (1): stream_weights_via_ipc_zmq_impl (269-384)

nemo_rl/models/policy/megatron_policy_worker.py (3)
- nemo_rl/models/policy/dtensor_policy_worker.py (5): get_zmq_address (1699-1701), report_device_id (1686-1697), maybe_init_zmq (1703-1708), get_free_memory_bytes (1718-1723), stream_weights_via_ipc_zmq (1727-1751)
- nemo_rl/models/policy/lm_policy.py (2): get_free_memory_bytes (659-664), stream_weights_via_ipc_zmq (666-671)
- nemo_rl/models/policy/utils.py (1): stream_weights_via_ipc_zmq_impl (269-384)

nemo_rl/models/policy/interfaces.py (3)
- nemo_rl/models/policy/dtensor_policy_worker.py (1): stream_weights_via_ipc_zmq (1727-1751)
- nemo_rl/models/policy/lm_policy.py (1): stream_weights_via_ipc_zmq (666-671)
- nemo_rl/models/policy/megatron_policy_worker.py (1): stream_weights_via_ipc_zmq (1634-1653)
🪛 Ruff (0.13.2)
nemo_rl/models/generation/vllm/vllm_generation.py
763-763: Avoid specifying long messages outside the exception class
(TRY003)
nemo_rl/models/generation/vllm/vllm_backend.py
155-155: Consider moving this statement to an else block
(TRY300)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (2)
- GitHub Check: Lint check
- GitHub Check: Post submodule check comment / Comment on PR
🔇 Additional comments (2)
nemo_rl/models/policy/dtensor_policy_worker.py (1)
1711-1716: Ensure dtype metadata matches the streamed dtype across workers

You record `(shape, self.dtype)`. That's fine only if the streaming generator also casts to `self.dtype` for every tensor, and the receiver reconstructs using the same dtype. Please confirm `MegatronPolicyWorker.prepare_refit_info` also uses the same dtype to avoid size mismatches.

nemo_rl/models/policy/lm_policy.py (1)
666-672: LGTM: streaming API returns futures as intended.
ℹ️ File Consistency Check

Check based on commit: 0d395af (PR #1267)

✅ DTensor Policy Worker Synchronization Check: both DTensor policy worker files were modified in this PR.

Please ensure that the changes are consistent between both files where applicable. This check ensures that related file implementations remain synchronized across the codebase. If you believe this warning is incorrect or the files should intentionally differ, please add a comment explaining the reasoning.
@coderabbitai generate docstrings

✅ Actions performed

Docstrings generation started.
Docstrings generation was requested by @ZhiyuLi-Nvidia (#1267 (comment)). The following files were modified:

- `nemo_rl/algorithms/grpo.py`
- `nemo_rl/models/generation/interfaces.py`
- `nemo_rl/models/generation/vllm/vllm_backend.py`
- `nemo_rl/models/generation/vllm/vllm_generation.py`
- `nemo_rl/models/generation/vllm/vllm_worker.py`
- `nemo_rl/models/generation/vllm/vllm_worker_async.py`
- `nemo_rl/models/policy/dtensor_policy_worker.py`
- `nemo_rl/models/policy/dtensor_policy_worker_v2.py`
- `nemo_rl/models/policy/interfaces.py`
- `nemo_rl/models/policy/lm_policy.py`
- `nemo_rl/models/policy/megatron_policy_worker.py`
- `nemo_rl/models/policy/utils.py`
Note: Generated docstrings for this pull request at #1270
@guyueh1 @terrykong @parthchadha @yuki-97 @yfw could you take a review?
Signed-off-by: Zhiyu Li <zhiyul@NVIDIA.com>
What does this PR do?
Summary
As titled, we now use Ray solely for lightweight coordination and notifications, while leveraging ZeroMQ (ZMQ) for high-throughput, peer-to-peer (P2P) inter-process communication (IPC). This design avoids costly serialization and enables superior overlap between computation and data transfer.
Asynchronous Overlap with Ping-Pong Buffers: The weight update process (refit) now operates within an inner generator loop, triggered by notifications from the Ray coordinator. We achieve significant overlap between the weight gathering generator and the actual weight updates by using asynchronous ZMQ send/recv operations combined with a ping-pong buffer. This allows one buffer to be filled with incoming weights while the other is being used for computation, maximizing hardware utilization.
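The ping-pong overlap described above can be sketched transport-agnostically with a small pool of reusable buffers (helper names here are hypothetical; the real implementation uses async ZMQ send/recv inside `stream_weights_via_ipc_zmq_impl`):

```python
import queue
import threading

def stream_with_ping_pong(params, pack, send, buffer_size=64, num_buffers=2):
    """Overlap packing with sending using a pool of reusable buffers.

    While the sender thread ships buffer A, the main thread is already
    packing the next parameter into buffer B ("ping-pong").
    """
    free = queue.Queue()    # buffers ready to be filled
    ready = queue.Queue()   # (buffer, nbytes) pairs ready to be sent
    for _ in range(num_buffers):
        free.put(bytearray(buffer_size))

    def sender():
        while True:
            item = ready.get()
            if item is None:        # completion sentinel
                return
            buf, n = item
            send(bytes(buf[:n]))    # ship the packed bytes
            free.put(buf)           # recycle the buffer for refilling

    t = threading.Thread(target=sender)
    t.start()
    for p in params:
        buf = free.get()            # blocks only if every buffer is in flight
        n = pack(p, buf)
        ready.put((buf, n))
    ready.put(None)
    t.join()

def pack(param, buf):
    """Copy a parameter's bytes into the front of `buf`; return bytes written."""
    buf[: len(param)] = param
    return len(param)

sent = []
stream_with_ping_pong([b"w0", b"w1", b"w2"], pack, sent.append)
```

With two buffers, packing parameter i+1 proceeds while parameter i is in flight; the `free` queue provides the back-pressure that stops the producer from overwriting a buffer still being sent.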
Unified Communication Protocol
The ZMQ-based P2P communication logic has been standardized and applied across multiple components. This includes the async and sync LLM engines and all supported backends: DTensor, DTensor v2, and Mcore.
Significant Code Simplification
By refactoring and aligning all components to use this single, efficient communication protocol, we have greatly simplified the codebase. This effort resulted in a net removal of approximately 170 lines of redundant or complex code.