refactor: split sync/async vllm worker ([1/2] of refactor vllm worker) #900

yuki-97 · 2025-08-12T16:01:57Z

What does this PR do?

Step 1 of the vllm worker refactor:

Move the vllm code from nemo_rl/models/generation to nemo_rl/models/generation/vllm, so that it is easier to support other inference frameworks in the future.
Split the sync and async vllm workers into separate files to make the code clearer.
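The split described above can be pictured with a small, self-contained sketch. It is not the code from this PR: every class and function name below (`BaseWorkerSketch`, `SyncWorkerSketch`, `AsyncWorkerSketch`, `make_worker`) and the `async_engine` flag are hypothetical stand-ins, and the "generation" is a string-formatting placeholder rather than a call into vllm. It only illustrates the idea of keeping shared logic in one place while the sync and async workers live in separate modules.

```python
import asyncio


class BaseWorkerSketch:
    """Hypothetical shared logic that both worker flavours would reuse."""

    def __init__(self, model_name: str) -> None:
        self.model_name = model_name

    def _format(self, prompt: str) -> str:
        # Placeholder for real generation: tag the prompt with the model name.
        return f"[{self.model_name}] {prompt}"


class SyncWorkerSketch(BaseWorkerSketch):
    """Hypothetical sync worker; in a split layout it would sit in its own file."""

    def generate(self, prompts: list[str]) -> list[str]:
        # Blocking call: returns once every prompt has been processed.
        return [self._format(p) for p in prompts]


class AsyncWorkerSketch(BaseWorkerSketch):
    """Hypothetical async worker; it would sit in a separate file from the sync one."""

    async def generate(self, prompts: list[str]) -> list[str]:
        async def one(prompt: str) -> str:
            # Yield to the event loop once, as real async I/O would.
            await asyncio.sleep(0)
            return self._format(prompt)

        # gather() returns results in the same order as its inputs.
        return list(await asyncio.gather(*(one(p) for p in prompts)))


def make_worker(model_name: str, async_engine: bool) -> BaseWorkerSketch:
    """Pick a worker class from a (hypothetical) async_engine flag."""
    cls = AsyncWorkerSketch if async_engine else SyncWorkerSketch
    return cls(model_name)
```

Under these assumptions, a caller picks the worker once and then either calls `generate(...)` directly (sync) or awaits it, for example via `asyncio.run(worker.generate(...))` (async). Neither worker file needs to know about the other.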

Test Result

	convergence	time
FSDP2 with sync vllm
Megatron with async vllm

Issues

Related #599.

yuki-97 changed the title ~~refactor: move files and split sync/async vllm worker ([1/2] of refactor vllm worker)~~ refactor: split sync/async vllm worker ([1/2] of refactor vllm worker) Aug 12, 2025

yuki-97 force-pushed the yukih/refactor-vllm-step-1 branch 2 times, most recently from c32df49 to 6041a5b Compare August 12, 2025 16:08

yuki-97 added the CI:L1 Run doctests, unit tests, and functional tests label Aug 12, 2025

yuki-97 temporarily deployed to nemo-ci August 12, 2025 16:19 — with GitHub Actions Inactive

yuki-97 temporarily deployed to nemo-ci August 12, 2025 17:37 — with GitHub Actions Inactive

github-actions bot added the documentation Improvements or additions to documentation label Aug 13, 2025

yuki-97 force-pushed the yukih/refactor-vllm-step-1 branch from 01d2f21 to c09ddad Compare August 13, 2025 03:33

yuki-97 added CI:docs Run doctest and removed CI:L1 Run doctests, unit tests, and functional tests labels Aug 13, 2025

yuki-97 temporarily deployed to nemo-ci August 13, 2025 03:36 — with GitHub Actions Inactive

yuki-97 temporarily deployed to nemo-ci August 13, 2025 03:37 — with GitHub Actions Inactive

yuki-97 marked this pull request as ready for review August 13, 2025 03:52

yuki-97 requested review from parthchadha and terrykong August 13, 2025 03:52

terrykong reviewed Aug 13, 2025

nemo_rl/models/generation/vllm/__init__.py (outdated, resolved)

nemo_rl/models/generation/vllm/vllm_worker_async.py (resolved)

terrykong approved these changes Aug 14, 2025

yuki-97 added 6 commits August 14, 2025 21:34

move files

b55f541

Signed-off-by: Yuki Huang <[email protected]>

remove duplicated code

73ee68c

Signed-off-by: Yuki Huang <[email protected]>

lint

c7c47ff

Signed-off-by: Yuki Huang <[email protected]>

fix path

51db8c2

Signed-off-by: Yuki Huang <[email protected]>

fix doc

8d0d7eb

Signed-off-by: Yuki Huang <[email protected]>

use absolute import

7cb2463

Signed-off-by: Yuki Huang <[email protected]>

yuki-97 force-pushed the yukih/refactor-vllm-step-1 branch from 3a2a649 to 7cb2463 Compare August 14, 2025 13:34

parthchadha approved these changes Aug 14, 2025

parthchadha added this pull request to the merge queue Aug 14, 2025

Merged via the queue into main with commit 83c6bfc Aug 14, 2025
19 checks passed

parthchadha deleted the yukih/refactor-vllm-step-1 branch August 14, 2025 16:53

zhandaz pushed a commit that referenced this pull request Aug 19, 2025

refactor: split sync/async vllm worker ([1/2] of refactor vllm worker) (#900)

8f05264

Signed-off-by: Yuki Huang <[email protected]>

jveronvialard pushed a commit that referenced this pull request Aug 27, 2025

refactor: split sync/async vllm worker ([1/2] of refactor vllm worker) (#900)

ba29d45

Signed-off-by: Yuki Huang <[email protected]>
Signed-off-by: Julien Veron Vialard <[email protected]>

soodoshll pushed a commit to soodoshll/RL that referenced this pull request Sep 4, 2025

refactor: split sync/async vllm worker ([1/2] of refactor vllm worker) (NVIDIA-NeMo#900)

c13e168

Signed-off-by: Yuki Huang <[email protected]>
Signed-off-by: Qidong Su <[email protected]>

yuki-97 mentioned this pull request Sep 17, 2025

Refactor: separate sync/async vllm #599

Open

PrinsYin pushed a commit to PrinsYin/RL that referenced this pull request Nov 30, 2025

refactor: split sync/async vllm worker ([1/2] of refactor vllm worker) (NVIDIA-NeMo#900)

6fa5436

Signed-off-by: Yuki Huang <[email protected]>
