Conversation

@ybgao-nvidia (Contributor) commented Jul 9, 2025

What does this PR do?

Allows tracking of FLOPS (floating point operations per second) for policy training, which can be used to estimate MFU (Model FLOPs Utilization).

This PR adds:

  1. The FLOPCounter class: maintains a running sum of the FLOPs incurred during model training, at batch granularity. After processing each batch, the trainer passes a (batch size, sequence length) tuple to the counter, which computes and accumulates the total FLOPs (see the sketch after this list).
  2. Invocation of the tracker in the DTensor worker.
  3. Additions to the GRPO, DPO, and SFT training scripts to measure the time consumed and compute the training throughput (FLOPs per second) for each training iteration.
  4. Training FLOPs calculation for Qwen2 (previously only Qwen3 was supported).
  5. MFU metric logging in the console and in wandb (as train/train_fp_utilization, if enabled).
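
For illustration, here is a minimal sketch of the counting pattern in items 1 and 3. The class name, its fields, and the FLOPs formula (the standard 6 · params · tokens dense term plus a quadratic attention term) are assumptions for this example, not the PR's actual FLOPCounter API:

```python
import time
from dataclasses import dataclass

@dataclass
class SimpleFLOPCounter:
    """Illustrative stand-in for the PR's FLOPCounter; not its actual API."""
    num_params: int    # total model parameters
    num_layers: int    # transformer layer count
    hidden_size: int   # model hidden dimension
    total_flops: float = 0.0

    def flops_per_batch(self, batch_size: int, seq_len: int) -> float:
        # Dense matmuls: ~6 FLOPs per parameter per token during training
        # (2 forward + 4 backward).
        dense = 6 * self.num_params * batch_size * seq_len
        # Attention scores/context: ~12 * layers * hidden * seq_len^2 per
        # sequence (forward + backward), quadratic in sequence length.
        attn = 12 * self.num_layers * self.hidden_size * batch_size * seq_len**2
        return float(dense + attn)

    def track(self, batch_size: int, seq_len: int) -> None:
        self.total_flops += self.flops_per_batch(batch_size, seq_len)

# Per-iteration throughput as in item 3: accumulated FLOPs over wall time.
counter = SimpleFLOPCounter(num_params=7_000_000_000, num_layers=28, hidden_size=3584)
start = time.perf_counter()
# ... run one training step on the batch here ...
counter.track(batch_size=8, seq_len=4096)
elapsed = time.perf_counter() - start
print(f"step throughput: {counter.total_flops / elapsed / 1e12:.2f} TFLOPS")
```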

Limitations:

  1. Support is currently implemented only for the DTensor worker.
  2. The conversion from HF's PretrainedConfig to FLOPSConfig only supports Qwen2Config and LlamaConfig (see the sketch after this list).
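
To illustrate limitation 2, a hypothetical sketch of the kind of config mapping involved; the output field names are illustrative, not the PR's actual FLOPSConfig schema:

```python
from transformers import LlamaConfig, Qwen2Config

def to_flops_config(hf_config) -> dict:
    """Extract just the fields a transformer FLOPs formula needs (sketch)."""
    if not isinstance(hf_config, (Qwen2Config, LlamaConfig)):
        raise NotImplementedError(
            f"No FLOPs formula registered for {type(hf_config).__name__}"
        )
    return {
        "num_layers": hf_config.num_hidden_layers,
        "hidden_size": hf_config.hidden_size,
        "num_attention_heads": hf_config.num_attention_heads,
        "num_key_value_heads": hf_config.num_key_value_heads,
        "intermediate_size": hf_config.intermediate_size,
        "vocab_size": hf_config.vocab_size,
    }
```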

Issues

This PR resolves #59.

Usage

When training with a supported model, the FLOPS are reported alongside the training time at the end of each epoch, as follows:

📊 Training Results:
  • Loss: 0.0121
  • Avg Reward: 0.3809
  • Mean Generation Length: 285.2734
  • Training FLOPS: 774.91 TFLOPS (96.86 TFLOPS per rank)
  • Training Model Floating Point Utilization: 9.79%
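
For context, MFU is achieved throughput divided by the hardware's peak. The total above is consistent with 8 ranks (8 × 96.86 ≈ 774.9 TFLOPS), and the 9.79% figure matches a per-rank peak of 989 TFLOPS (H100 SXM, dense BF16), though the hardware is our assumption, not something the PR states:

```python
# Hypothetical sanity check of the log above; the 989 TFLOPS peak
# (H100 SXM, dense BF16) is assumed, not stated in the PR.
achieved_per_rank = 96.86   # TFLOPS, from the training log
peak_per_rank = 989.0       # TFLOPS, assumed hardware peak
print(f"MFU: {achieved_per_rank / peak_per_rank:.2%}")  # -> MFU: 9.79%
```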

Before your PR is "Ready for review"

Pre checks:

  • Make sure you read and followed the Contributor guidelines
  • Did you write any new necessary tests?
  • Did you run the unit tests and functional tests locally? Visit our Testing Guide for how to run tests.
  • Did you add or update any necessary documentation? Visit our Document Development Guide for how to write, build, and test the docs.

Additional Information

@ybgao-nvidia ybgao-nvidia requested a review from wangshangsam July 9, 2025 21:31
@ybgao-nvidia ybgao-nvidia changed the title Track policy training compute throughput feat: track policy training compute throughput Jul 9, 2025
@ybgao-nvidia ybgao-nvidia marked this pull request as ready for review July 10, 2025 18:02
@jiemingz (Contributor) commented:
Could you also make sure your changes are covered by tests?

def test_vllm_policy_generation(policy, test_input_data, tokenizer):

@ybgao-nvidia ybgao-nvidia requested a review from jiemingz July 23, 2025 20:36
jiemingz
jiemingz previously approved these changes Jul 24, 2025
@jiemingz jiemingz self-requested a review July 28, 2025 15:00
jiemingz
jiemingz previously approved these changes Jul 28, 2025
parthchadha
parthchadha previously approved these changes Jul 28, 2025
@ybgao-nvidia ybgao-nvidia dismissed stale reviews from parthchadha and jiemingz via cad22b0 July 28, 2025 22:04
jiemingz
jiemingz previously approved these changes Jul 28, 2025
@terrykong terrykong added this pull request to the merge queue Jul 29, 2025
@github-merge-queue github-merge-queue bot removed this pull request from the merge queue due to failed status checks Jul 29, 2025
@terrykong terrykong added this pull request to the merge queue Aug 1, 2025
github-merge-queue bot pushed a commit that referenced this pull request Aug 1, 2025
@github-merge-queue github-merge-queue bot removed this pull request from the merge queue due to failed status checks Aug 1, 2025
@terrykong terrykong added this pull request to the merge queue Aug 5, 2025
@github-merge-queue github-merge-queue bot removed this pull request from the merge queue due to a conflict with the base branch Aug 5, 2025
@terrykong terrykong added this pull request to the merge queue Aug 5, 2025
github-merge-queue bot pushed a commit that referenced this pull request Aug 5, 2025
Merged via the queue into NVIDIA-NeMo:main with commit b6269f7 Aug 6, 2025
19 checks passed
soodoshll pushed a commit to soodoshll/RL that referenced this pull request Aug 13, 2025
youngeunkwon0405 added a commit to youngeunkwon0405/RL that referenced this pull request Aug 25, 2025
jveronvialard pushed a commit that referenced this pull request Aug 27, 2025
PrinsYin pushed a commit to PrinsYin/RL that referenced this pull request Nov 30, 2025