
Conversation

@pimpale (Collaborator) commented Oct 21, 2025

Note

Introduces a benchmark mode that reports rolling throughput, MFU, and memory per step using a new PerfCounter and GPU peak FLOPs detection, with aggregated results displayed as a table.

  • Training:
    • Add TrainingConfig.benchmark to enable benchmark mode (caps steps to 5) and collect per-step metrics: throughput (tok/s), mfu (%), peak_memory (GB), and step_duration.
    • Integrate PerfCounter into the training loop; count tokens per step, compute rolling throughput and MFU; aggregate metrics across ranks and render a summary table.
  • Performance:
    • New hud/rl/perf.py implementing PerfCounter with rolling window, FLOPs/token estimation from PretrainedConfig, GPU peak FLOPs lookup, and a singleton accessor.
  • Utils:
    • Add get_peak_flops(device_name) to detect GPU via lspci and return BF16 peak FLOPs for common GPUs; fallback with warnings.
  • Tests:
    • Update test runner to use a smaller base model, enable benchmark mode, increase steps, and generate rollouts for multiple steps.

Written by Cursor Bugbot for commit bed6223.


@promptless bot (Contributor) commented Oct 21, 2025

📝 Documentation updates detected!

New suggestion: Add comprehensive benchmark mode documentation for PR #176
Updated existing suggestion: Document RL pipeline bugfixes from PR #142

@hud-evals hud-evals deleted a comment from chatgpt-codex-connector bot Oct 23, 2025
@jdchawla29 (Collaborator) commented:

@codex review


@chatgpt-codex-connector bot left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment on lines +46 to +50
    def get_mfu(self) -> float | None:
        tokens_per_second = self.get_tokens_per_second()
        if tokens_per_second is None:
            return None
        return 100 * self.num_flop_per_token * tokens_per_second / self.gpu_peak_flops / get_world_size()


P1: Remove extra world-size division in MFU calculation

Each rank already computes MFU in PerfCounter.get_mfu as 100 * flop_per_token * tokens_per_second / gpu_peak_flops / get_world_size(). When the results are gathered later they are averaged across ranks, so the world-size factor is applied twice. With two ranks, a device running at 60% MFU will be reported as only 30%, making the new benchmark output misleading. MFU should be computed per device (no division by world size) and then averaged or aggregated once.



        tokens_per_second = self.get_tokens_per_second()
        if tokens_per_second is None:
            return None
        return 100 * self.num_flop_per_token * tokens_per_second / self.gpu_peak_flops / get_world_size()


Bug: Double Normalization in Distributed MFU Calculation

The MFU calculation in PerfCounter divides by world_size per rank. These already normalized MFU values are then summed across ranks in train.py, leading to an incorrect, double-normalized MFU metric in distributed training.

Additional Locations (1)


@jdchawla29 jdchawla29 deleted the branch j/rl-stuff November 6, 2025 02:03
@jdchawla29 jdchawla29 closed this Nov 6, 2025
@jdchawla29 jdchawla29 deleted the g/benchmark branch November 6, 2025 02:04