G/benchmark #176
Conversation
📝 Documentation updates detected! New suggestion: Add comprehensive benchmark mode documentation for PR #176

@codex review
💡 Codex Review
Here are some automated review suggestions for this pull request.
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you:
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review"
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
```python
def get_mfu(self) -> float | None:
    tokens_per_second = self.get_tokens_per_second()
    if tokens_per_second is None:
        return None
    return 100 * self.num_flop_per_token * tokens_per_second / self.gpu_peak_flops / get_world_size()
```
Remove extra world-size division in MFU calculation
Each rank already computes MFU in `PerfCounter.get_mfu` as `100 * flop_per_token * tokens_per_second / gpu_peak_flops / get_world_size()`. When the results are gathered later they are averaged across ranks, so the world-size factor is applied twice. With two ranks, a device running at 60% MFU will be reported as only 30%, making the new benchmark output misleading. MFU should be computed per device (no division by world size) and then averaged or aggregated once.
```python
tokens_per_second = self.get_tokens_per_second()
if tokens_per_second is None:
    return None
return 100 * self.num_flop_per_token * tokens_per_second / self.gpu_peak_flops / get_world_size()
```
Bug: Double Normalization in Distributed MFU Calculation
The MFU calculation in `PerfCounter` divides by `world_size` per rank. These already-normalized MFU values are then summed across ranks in `train.py`, leading to an incorrect, double-normalized MFU metric in distributed training.
Note
Introduces a benchmark mode that reports rolling throughput, MFU, and memory per step using a new PerfCounter and GPU peak FLOPs detection, with aggregated results displayed as a table.
- `TrainingConfig.benchmark` to enable benchmark mode (caps steps to 5) and collect per-step metrics: `throughput` (tok/s), `mfu` (%), `peak_memory` (GB), and `step_duration`.
- Wires `PerfCounter` into the training loop; counts tokens per step, computes rolling throughput and MFU; aggregates metrics across ranks and renders a summary table.
- `hud/rl/perf.py` implementing `PerfCounter` with a rolling window, FLOPs/token estimation from `PretrainedConfig`, GPU peak FLOPs lookup, and a singleton accessor.
- `get_peak_flops(device_name)` to detect the GPU via `lspci` and return BF16 peak FLOPs for common GPUs; falls back with warnings.

Written by Cursor Bugbot for commit bed6223. This will update automatically on new commits.