G/benchmark #176
Conversation
📝 Documentation updates detected! New suggestion: Add comprehensive benchmark mode documentation for PR #176

@codex review
💡 Codex Review
Here are some automated review suggestions for this pull request.
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you:
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review"
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
```python
def get_mfu(self) -> float | None:
    tokens_per_second = self.get_tokens_per_second()
    if tokens_per_second is None:
        return None
    return 100 * self.num_flop_per_token * tokens_per_second / self.gpu_peak_flops / get_world_size()
```
Remove extra world-size division in MFU calculation
Each rank already computes MFU in `PerfCounter.get_mfu` as `100 * flop_per_token * tokens_per_second / gpu_peak_flops / get_world_size()`. When the results are gathered later they are averaged across ranks, so the world-size factor is applied twice. With two ranks, a device running at 60% MFU will be reported as only 30%, making the new benchmark output misleading. MFU should be computed per device (no division by world size) and then averaged or aggregated once.
```python
tokens_per_second = self.get_tokens_per_second()
if tokens_per_second is None:
    return None
return 100 * self.num_flop_per_token * tokens_per_second / self.gpu_peak_flops / get_world_size()
```
Bug: Double Normalization in Distributed MFU Calculation
The MFU calculation in `PerfCounter` divides by `world_size` per rank. These already-normalized MFU values are then summed across ranks in `train.py`, leading to an incorrect, double-normalized MFU metric in distributed training.
Note
Introduces a benchmark mode that reports rolling throughput, MFU, and memory per step using a new PerfCounter and GPU peak FLOPs detection, with aggregated results displayed as a table.
- `TrainingConfig.benchmark` to enable benchmark mode (caps steps to 5) and collect per-step metrics: `throughput` (tok/s), `mfu` (%), `peak_memory` (GB), and `step_duration`.
- Wires `PerfCounter` into the training loop; counts tokens per step, computes rolling throughput and MFU; aggregates metrics across ranks and renders a summary table.
- `hud/rl/perf.py` implementing `PerfCounter` with a rolling window, FLOPs/token estimation from `PretrainedConfig`, GPU peak FLOPs lookup, and a singleton accessor.
- `get_peak_flops(device_name)` to detect the GPU via `lspci` and return BF16 peak FLOPs for common GPUs; falls back with warnings.

Written by Cursor Bugbot for commit bed6223. This will update automatically on new commits.