added functionality for gbs to scale up with number of gpus #2551
rsalagame-nvidia wants to merge 1 commit into r0.3.0 from
Conversation
📝 Walkthrough
Estimated code review effort: 🎯 2 (Simple) | ⏱️ ~8 minutes
🚥 Pre-merge checks: ✅ 3 passed | ❌ 1 warning
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@scripts/performance/utils/utils.py`:
- Around line 183-186: The code silently truncates scaled global batch size by
using int(base_config.gbs_scaling_factor * num_gpus); update the logic that
computes gbs_size (when num_gpus != base_config.num_gpus) to compute the
multiplication in integer space using divmod or equivalent (e.g., quotient,
remainder = divmod(base_config.gbs_scaling_factor * num_gpus, 1) or better
compute numerator/denominator so you can divmod exactly) and raise a clear
exception if the result is not an integer so we fail fast; reference the symbols
gbs_size, base_config.gbs_scaling_factor, num_gpus and ensure the error message
makes it obvious this affects the scaling used in set_post_overrides.
```python
elif num_gpus != base_config.num_gpus:
    # Scale GBS with num_gpus so experiment name matches the scaled GBS applied in set_post_overrides
    gbs_size = int(base_config.gbs_scaling_factor * num_gpus)
else:
```
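To see the hazard the reviewer flags in this line, here is a minimal standalone sketch of the truncation; the config values are hypothetical, chosen so the scaled product is not integral (the repository's existing configs all happen to divide evenly):

```python
# Hypothetical base config: gbs_scaling_factor = global_batch_size / num_gpus
base_gbs = 256
base_num_gpus = 24
gbs_scaling_factor = base_gbs / base_num_gpus  # ~10.667, a float

# Scaling to a new GPU count via int() silently drops the fractional part.
num_gpus = 32
scaled = gbs_scaling_factor * num_gpus  # ~341.333
gbs_size = int(scaled)

print(gbs_size)  # 341 -- a third of a batch is discarded without any warning
```

Nothing in this path raises or logs, so a mis-sized config would only show up later as a mismatch between the experiment name and the GBS actually applied.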
🧩 Analysis chain
🏁 Script executed:
```shell
# First, let's read the target file to see the actual code and context
cat -n scripts/performance/utils/utils.py | head -200 | tail -40
```
Repository: NVIDIA-NeMo/Megatron-Bridge
🏁 Script executed:
```shell
# Check file size first
wc -l scripts/performance/utils/utils.py
```
🏁 Script executed:
```shell
# Search for gbs_scaling_factor definition and usage
rg "gbs_scaling_factor" --context=3
```
🏁 Script executed:
```shell
# Search for WorkloadBaseConfig definition and global_batch_size
rg "class WorkloadBaseConfig|global_batch_size" --context=2 -A 5
```
🏁 Script executed:
```shell
# Get more context around the gbs_scaling_factor usage in overrides.py
rg "gbs_scaling_factor" scripts/performance/utils/overrides.py -B 5 -A 5
```
🏁 Script executed:
```shell
# Look for WorkloadBaseConfig class definition
rg "class WorkloadBaseConfig" --context=10 -A 20
```
🏁 Script executed:
```shell
# Check if there are any test files that test this scaling behavior
fd ".*test.*" --type f -e py | xargs grep -l "gbs_scaling_factor\|generate_experiment_config" 2>/dev/null | head -10
```
🏁 Script executed:
```shell
# Find workload_base_configs.py files to understand the actual configs
find . -name "*workload_base_configs*" -type f
```
🏁 Script executed:
```shell
# Let's check what actual base configs exist and their values
rg "WorkloadBaseConfig\(" --context=3 scripts/performance/configs/ -A 3 | head -100
```
🏁 Script executed:
```shell
# Search for any existing tests related to generate_experiment_config
rg "generate_experiment_config" --context=5
```
🏁 Script executed:
```shell
# Check one of the actual workload config files to see structure
cat scripts/performance/configs/llama/llama3_workload_base_configs.py
```
🏁 Script executed:
```shell
# Run the verification script from the review comment
python - <<'PY'
import ast
from pathlib import Path

def int_lit(node):
    return node.value if isinstance(node, ast.Constant) and isinstance(node.value, int) else None

found = False
for path in sorted(Path(".").rglob("*workload_base_configs.py")):
    try:
        tree = ast.parse(path.read_text())
    except Exception:
        continue
    for node in ast.walk(tree):
        if not isinstance(node, ast.Call):
            continue
        fn = node.func
        name = fn.id if isinstance(fn, ast.Name) else (fn.attr if isinstance(fn, ast.Attribute) else None)
        if name != "WorkloadBaseConfig":
            continue
        vals = {kw.arg: int_lit(kw.value) for kw in node.keywords if kw.arg in {"num_gpus", "global_batch_size"}}
        ng = vals.get("num_gpus")
        gbs = vals.get("global_batch_size")
        if ng is None or gbs is None:
            continue
        # Check if any common target GPU counts would cause non-exact division
        problematic = []
        for target_gpus in [1, 2, 4, 8, 16, 32, 64, 128, 256, 512]:
            if ng > 0 and (gbs * target_gpus) % ng != 0:
                truncated = int((gbs * target_gpus) // ng)
                exact = (gbs * target_gpus) / ng
                problematic.append(f"target={target_gpus}: truncates {exact:.3f}→{truncated}")
        if problematic:
            found = True
            print(f"{path.name}:{node.lineno}: num_gpus={ng}, gbs={gbs}")
            for p in problematic[:3]:  # Show first 3
                print(f"  {p}")
if not found:
    print("✓ No truncation risks found in literal WorkloadBaseConfig definitions")
PY
```
🏁 Script executed:
```shell
# Check if there are any tests for the generate_experiment_config function
find . -name "*.py" -path "*/test*" -exec grep -l "generate_experiment_config\|gbs_scaling" {} \;
```
🏁 Script executed:
```shell
# Look at how set_post_overrides uses gbs_scaling_factor to understand the relationship better
rg "set_post_overrides" -B 3 -A 15 scripts/performance/utils/overrides.py | head -80
```
🏁 Script executed:
```shell
# Get the full set_post_overrides function to understand the context
sed -n '1,300p' scripts/performance/utils/overrides.py | tail -150
```
🏁 Script executed:
```shell
# Get the complete set_post_overrides function
rg "def set_post_overrides" -A 80 scripts/performance/utils/overrides.py
```
🏁 Script executed:
```shell
# Also check how generate_experiment_config is used - in what context is it called
rg "generate_experiment_config" -B 5 -A 5 scripts/performance/utils/
```
🏁 Script executed:
```shell
# Search more broadly for generate_experiment_config usage
rg "generate_experiment_config" -B 2 -A 2
```
🏁 Script executed:
```shell
# Check what happens after set_post_overrides - does it use the gbs_size value from generate_experiment_config?
cat scripts/performance/utils/utils.py | sed -n '140,200p'
```
Validate GBS scaling is exact; fail fast if target GPU count causes non-integral scaling.
The current code computes `gbs_size = int(base_config.gbs_scaling_factor * num_gpus)`, which silently truncates when the scaled batch size is not an integer. While all existing configs (e.g., 128÷8=16.0, 128÷64=2.0, 256÷64=4.0) divide evenly, future configs may not. Computing in integer space with `divmod()` and raising an explicit error if the scaled GBS is not integral prevents silent data loss and catches config errors early.
Proposed fix

```diff
-    elif num_gpus != base_config.num_gpus:
-        # Scale GBS with num_gpus so experiment name matches the scaled GBS applied in set_post_overrides
-        gbs_size = int(base_config.gbs_scaling_factor * num_gpus)
+    elif num_gpus != base_config.num_gpus:
+        # Keep scaling in integer space to avoid float truncation.
+        scaled_gbs, remainder = divmod(base_config.global_batch_size * num_gpus, base_config.num_gpus)
+        if remainder != 0:
+            raise ValueError(
+                "Scaled global_batch_size is not an integer for the requested GPU count: "
+                f"global_batch_size={base_config.global_batch_size}, "
+                f"base_num_gpus={base_config.num_gpus}, num_gpus={num_gpus}"
+            )
+        gbs_size = scaled_gbs
```
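The proposed logic can be exercised in isolation. This is a sketch with a stand-in dataclass in place of the real `WorkloadBaseConfig`; the `scale_gbs` wrapper and the example values are illustrative, not taken from the repository:

```python
from dataclasses import dataclass


@dataclass
class BaseConfig:
    """Stand-in for the relevant fields of WorkloadBaseConfig."""
    global_batch_size: int
    num_gpus: int


def scale_gbs(base_config: BaseConfig, num_gpus: int) -> int:
    # Keep scaling in integer space to avoid float truncation.
    scaled_gbs, remainder = divmod(base_config.global_batch_size * num_gpus, base_config.num_gpus)
    if remainder != 0:
        raise ValueError(
            "Scaled global_batch_size is not an integer for the requested GPU count: "
            f"global_batch_size={base_config.global_batch_size}, "
            f"base_num_gpus={base_config.num_gpus}, num_gpus={num_gpus}"
        )
    return scaled_gbs


print(scale_gbs(BaseConfig(128, 64), 128))  # 256: doubling GPUs doubles GBS exactly

try:
    scale_gbs(BaseConfig(100, 64), 8)  # 100 * 8 / 64 = 12.5 -> raises instead of truncating
except ValueError as e:
    print(e)
```

Because the multiplication happens before the division, the check is exact for arbitrarily large integers; there is no float rounding anywhere in the path.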
Added functionality for GBS to scale up with the number of GPUs, which was missing previously.