Skip to content

Fix: perf script ddp nccl-ub#2158

Merged
youngeunkwon0405 merged 4 commits intomainfrom
fix-perf-script-ddp-nccl-ub
Feb 4, 2026
Merged

Fix: perf script ddp nccl-ub#2158
youngeunkwon0405 merged 4 commits intomainfrom
fix-perf-script-ddp-nccl-ub

Conversation

@youngeunkwon0405
Copy link
Copy Markdown
Contributor

@youngeunkwon0405 youngeunkwon0405 commented Jan 31, 2026

What does this PR do ?

Add a one line overview of what this PR aims to accomplish.

Changelog

  • Add specific line by line info of high level changes in this PR.

GitHub Actions CI

See the CI sectionin the Contributing doc for how to trigger the CI. A Nvidia developer will need to approve and trigger the CI for external contributors.

Before your PR is "Ready for review"

Pre checks:

  • Make sure you read and followed Contributor guidelines
  • Did you write any new necessary tests?
  • Did you add or update any necessary documentation?
  • Does the PR affect components that are optional to install? (Ex: Numba, Pynini, Apex etc)
    • Reviewer: Does the PR have correct import guards for all optional libraries?

If you haven't finished some of the above items you can still open "Draft" PR.

Additional Information

  • Related to # (issue)

Summary by CodeRabbit

  • Performance

    • Enhanced NCCL configuration settings for distributed training optimization.
  • Refactor

    • Improved internal organization of configuration management.

✏️ Tip: You can customize this high-level summary in your review settings.

Signed-off-by: Youngeun Kwon <youngeunk@nvidia.com>
Signed-off-by: Youngeun Kwon <youngeunk@nvidia.com>
@copy-pr-bot
Copy link
Copy Markdown

copy-pr-bot bot commented Jan 31, 2026

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@coderabbitai
Copy link
Copy Markdown
Contributor

coderabbitai bot commented Jan 31, 2026

📝 Walkthrough

Walkthrough

The pull request refactors NCCL UB (Undefined Behavior) override handling by extracting it from the Megatron FSDP configuration function into a dedicated helper function, and updates call sites accordingly. Additionally, NCCL environment variable configuration is expanded to include NCCL_CTA_POLICY="1" alongside the existing NCCL_NVLS_ENABLE="1" setting.

Changes

Cohort / File(s) Summary
NCCL Configuration
scripts/performance/setup_experiment.py
Enhanced NCCL environment variable configuration to set both NCCL_NVLS_ENABLE="1" and NCCL_CTA_POLICY="1" when nccl_ub is enabled, replacing the previous single-variable assignment.
Override Helper Refactoring
scripts/performance/utils/overrides.py
Extracted nccl_ub parameter from _set_megatron_fsdp_overrides() signature and created new _set_nccl_ub_overrides() helper function to manage NCCL UB-related configuration. Updated two call sites (set_workload_base_configs and set_user_overrides) to invoke both functions separately.

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~10 minutes

🚥 Pre-merge checks | ✅ 2 | ❌ 2
❌ Failed checks (1 warning, 1 inconclusive)
Check name Status Explanation Resolution
Test Results For Major Changes ⚠️ Warning PR modifies critical NCCL settings and refactors override functions affecting distributed training performance and convergence, but contains no test results or validation evidence. Add test results and before-and-after performance metrics from training runs with modified NCCL settings to validate no convergence or performance regression occurred.
Title check ❓ Inconclusive The title 'Fix: perf script ddp nccl-ub' is vague and uses abbreviated terminology (ddp, nccl-ub) without clearly explaining what issue is being fixed or what changes are made. Expand the title to be more descriptive, such as 'Fix NCCL UB configuration in performance script by separating FSDP and NCCL override handling' to clearly communicate the primary change.
✅ Passed checks (2 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Docstring Coverage ✅ Passed Docstring coverage is 100.00% which is sufficient. The required threshold is 80.00%.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing touches
  • 📝 Generate docstrings
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Post copyable unit tests in a comment
  • Commit unit tests in branch fix-perf-script-ddp-nccl-ub

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

@youngeunkwon0405
Copy link
Copy Markdown
Contributor Author

/ok to test 3b98db1

Signed-off-by: Youngeun Kwon <youngeunk@nvidia.com>
@youngeunkwon0405
Copy link
Copy Markdown
Contributor Author

/ok to test acc60cf

@youngeunkwon0405
Copy link
Copy Markdown
Contributor Author

Hi, @malay-nagda, @erhoo82, and @ko3n1g, can I ask your approval for the merge?

if nccl_ub:
recipe.ddp.nccl_ub = True
# The current version of NCCL does not support the AVG operation for reductions with symmetric kernels.
# To enable symmetric kernels, average_in_collective must be disabled.
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we add a comment saying that TODO: need to remove this condition upon the NCCL support

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

bug Something isn't working r0.3.0 Cherry-pick label for r0.3.0 release branch

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants