Signed-off-by: Youngeun Kwon <youngeunk@nvidia.com>
Walkthrough

The pull request refactors NCCL UB (user buffer) override handling by extracting it from the Megatron FSDP configuration function into a dedicated helper function, and updates call sites accordingly. Additionally, the NCCL environment variable configuration is expanded.
/ok to test 3b98db1
Signed-off-by: Youngeun Kwon <youngeunk@nvidia.com>
/ok to test acc60cf
Hi @malay-nagda, @erhoo82, and @ko3n1g, could I ask for your approval to merge this?
    if nccl_ub:
        recipe.ddp.nccl_ub = True
        # The current version of NCCL does not support the AVG operation for reductions with symmetric kernels.
        # To enable symmetric kernels, average_in_collective must be disabled.
Can we add a TODO comment saying that this condition should be removed once NCCL supports the AVG operation with symmetric kernels?
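For illustration, the override under review, together with the requested TODO, might look roughly like the sketch below. This is not the PR's actual code: `DDPConfig`, `Recipe`, and `apply_nccl_ub_overrides` are hypothetical stand-ins, and the assumption that `average_in_collective` lives on `recipe.ddp` is inferred only from the code comment in the diff.

```python
from dataclasses import dataclass, field


@dataclass
class DDPConfig:
    # Hypothetical stand-in for the recipe's DDP settings.
    nccl_ub: bool = False
    average_in_collective: bool = True


@dataclass
class Recipe:
    # Hypothetical stand-in for the real recipe/config container.
    ddp: DDPConfig = field(default_factory=DDPConfig)


def apply_nccl_ub_overrides(recipe: Recipe, nccl_ub: bool) -> Recipe:
    """Hypothetical helper: apply NCCL UB related overrides to a recipe."""
    if nccl_ub:
        recipe.ddp.nccl_ub = True
        # The current version of NCCL does not support the AVG operation
        # for reductions with symmetric kernels, so averaging inside the
        # collective is disabled (assumed attribute location).
        # TODO: remove this override once NCCL supports AVG with symmetric kernels.
        recipe.ddp.average_in_collective = False
    return recipe
```

Keeping the override in one small helper means the TODO has a single place to be resolved when NCCL adds the missing support.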
What does this PR do?
Add a one line overview of what this PR aims to accomplish.
Changelog
GitHub Actions CI
See the CI section in the Contributing doc for how to trigger the CI. An NVIDIA developer will need to approve and trigger the CI for external contributors.
Before your PR is "Ready for review"
Pre checks:
If you haven't finished some of the above items, you can still open a "Draft" PR.
Additional Information
Summary by CodeRabbit
Performance
Refactor