Fix sync_bn, get the correct nccl comm. #45100
Merged
PR types
Bug fixes
PR changes
Others
Describe
With the new communication library, sync_bn degrades to plain batch norm semantics: the nccl_comm obtained from the context is null, so the all-reduce that synchronizes statistics across devices is never performed. This PR gets the nccl comm from the global ProcessGroup instead; if no global nccl comm exists, it falls back to the original behavior.
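The effect of a missing communicator can be illustrated with a small, framework-free sketch. This is not the code changed in this PR: `resolve_comm`, `batch_stats`, and `fake_all_reduce` are hypothetical names, and the communicator is modeled as a plain Python callable rather than a real nccl comm. It only shows why skipping the all-reduce yields per-device statistics, and how preferring a global communicator with a fallback restores the synchronized result.

```python
from typing import Callable, List, Optional, Sequence, Tuple

# Hypothetical stand-in for an nccl comm: a callable that sums a small
# vector of partial statistics across all ranks.
AllReduceSum = Callable[[List[float]], List[float]]


def resolve_comm(
    global_group_comm: Optional[AllReduceSum],
    context_comm: Optional[AllReduceSum],
) -> Optional[AllReduceSum]:
    """Prefer the global process group's communicator, else the context's."""
    return global_group_comm if global_group_comm is not None else context_comm


def batch_stats(
    local_values: Sequence[float],
    comm: Optional[AllReduceSum] = None,
) -> Tuple[float, float]:
    """Return (mean, biased variance), synchronized only if comm is given."""
    partial = [
        float(len(local_values)),          # element count
        sum(local_values),                 # sum of values
        sum(v * v for v in local_values),  # sum of squares
    ]
    if comm is not None:
        # Synchronized path: combine partial statistics from every rank.
        partial = comm(partial)
    count, total, total_sq = partial
    mean = total / count
    var = total_sq / count - mean * mean
    return mean, var


def fake_all_reduce(other_rank_partial: List[float]) -> AllReduceSum:
    """Simulate a two-rank all-reduce by adding the other rank's partials."""
    def all_reduce_sum(partial: List[float]) -> List[float]:
        return [a + b for a, b in zip(partial, other_rank_partial)]
    return all_reduce_sum


rank0_values = [1.0, 3.0]
# Count, sum, and sum of squares for a second rank holding [5.0, 7.0].
rank1_partial = [2.0, 12.0, 74.0]

# No communicator anywhere: statistics cover only this rank's data.
unsynced = batch_stats(rank0_values, resolve_comm(None, None))

# Communicator found on the global group: statistics cover both ranks.
synced = batch_stats(
    rank0_values, resolve_comm(fake_all_reduce(rank1_partial), None)
)
```

In this sketch `unsynced` reflects only the two local values, while `synced` reflects all four values across both simulated ranks, which is the difference between plain batch norm and sync batch norm that the description refers to.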