
Add the DistributedFusedLamb optimizer #39148

Merged: 24 commits merged into PaddlePaddle:develop on Feb 18, 2022

Conversation

@sneaxiy (Collaborator) commented Jan 22, 2022

PR types

New features

PR changes

OPs

Describe

Add the hybrid parallel DistributedFusedLamb optimizer. The update proceeds as follows; a sketch of the sharding and communication pattern follows the list.

  • At the beginning, every worker holds the full parameters, the local gradients of the full parameters, and its shard of the moments.
  • The local gradients of the full parameters are reduce-scattered, so each worker holds its shard of the reduced gradients.
  • Each worker computes its partial trust ratio div tensor from its reduce-scattered gradient shard.
  • If global norm clipping is needed, each worker computes the local squared L2-norm of its gradient shard and calls ncclAllReduce to obtain the global squared L2-norm of the gradients.
  • Each worker computes the squared L2-norm of the full parameters.
  • Each worker computes the local squared L2-norm of its partial trust ratio div tensor, then calls ncclAllReduce to obtain the global squared L2-norm.
  • Each worker updates its parameter shard and calls ncclAllGather to obtain the full updated parameters.
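
Below is a minimal, single-process NumPy sketch of the sharding and communication pattern described above. The reduce_scatter and all_gather helpers are local stand-ins for ncclReduceScatter/ncclAllGather, and the LAMB math is simplified (no bias correction, weight decay, clipping, or per-tensor trust ratios); it illustrates the data flow only, not the actual fused CUDA implementation in this PR.

import numpy as np

def reduce_scatter(local_grads, rank, num_workers):
    # Sum the per-worker gradients, then keep only this rank's shard.
    summed = np.sum(local_grads, axis=0)
    return np.split(summed, num_workers)[rank]

def all_gather(shards):
    # Concatenate every rank's updated shard back into the full parameter.
    return np.concatenate(shards)

num_workers, dim = 4, 8
rng = np.random.default_rng(0)
param = rng.standard_normal(dim).astype(np.float32)      # replicated on every worker
grads = [rng.standard_normal(dim).astype(np.float32)     # one local gradient per worker
         for _ in range(num_workers)]
m1 = [np.zeros(dim // num_workers, np.float32) for _ in range(num_workers)]  # partial moments
m2 = [np.zeros(dim // num_workers, np.float32) for _ in range(num_workers)]
lr, beta1, beta2, eps = 1e-3, 0.9, 0.999, 1e-6

# 1. Reduce-scatter the local gradients; each worker keeps one shard.
grad_shards = [reduce_scatter(grads, r, num_workers) for r in range(num_workers)]

# 2. (Optional clipping) summing local squared L2-norms stands in for ncclAllReduce.
global_grad_sq = sum(float(np.sum(g * g)) for g in grad_shards)

# 3. Each worker computes its partial trust_ratio_div tensor from its gradient shard.
trust_ratio_div = []
for r in range(num_workers):
    g = grad_shards[r]
    m1[r] = beta1 * m1[r] + (1 - beta1) * g
    m2[r] = beta2 * m2[r] + (1 - beta2) * g * g
    trust_ratio_div.append(m1[r] / (np.sqrt(m2[r]) + eps))

# 4. Global norms: the parameter norm is local (parameters are replicated);
#    the trust_ratio_div norm sums partial squared norms ("allreduce").
param_norm = float(np.linalg.norm(param))
div_norm = float(np.sqrt(sum(np.sum(u * u) for u in trust_ratio_div)))
trust_ratio = param_norm / max(div_norm, eps)

# 5. Each worker updates its parameter shard, then all-gathers the full parameter.
param_shards = np.split(param, num_workers)
param = all_gather([param_shards[r] - lr * trust_ratio * trust_ratio_div[r]
                    for r in range(num_workers)])
print(param.shape, global_grad_sq)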

@paddle-bot-old

Thanks for your contribution!
Please wait for the result of CI first. See the Paddle CI Manual for details.

@paddle-bot-old (bot) commented Feb 2, 2022

Sorry to inform you that the CI runs for b9a7f57 passed more than 7 days ago. To prevent PR conflicts, you need to re-run all CIs manually.

@sneaxiy changed the title from "Add DistributedFusedLamb optimizer" to "Add the DistributedFusedLamb optimizer" on Feb 10, 2022
@limin2021 (Contributor) left a comment

LGTM.

@Aurelius84 (Contributor) left a comment

LGTM for the dtype registrar.

@zhiqiu (Contributor) left a comment

LGTM

self._optimizer._set_scale(self._loss_scaling)
optimize_ops = self._optimizer.apply_gradients(params_grads)
found_inf = self._optimizer._found_inf
self._add_dynamic_loss_scaling(params_grads, found_inf)
Contributor (review comment on the snippet above):
Not important, but it seems params_grads is not used in _add_dynamic_loss_scaling in this case.

@sneaxiy (Collaborator, Author):
Yes, but it is not that important.
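
For context, a hedged sketch of why that argument can go unused: DistributedFusedLamb already produces a fused found_inf flag, so a dynamic loss-scaling update only needs that flag plus the current scale. The function below is a hypothetical, simplified stand-in, not Paddle's actual _add_dynamic_loss_scaling, and its parameter names are illustrative.

def update_loss_scaling(loss_scaling, found_inf, good_steps,
                        incr_ratio=2.0, decr_ratio=0.5, incr_every_n_steps=1000):
    # Hypothetical simplified dynamic loss scaling: only the fused found_inf
    # flag and the current scale are needed; per-parameter grads are not.
    if found_inf:
        return loss_scaling * decr_ratio, 0    # overflow: shrink the scale, reset counter
    good_steps += 1
    if good_steps >= incr_every_n_steps:
        return loss_scaling * incr_ratio, 0    # long run of finite grads: grow the scale
    return loss_scaling, good_steps

# e.g. loss_scaling, good_steps = update_loss_scaling(loss_scaling, bool(found_inf), good_steps)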

@XiaoguangHu01 (Contributor) left a comment

LGTM

@TCChenlong (Contributor) left a comment

LGTM

@XieYunshen (Contributor) left a comment

LGTM for
set_tests_properties(test_distributed_fused_lamb_op_with_clip PROPERTIES TIMEOUT 120)
set_tests_properties(test_distributed_fused_lamb_op_without_clip PROPERTIES TIMEOUT 120)

@sneaxiy sneaxiy merged commit 5df3cd6 into PaddlePaddle:develop Feb 18, 2022
@sneaxiy sneaxiy deleted the add_dist_fused_lamb branch February 18, 2022 23:50