-
Notifications
You must be signed in to change notification settings - Fork 766
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Update cluster train doc #95
Update cluster train doc #95
Conversation
注NCCL2模式目前仅支持trainer API,NCCL2方式并没有很多可选项,也没有"transpiler",所以并没有底层API。 | ||
使用NCCL2方式同样需要配置每个节点的环境变量,此处与parameter server模式有所不同,并不需要启动独立的\ | ||
parameter server的进程,只需要启动多个trainer进程即可。 | ||
NCCL2模式的分布式训练,由于没有parameter server角色,是trainer之间互相通信,使用是注意: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
使用“时”注意
开启内存优化 | ||
++++++++++++ | ||
|
||
在parameter server分布式训练模式下,要开启内存优化 :code:`memory_optimize` 和单机相比,需要注意按照下面 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
related PR: PaddlePaddle/Paddle#13535 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
以下截图为修改部分预览