Update cluster train doc #95

Merged

@typhoonzero typhoonzero commented Sep 26, 2018

The screenshots below preview the modified sections.

[4 screenshots]

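Note: NCCL2 mode currently supports only the trainer API. NCCL2 mode has few options and no "transpiler", so it has no low-level API.
Using NCCL2 mode also requires configuring environment variables on every node. Unlike parameter server mode, there is no need to start
separate parameter server processes; starting multiple trainer processes is enough.
Because NCCL2-mode distributed training has no parameter server role and the trainers communicate with each other directly, 使用是注意 [sic, intended: "note the following when using it"]: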
Typo: this should read 使用“时”注意 ("note when using it"), with 时 instead of 是.
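
For the NCCL2 passage quoted above, a minimal sketch of what "configure environment variables on every node and start only trainer processes" could look like. The variable names (`PADDLE_TRAINER_IPS`, `PADDLE_TRAINER_ID`, `PADDLE_PSERVER_PORT`, `PADDLE_CURRENT_IP`), the default port, and the helper `build_trainer_envs` are assumptions for illustration and are not confirmed by this thread; check the cluster training doc for the actual names.

```python
def build_trainer_envs(trainer_ips, port=6174):
    """Build one environment-variable mapping per trainer process.

    All variable names and the default port are assumed placeholders,
    not names confirmed by this pull request.
    """
    ip_list = ",".join(trainer_ips)
    return [
        {
            "PADDLE_TRAINER_IPS": ip_list,          # every trainer sees the full peer list
            "PADDLE_TRAINER_ID": str(trainer_id),   # unique index, starting at 0
            "PADDLE_PSERVER_PORT": str(port),       # port the trainers use to reach each other
            "PADDLE_CURRENT_IP": ip,                # address of the node running this trainer
        }
        for trainer_id, ip in enumerate(trainer_ips)
    ]


if __name__ == "__main__":
    # No parameter server entries are generated: only trainers are described.
    for env in build_trainer_envs(["192.168.0.1", "192.168.0.2"]):
        print(env)
```

Each mapping would be merged into the environment of one trainer process on the matching node, for example with `subprocess.Popen(cmd, env={**os.environ, **env})`.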

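Enabling memory optimization
++++++++++++++++++++++++++++

In parameter server distributed training mode, enabling memory optimization with :code:`memory_optimize` differs from the single-machine case, and care is needed to follow the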
There are some extra spaces here.
[screenshot]

@gongweibao gongweibao left a comment

LGTM

@typhoonzero

related PR: PaddlePaddle/Paddle#13535

@typhoonzero

[3 screenshots]

Comments Done.

@shanyi15 shanyi15 left a comment

LGTM

@typhoonzero typhoonzero merged commit 9cfbb06 into PaddlePaddle:develop Sep 29, 2018
3 participants