Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix the problem in fleet executor stop #38114

Merged
merged 2 commits into from
Dec 17, 2021

Conversation

LiYuRio
Copy link
Contributor

@LiYuRio LiYuRio commented Dec 14, 2021

PR types

Bug fixes

PR changes

Others

Describe

修复之前多卡下FleetExecutor停不下来的问题,修改对象的持有关系,将MessageBus从单例变为FleetExecutor的成员,由FleetExecutor管理它的生命周期,向Carrier传递MessageBus指针,Interceptor持有拥有者Carrier的指针,一个FleetExecutor未来会拥有多个Carrier,最终在FleetExecutor的析构函数中向Carrier上的源Interceptor发送停止的控制消息。

c7de75c79446a8876311169db385c7fe

@paddle-bot-old
Copy link

Thanks for your contribution!
Please wait for the result of CI firstly. See Paddle CI Manual for details.

Copy link
Contributor

@FeixLiu FeixLiu left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

用GPT对一下pp=dp=mp=2的精度再合吧,以后的pr都跑一下精度再合?

@@ -29,9 +29,8 @@ void InterceptorMessageServiceImpl::InterceptorMessageService(
VLOG(3) << "Interceptor Message Service receives a message from interceptor "
<< request->src_id() << " to interceptor " << request->dst_id()
<< ", with the message: " << request->message_type();
FleetExecutor::GetCarrier().EnqueueInterceptorMessage(*request);
response->set_rst(true);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

突然想到,这其实可以set_rst为EnqueueInterceptorMessage的返回值,下个pr可以改一下。

Copy link
Contributor

@wangxicoding wangxicoding left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@FeixLiu
Copy link
Contributor

FeixLiu commented Dec 17, 2021

rebase一下最新develop分支跑一下精度吧 🧐

Copy link
Contributor

@FeixLiu FeixLiu left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

Copy link
Contributor

@chenwhql chenwhql left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM for PADDLE_ENFORCE

@FeixLiu FeixLiu merged commit 843435f into PaddlePaddle:develop Dec 17, 2021
@LiYuRio LiYuRio deleted the dev_fix_fleet_executor_stop branch December 31, 2021 10:00
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants