feat: optimize refit by preparing refit info ahead of time #638
Yup, I added it in ebb874a.
ZhiyuLi-Nvidia
left a comment
Thank you @YUki-666. LGTM!
Signed-off-by: Yuki Huang <[email protected]>
…mcore for speedup Signed-off-by: Yuki Huang <[email protected]>
Signed-off-by: Yuki Huang <[email protected]>
Signed-off-by: Yuki Huang <[email protected]> Signed-off-by: Zhiyu Li <[email protected]>
…Mo#638) Signed-off-by: Yuki Huang <[email protected]> Signed-off-by: Jialei Chen <[email protected]>
)" This reverts commit 8f7d71e
Signed-off-by: Yuki Huang <[email protected]>
…Mo#638) Signed-off-by: Yuki Huang <[email protected]>
…Mo#638) Signed-off-by: Yuki Huang <[email protected]> Signed-off-by: Qidong Su <[email protected]>
Separate the refit process changes from #613.
What does this PR do?
e_score_correction_bias) will change during training, so this case is handled specially, and `refit_param_info_mcore` is not cached for now because of this.
Test Result
convergence
time cost
In mcore w/ packing (dsv3 w/ 64 tp)
*The ~20s overhead is due to offload.

Refit Process Changes
Colocated
Previous
1. `prepare_weights_for_ipc` on the train side.
2. `get_weights_ipc_handles` on the train side and `update_weights_from_ipc_handles` on the inference side.
Now
1. `prepare_refit_info` on the train side.
2. `prepare_weights_for_ipc` on the train side.
3. `get_weights_ipc_handles` on the train side and `update_weights_from_ipc_handles` on the inference side.
Non-colocated
Previous
1. `init_collective` on both the train and inference sides.
2. `prepare_info_for_collective` on the train side.
3. `broadcast_weights_for_collective` on the train side and `update_weights_from_collective` on the inference side.
Now
1. `init_collective` on both the train and inference sides.
2. `prepare_refit_info` on both the train and inference sides.
3. `broadcast_weights_for_collective` on the train side and `update_weights_from_collective` on the inference side.
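For reviewers who want to see the new call order at a glance, here is a minimal sketch of both flows. It is not the code in this PR: `StubSide`, `call`, `run`, and `num_refits` are hypothetical names that only log which method would run on which side. It also assumes, based on the PR title, that `prepare_refit_info` (and `init_collective`) run once up front while the remaining steps repeat on every refit, and it serializes the train-side and inference-side calls that would run together in practice.

```python
class StubSide:
    """Hypothetical stand-in for the train or inference side; it only logs calls."""

    def __init__(self, name, log):
        self.name = name
        self.log = log

    def call(self, method):
        # Record "<side>.<method>" instead of doing any real work.
        self.log.append(f"{self.name}.{method}")


def colocated_refit(train, inference, num_refits):
    # Assumption: refit info is prepared once, before any refit happens.
    train.call("prepare_refit_info")
    for _ in range(num_refits):
        train.call("prepare_weights_for_ipc")
        train.call("get_weights_ipc_handles")
        inference.call("update_weights_from_ipc_handles")


def non_colocated_refit(train, inference, num_refits):
    # Assumption: collective setup and refit info are both one-time steps.
    for side in (train, inference):
        side.call("init_collective")
    for side in (train, inference):
        side.call("prepare_refit_info")
    for _ in range(num_refits):
        train.call("broadcast_weights_for_collective")
        inference.call("update_weights_from_collective")


def run(flow, num_refits):
    """Run one flow against fresh stubs and return the ordered call log."""
    log = []
    flow(StubSide("train", log), StubSide("inference", log), num_refits)
    return log


# Print the non-colocated call order for two refits.
for entry in run(non_colocated_refit, num_refits=2):
    print(entry)
```

Under these assumptions, the one-time steps appear once at the top of the log regardless of `num_refits`, which is the part of the flow this PR moves ahead of time.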