solve the RuntimeError: Tensors must be CUDA and dense #33

Open. Wants to merge 3 commits into base: main

13416157913

1. Update Megatron-LLaMA/megatron/core/parallel_state.py
2. Update Megatron-LLaMA/megatron/optimizer/overlapped_dist_optimizer.py
3. Update Megatron-LLaMA/megatron/optimizer/distrib_optimizer.py

Add world_size to the initialize_model_parallel function, to decide between gloo and nccl.
Add world_size to the save_parameter_state function, to decide between CPU and CUDA.
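
For readers who have not opened the diff, the idea can be sketched roughly as below. This is only an illustration, not the code in this PR: the helper names `choose_backend` and `buffer_device` and the `gloo_max_world_size` threshold are hypothetical, and the actual world_size rule used in the commits may differ. What is certain is that nccl collectives need CUDA tensors, while gloo can work on CPU tensors, so the device of the buffers has to follow the backend of the group.

```python
def choose_backend(world_size: int, gloo_max_world_size: int = 8) -> str:
    """Pick a backend name from the world size.

    Hypothetical rule for illustration only: jobs up to
    gloo_max_world_size ranks use gloo, larger jobs use nccl.
    The threshold is an assumption, not taken from the PR.
    """
    if world_size < 1:
        raise ValueError("world_size must be at least 1")
    return "gloo" if world_size <= gloo_max_world_size else "nccl"


def buffer_device(backend: str) -> str:
    """Return the device on which gather buffers should be allocated.

    nccl collectives require CUDA tensors; gloo accepts CPU tensors.
    """
    if backend == "nccl":
        return "cuda"
    if backend == "gloo":
        return "cpu"
    raise ValueError(f"unsupported backend: {backend!r}")


# Example: decide where to allocate buffers before saving optimizer state.
backend = choose_backend(world_size=16)
device = buffer_device(backend)
print(backend, device)
```

Keeping the two decisions in separate helpers means the buffer device can never disagree with the group backend, which is the kind of mismatch that produces "Tensors must be CUDA and dense".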
@li-yi-dong
Collaborator

What problem is this PR trying to solve?

@13416157913
Author

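@li-yi-dong It fixes an error raised when saving a checkpoint after training finishes, in multi-node distributed training with the nccl backend. It is the following issue:
#32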
