
RuntimeError: You can't move a model that has some modules offloaded to cpu or disk. #12

Open
todayisYu opened this issue Oct 19, 2024 · 3 comments

Comments

@todayisYu

Hi, I also have a problem training TWOSOME in the Tomato Salad environment. Running sh scripts/tomato_salad_ppo_llm.sh produces the following error:

pygame 2.4.0 (SDL 2.26.4, Python 3.9.20)
Hello from the pygame community. https://www.pygame.org/contribute.html
You are using the default legacy behaviour of the <class 'transformers.models.llama.tokenization_llama.LlamaTokenizer'>. This is expected, and simply means that the `legacy` (previous) behavior will be used so nothing changes for you. If you want to use the new behaviour, set `legacy=False`. This should only be set if you understand what it means, and thoroughly read the reason why this was added as explained in https://github.com/huggingface/transformers/pull/24565 - if you loaded a llama tokenizer from a GGUF file you can ignore this message
Loading checkpoint shards: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 2/2 [00:01<00:00,  1.14it/s]
Some parameters are on the meta device because they were offloaded to the cpu.
You shouldn't move a model that is dispatched using accelerate hooks.
Traceback (most recent call last):
  File "/home/wenyi/huangyujie/TWOSOME/TWOSOME/twosome/overcooked/ppo_llm_pomdp.py", line 192, in <module>
    agent = LLMAgent(normalization_mode=args.normalization_mode, load_8bit=args.load_8bit, task=args.task)
  File "/home/wenyi/huangyujie/TWOSOME/TWOSOME/twosome/overcooked/policy_pomdp.py", line 73, in __init__
    self.llama = self._init_llama()
  File "/home/wenyi/huangyujie/TWOSOME/TWOSOME/twosome/overcooked/policy_pomdp.py", line 92, in _init_llama
    model.half().to(self.device)
  File "/home/wenyi/.conda/envs/huangyujie_twosome/lib/python3.9/site-packages/accelerate/big_modeling.py", line 456, in wrapper
    raise RuntimeError("You can't move a model that has some modules offloaded to cpu or disk.")
RuntimeError: You can't move a model that has some modules offloaded to cpu or disk.

I'm guessing this is because there isn't enough space on the GPU. Could you give me some information about which devices are supported? And could the torch version be the reason?
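For anyone hitting the same traceback: the error comes from accelerate, which forbids calling `.to(device)` on a model whose layers were partially offloaded at load time. A minimal sketch (the helper and example device maps below are hypothetical, but `hf_device_map` is the attribute transformers sets on models loaded with a `device_map`) of how to check for offloaded modules before moving the model:

```python
# Hypothetical helper: inspect an accelerate-style device map (as exposed
# via model.hf_device_map) to see whether any module was placed on CPU or
# disk. If any was, model.to(device) will raise the RuntimeError above.
def offloaded_modules(device_map):
    """Return the names of modules offloaded to 'cpu' or 'disk'."""
    return [name for name, dev in device_map.items() if dev in ("cpu", "disk")]

# Illustrative device maps in the shape transformers reports them:
full_gpu = {"model.embed_tokens": 0, "model.layers.0": 0, "lm_head": 0}
partial = {"model.embed_tokens": 0, "model.layers.0": "cpu", "lm_head": "disk"}

print(offloaded_modules(full_gpu))  # [] -> safe to call model.to(device)
print(offloaded_modules(partial))  # non-empty -> .to() would raise
```

If the list is non-empty, the fix is not to move the model but to ensure it fits on the GPU in the first place (smaller model, 8-bit loading, or more VRAM).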

@WeihaoTan
Owner

Thanks for reaching out. I think you are right. Ideally, the training code needs slightly less than 40GB of VRAM, so it can be trained on an A100 40G. You can try using a smaller batch size. I do not think the torch version will solve the issue.

@todayisYu
Author

todayisYu commented Oct 29, 2024

Thanks for replying! Actually, I have four 2080 Ti 11G cards. Can I give it a try? If possible, how should I modify the code?

@WeihaoTan
Owner

I am not 100% sure, but I think you can give it a try. You will need some model/data/pipeline parallelism tricks. Using DeepSpeed might also be helpful. You would need to add these modules to the current code.
