
RuntimeError: You can't move a model that has some modules offloaded to cpu or disk. #12

Open
todayisYu opened this issue Oct 19, 2024 · 3 comments

Comments

@todayisYu

Hi, I also have a problem training TWOSOME in the Tomato Salad environment. Running sh scripts/tomato_salad_ppo_llm.sh produces the following error:

pygame 2.4.0 (SDL 2.26.4, Python 3.9.20)
Hello from the pygame community. https://www.pygame.org/contribute.html
You are using the default legacy behaviour of the <class 'transformers.models.llama.tokenization_llama.LlamaTokenizer'>. This is expected, and simply means that the `legacy` (previous) behavior will be used so nothing changes for you. If you want to use the new behaviour, set `legacy=False`. This should only be set if you understand what it means, and thoroughly read the reason why this was added as explained in https://github.com/huggingface/transformers/pull/24565 - if you loaded a llama tokenizer from a GGUF file you can ignore this message
Loading checkpoint shards: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 2/2 [00:01<00:00,  1.14it/s]
Some parameters are on the meta device because they were offloaded to the cpu.
You shouldn't move a model that is dispatched using accelerate hooks.
Traceback (most recent call last):
  File "/home/wenyi/huangyujie/TWOSOME/TWOSOME/twosome/overcooked/ppo_llm_pomdp.py", line 192, in <module>
    agent = LLMAgent(normalization_mode=args.normalization_mode, load_8bit=args.load_8bit, task=args.task)
  File "/home/wenyi/huangyujie/TWOSOME/TWOSOME/twosome/overcooked/policy_pomdp.py", line 73, in __init__
    self.llama = self._init_llama()
  File "/home/wenyi/huangyujie/TWOSOME/TWOSOME/twosome/overcooked/policy_pomdp.py", line 92, in _init_llama
    model.half().to(self.device)
  File "/home/wenyi/.conda/envs/huangyujie_twosome/lib/python3.9/site-packages/accelerate/big_modeling.py", line 456, in wrapper
    raise RuntimeError("You can't move a model that has some modules offloaded to cpu or disk.")
RuntimeError: You can't move a model that has some modules offloaded to cpu or disk.

I'm guessing this is because there isn't enough space on the GPU. Could you give me some information about which devices are supported? And could the torch version be the reason?
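For anyone hitting the same traceback: the error comes from accelerate, which forbids calling `.to(device)` on a model whose layers were partially offloaded at load time. A minimal sketch (the helper and example device maps below are hypothetical, but `hf_device_map` is the attribute transformers sets on models loaded with a `device_map`) of how to check for offloaded modules before moving the model:

```python
# Hypothetical helper: inspect an accelerate-style device map (as exposed
# via model.hf_device_map) to see whether any module was placed on CPU or
# disk. If any was, model.to(device) will raise the RuntimeError above.
def offloaded_modules(device_map):
    """Return the names of modules offloaded to 'cpu' or 'disk'."""
    return [name for name, dev in device_map.items() if dev in ("cpu", "disk")]

# Illustrative device maps in the shape transformers reports them:
full_gpu = {"model.embed_tokens": 0, "model.layers.0": 0, "lm_head": 0}
partial = {"model.embed_tokens": 0, "model.layers.0": "cpu", "lm_head": "disk"}

print(offloaded_modules(full_gpu))  # [] -> safe to call model.to(device)
print(offloaded_modules(partial))  # non-empty -> .to() would raise
```

If the list is non-empty, the fix is not to move the model but to ensure it fits on the GPU in the first place (smaller model, 8-bit loading, or more VRAM).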

@WeihaoTan
Owner

Thanks for reaching out. I think you are right. Ideally, the training code needs slightly less than 40GB of VRAM, so it can be trained on an A100 40G. You can try using a smaller batch size. I do not think the torch version will solve the issue.

@todayisYu
Author

todayisYu commented Oct 29, 2024

Thanks for replying! Actually, I have four 2080 Ti 11G cards. Can I give it a try? If possible, how should I modify the code?

@WeihaoTan
Owner

I am not 100% sure, but I think you can give it a try. You will need some model/data/pipeline parallelism tricks. Using DeepSpeed might also be helpful. You would need to add these modules to the current code.
