Hi, I also have a problem training TWOSOME in the Tomato Salad environment. I ran `sh scripts/tomato_salad_ppo_llm.sh` and encountered the following error:
pygame 2.4.0 (SDL 2.26.4, Python 3.9.20)
Hello from the pygame community. https://www.pygame.org/contribute.html
You are using the default legacy behaviour of the <class 'transformers.models.llama.tokenization_llama.LlamaTokenizer'>. This is expected, and simply means that the `legacy` (previous) behavior will be used so nothing changes for you. If you want to use the new behaviour, set `legacy=False`. This should only be set if you understand what it means, and thoroughly read the reason why this was added as explained in https://github.com/huggingface/transformers/pull/24565 - if you loaded a llama tokenizer from a GGUF file you can ignore this message
Loading checkpoint shards: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 2/2 [00:01<00:00, 1.14it/s]
Some parameters are on the meta device because they were offloaded to the cpu.
You shouldn't move a model that is dispatched using accelerate hooks.
Traceback (most recent call last):
File "/home/wenyi/huangyujie/TWOSOME/TWOSOME/twosome/overcooked/ppo_llm_pomdp.py", line 192, in <module>
agent = LLMAgent(normalization_mode=args.normalization_mode, load_8bit=args.load_8bit, task=args.task)
File "/home/wenyi/huangyujie/TWOSOME/TWOSOME/twosome/overcooked/policy_pomdp.py", line 73, in __init__
self.llama = self._init_llama()
File "/home/wenyi/huangyujie/TWOSOME/TWOSOME/twosome/overcooked/policy_pomdp.py", line 92, in _init_llama
model.half().to(self.device)
File "/home/wenyi/.conda/envs/huangyujie_twosome/lib/python3.9/site-packages/accelerate/big_modeling.py", line 456, in wrapper
raise RuntimeError("You can't move a model that has some modules offloaded to cpu or disk.")
RuntimeError: You can't move a model that has some modules offloaded to cpu or disk.
I'm guessing this is because there isn't enough memory on the GPU. Could you give me some information about which devices are supported? And could the torch version be the reason?
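For context on the traceback: the `RuntimeError` comes from accelerate, which refuses `.to()` on a model whose layers were dispatched with `device_map="auto"` and partially offloaded to CPU (the "Some parameters are on the meta device" warning above). A minimal sketch of loading the weights in fp16 directly on one GPU, assuming it has enough VRAM (the model path and function name are placeholders, not the repository's code), which makes the later `model.half().to(self.device)` call unnecessary:

```python
import torch
from transformers import LlamaForCausalLM

def load_llama_on_one_gpu(model_path, device="cuda:0"):
    # Load directly in fp16 on the target GPU instead of letting accelerate
    # offload shards to CPU; nothing needs to be moved with .to() afterwards,
    # so there is no conflict with accelerate's dispatch hooks.
    return LlamaForCausalLM.from_pretrained(
        model_path,
        torch_dtype=torch.float16,
        device_map={"": device},   # every module on this single device
    )
```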
Thanks for reaching out. I think you are right. Ideally, the training code needs slightly less than 40GB of VRAM, so it can be trained on an A100 40G. You can try a smaller batch size. I do not think the torch version will solve the issue.
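As a rough illustration of the "smaller batch size" idea: process the PPO update in smaller minibatches and accumulate gradients, so the effective batch stays the same while peak activation memory drops. All names here (`agent`, `optimizer`, `minibatches`, `ppo_loss`) are illustrative, not the repository's actual variables.

```python
def ppo_update_with_accumulation(agent, optimizer, minibatches, ppo_loss,
                                 accumulation_steps=4):
    """Step the optimizer only every `accumulation_steps` small minibatches,
    so only one small chunk of activations is alive at a time."""
    optimizer.zero_grad()
    for i, minibatch in enumerate(minibatches):
        # Scale so the accumulated gradient matches one big-batch update.
        loss = ppo_loss(agent, minibatch) / accumulation_steps
        loss.backward()
        if (i + 1) % accumulation_steps == 0:
            optimizer.step()
            optimizer.zero_grad()
```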
Thanks for replying! Actually, I have 4× 2080 Ti 11G. Can I give it a try? If possible, how should I modify the code?
I am not 100% sure, but I think you can give it a try. You will need some model/data/pipeline parallelism trick; using DeepSpeed might also be helpful. You would need to add these modules to the current code yourself.
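If you do try the 4× 2080 Ti setup, a minimal sketch (assuming the standard transformers/accelerate APIs; the model path is a placeholder) is to let accelerate shard the LLaMA weights across all four cards with a per-GPU memory cap, and to skip the later `.half().to(device)` call, since moving an accelerate-dispatched model raises exactly the error above:

```python
import torch
from transformers import LlamaForCausalLM

# Cap each card below its 11GB so activations still fit.
max_memory = {i: "10GiB" for i in range(4)}

model = LlamaForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-hf",   # placeholder path
    torch_dtype=torch.float16,
    device_map="auto",            # accelerate spreads layers over the 4 GPUs
    max_memory=max_memory,
)
# Do not call model.to(...) afterwards; inputs only need to be on the device
# of the first layer (model.device).
```

Note this only shards the frozen weights; fine-tuning still needs extra memory for activations and optimizer states, which is where DeepSpeed or a smaller batch size comes in.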