[P1] [Error] can not use bfloat16 and TypeError: Object of type type is not JSON serializable #102
Comments
@mrsempress hey, thanks for raising the issue. On the second problem: the root cause is probably the one identified in #70 -- tensorboard is not well integrated yet. As a result, you need to make sure to run your command with …
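For context, this TypeError typically appears when a Python class object (a `type`, rather than an instance or a string) ends up in a config dict that a logging backend tries to JSON-encode. A minimal sketch of the mechanism, not pyreft's actual code:

```python
import json

# A class object stored where JSON-encodable data was expected
# (hypothetical config; pyreft's real dict has different keys).
config = {"intervention_type": int}

try:
    json.dumps(config)
except TypeError as e:
    print(e)  # "Object of type type is not JSON serializable"
```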
@mrsempress on the first problem, could you make sure there is only 1 GPU visible on your machine (I know you ran with …)
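One way to verify GPU visibility from Python (a standard torch check, nothing pyreft-specific):

```python
import torch

# With CUDA_VISIBLE_DEVICES=0, only one device should be visible.
print(torch.cuda.device_count())      # expected: 1
print(torch.cuda.get_device_name(0))  # the single visible GPU
```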
Sorry, the first issue has been updated. The first issue is that bfloat16 cannot be used.
@mrsempress thanks. For issue 1), could you provide your running script? For issue 2), could you reproduce this error by running the notebook on Google Colab, and share the failing Colab with me? Thanks! These will help me root-cause the issues here.
For issue 1):

CUDA_VISIBLE_DEVICES=0 python examples/loreft/train.py -task gsm8k -model ../../models/Llama-7b-hf -seed 42 -l all -r 4 -p f7+l7 -e 12 -lr 9e-4 -type NodireftIntervention -gradient_accumulation_steps 4 -batch_size 8 -eval_batch_size 4 --dropout 0.05 --test_split validation --use_normalized_template --greedy_decoding --warmup_ratio 0.00 --weight_decay 0.06

For issue 2), as soon as I reproduce it successfully, I will update the link. Right now, when installing pyreft, Colab prompts "you must restart the runtime in order to use newly installed versions", which takes some time. I only used the original ipynb without modifying the code, so you can also try the experiment. I am not sure whether it is due to a machine environment issue.
@mrsempress Thanks. Could you explain more about the change? Did you change …

For issue 2), I attached my local notebook, which does not encounter this issue. Could you check the version of your …
@mrsempress minor: in terms of memory profile, you could check our publicly released logs on wandb. These are for our arithmetic benchmarks; the 7B experiments are run on a 40G A100. I also attached the Process GPU Memory Allocated (%) chart here. Please go to the logs and trace out the other details.
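If you want to compare your own run against that chart, here is a minimal sketch using torch's built-in counters (the training-step placeholder is yours to fill in):

```python
import torch

torch.cuda.reset_peak_memory_stats()

# ... run one training step here ...

# Peak memory allocated by tensors on the current device, in GiB.
peak_gib = torch.cuda.max_memory_allocated() / 1024**3
print(f"peak allocated: {peak_gib:.2f} GiB")
```

Note this counts only tensor allocations; nvidia-smi will report a somewhat higher figure that includes the CUDA context and allocator cache.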
I did not modify …
Thank you for your patient reply. I now understand the actual amount of memory required, but I need to find out whether it is because bfloat16 cannot be used, or whether there are other factors that make the memory usage so large when running the same command.
Hey! Yes, I think so. I am running with bf16, and that is probably the reason why my memory usage is lower.
@mrsempress what is your torch version? |
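A quick way to answer both questions at once (standard torch APIs; `is_bf16_supported` returns False on pre-Ampere GPUs or old builds):

```python
import torch

print(torch.__version__)               # e.g. 2.0.1
print(torch.cuda.is_available())       # must be True for GPU training
print(torch.cuda.is_bf16_supported())  # bf16 needs Ampere or newer
```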
@mrsempress hey, this is probably an env issue. To resolve it, maybe create a clean conda env and install packages in the same versions as I have. Here is the …

Please let me know if the problem still exists. Thanks.
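To diff your environment against the maintainer's package list, a small hypothetical helper (not part of pyreft):

```python
import importlib.metadata as md

# Print the installed version of each relevant package.
for pkg in ["torch", "transformers", "pyreft"]:
    try:
        print(pkg, md.version(pkg))
    except md.PackageNotFoundError:
        print(pkg, "not installed")
```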
Ok~ |
My torch version is 2.0.1.
Ok, I will try. |
After I updated the version to make bfloat16 available, the 7B experiments for the arithmetic tasks need 52574 MiB when the batch size is 8, but your experiment runs on a 40G A100. That is to say, besides bfloat16, there must be other factors that reduce memory.
@mrsempress this is expected, I think; I am using …
This is a screenshot of one of the publicly released run stats:
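As a rough back-of-the-envelope check (my arithmetic, not numbers from the thread), the precision setting alone moves weight memory by roughly 2x for a 7B model:

```python
# Illustrative only: memory for the frozen 7B base model's weights,
# ignoring activations, gradients, and optimizer state.
params = 7e9
for name, bytes_per_param in [("fp32", 4), ("bf16", 2)]:
    gib = params * bytes_per_param / 1024**3
    print(f"{name}: ~{gib:.0f} GiB")  # fp32: ~26 GiB, bf16: ~13 GiB
```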
So, should I use the command …
Sorry about the confusion, but I think nothing changes here: for hyperparameters, what matters is the effective batch size, which is the per-device batch size (bounded by the GPU memory) times the gradient accumulation steps (you can set whatever combination matches the effective batch size). I think the script you sent earlier and my settings both have an effective batch size of 32. Note that in the paper, we only report the effective batch size, not the per-device batch size. Hope these help. Thanks.
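To make the arithmetic concrete, using the flags from the training script quoted above:

```python
# Effective batch size = per-device batch size * gradient accumulation steps.
per_device_batch_size = 8        # -batch_size 8
gradient_accumulation_steps = 4  # -gradient_accumulation_steps 4
print(per_device_batch_size * gradient_accumulation_steps)  # 32

# A smaller per-device batch (to fit a 40G A100) with more accumulation
# steps keeps the same effective batch size: 4 * 8 == 32.
```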
Thank you for your reply. Now I understand how to set gradient_accumulation_steps and batch size.
For problem 1, I upgraded to torch 2.3.1 and it works.
Thanks for your wonderful model, but I have got some problems. I ran main_demo.ipynb, but got the error: … I found issue #69, but I am using the unmodified main_demo.ipynb, so it does not work for me.