Command to run train_flant5.py #643
Hi, these are the commands we use on an 8-GPU cluster. Please fill data_path with your data, e.g. playground/data/dummy.json. The script will preprocess the data and store it in preprocessed_path so that future runs can load it directly; you can also specify a path you like. Let me know if there are any issues:
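For illustration, a hypothetical 8-GPU launch of train_flant5.py might be assembled like this. Every flag name and value below is an assumption modeled on typical torchrun fine-tuning scripts, not the exact command from this thread; substitute your own paths and hyperparameters:

```python
# Hypothetical reconstruction of an 8-GPU training launch (all flags are
# assumptions for illustration, not the command posted in this thread).
cmd = [
    "torchrun", "--nproc_per_node=8",
    "fastchat/train/train_flant5.py",
    "--model_name_or_path", "google/flan-t5-xl",
    "--data_path", "playground/data/dummy.json",       # your training data
    "--preprocessed_path", "playground/data/preprocessed.json",
    "--output_dir", "./checkpoints_flant5",
    "--bf16", "True",
    "--fsdp", "full_shard auto_wrap",
    "--num_train_epochs", "3",
]
print(" ".join(cmd))
```

The preprocessed_path value is only a placeholder; any writable path works, and later runs will reuse the file stored there.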
Hi @DachengLi1, thanks for providing the commands for fine-tuning Flan-T5. I used the commands you provided and the training phase runs fine, but when it saves the model it hits a CUDA error. I found the comment "# potential bug for T5 model" in the code: FastChat/fastchat/train/train_flant5.py, line 79, in cad445e.
Is this an already-known issue when saving the model?
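A common workaround for CUDA errors at save time is to move every tensor in the gathered state dict to CPU before writing it to disk. Below is a minimal sketch of that pattern, not FastChat's exact code; the helper name and the tensor type are illustrative assumptions:

```python
# Sketch: copy a state dict with every tensor moved off the GPU before
# saving. This mirrors the usual fix for CUDA/OOM errors during
# checkpointing; the function name is an assumption for illustration.
def cpu_state_dict(state_dict):
    """Return a copy of state_dict with each tensor moved to CPU."""
    return {key: value.cpu() for key, value in state_dict.items()}
```

The resulting dict can then be handed to the trainer's save routine so the write happens entirely from host memory.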
@gary9630 I think this is likely due to a PyTorch FSDP bug that causes OOM when saving. Are you able to save intermediate checkpoints? (PyTorch will continue training even if the intermediate checkpoint save causes OOM, so you may need to actually watch a saving step to check this.) If that happens, it can be solved by the solution mentioned here. And thanks for checking the comment. There is another (smaller) issue with Flan-T5 + FSDP saving: after you resolve the OOM and can save the model correctly, use this function to clean up the weight path. Under the hood, FSDP seems to have trouble saving shared weights, and this function manually corrects them (the T5 encoder embedding, T5 decoder embedding, and shared embedding are actually the same tensor, but have three names). Then you should be able to load the model!
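The cleanup described above can be sketched on a plain state dict. The three key names follow Hugging Face's T5 naming convention; the actual FastChat helper may differ in details:

```python
# Sketch of the weight-path cleanup: T5's encoder embedding, decoder
# embedding, and shared embedding are one tied tensor saved under three
# names. FSDP can leave the three entries inconsistent, so rewrite the
# two per-module entries to refer back to the shared weight.
def fix_tied_t5_embeddings(state_dict):
    shared = state_dict["shared.weight"]
    state_dict["encoder.embed_tokens.weight"] = shared
    state_dict["decoder.embed_tokens.weight"] = shared
    return state_dict
```

As noted in the thread, copy the checkpoint first, since the fix rewrites weights in place.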
@DachengLi1 Thank you for such a fast response. Regarding the OOM issue, I already solved it with the method you provided when fine-tuning Vicuna a few days ago; it was really helpful and saved my day, and I appreciate all your hard work and this excellent GitHub community. I am not sure whether it is correct, but I modified the code:
to
I can get the model output successfully. But when I try to load the model, it fails with these error messages:
I am not sure whether this error is related to the second part you mentioned, where I have to clean the weight path. For the function you provided, I just need to give it the current checkpoint path and it will do the magic cleaning job for me, right? Again, thank you for the great work, I really enjoy it!
Nice to hear that! Yes, this is exactly the second issue. Can you try calling the function above? (You may want to copy the weights first in case anything unexpected happens; this function rewrites some of the weights in the path you provide.)
I can now successfully load the fine-tuned Flan-T5 model! Thank you @DachengLi1. For the record, I've done the following things:
Then you should be able to load the fine-tuned Flan-T5 model for inference.
@gary9630 Glad it helps! Thanks for the summary; we will probably redirect any other related issues to this solution. Closing this issue now, feel free to re-open it if you find other issues.
@DachengLi1 Could you send a PR and add some docs for using the T5 training scripts?
After cleaning the checkpoint, when I try to use the weights I get the following error: ValueError: Unrecognized configuration class <class 'transformers.models.t5.configuration_t5.T5Config'> for this kind of … Does anyone know how to solve it?
I guess it is because the folder name you used to save the model weights does not contain "t5". It should be OK once you rename the model path. For more details: I encountered this error before, and based on the code I traced for inference:
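The path-based dispatch behind this error can be sketched as follows. This is an assumption about how the loader picks the model class by substring-matching the checkpoint path, for illustration only, not FastChat's exact code:

```python
# Sketch: choose the model class from the checkpoint path. If "t5" does
# not appear in the path, a seq2seq checkpoint falls through to the
# causal-LM loader, which raises the "Unrecognized configuration class"
# ValueError seen above. (Hypothetical dispatch, not FastChat's code.)
def pick_model_class(model_path: str) -> str:
    if "t5" in model_path.lower():
        return "T5ForConditionalGeneration"
    return "AutoModelForCausalLM"
```

Renaming the checkpoint directory so it contains "t5" therefore routes loading to the seq2seq class and avoids the error.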
What are the arguments and command line for running the Flan-T5 fine-tuning code?