
Gradient issue #13

Open
TonyXuQAQ opened this issue Sep 15, 2023 · 9 comments

Comments

@TonyXuQAQ

Hi, after going through the training code, it seems that the gradient is not properly backpropagated. All calls to the projector layer mm_projector appear to be made within torch.no_grad (see call_1, call_2). If so, it means the projector layer is not trained at all, right? Is this a typo in the released code or an error?
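
To make the concern concrete, here is a minimal, self-contained sketch; only the name mm_projector comes from the repo, the shapes and data are illustrative:

```python
import torch
import torch.nn as nn

# Minimal sketch of the concern: a projector called under torch.no_grad
# records no autograd graph, so its weights receive no gradient.
# Only the name mm_projector comes from the repo; shapes are illustrative.
mm_projector = nn.Linear(1024, 4096)
features = torch.randn(2, 1024)

with torch.no_grad():                    # as in the current valley.py
    projected = mm_projector(features)

print(projected.requires_grad)           # False: no graph was built
# Any loss computed from `projected` cannot backpropagate into
# mm_projector; loss.backward() would raise
# "element 0 of tensors does not require grad and does not have a grad_fn".
```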

@RupertLuo
Owner

Can you share the error output and training configuration file?

@TonyXuQAQ
Author

There is no error output; I just used the raw code of this repo. What I mean is that the projector layer mm_projector does not seem to be trained in valley/model/valley.py: every call to mm_projector is wrapped in torch.no_grad, so the gradient is blocked and the projector is never updated.

@RupertLuo
Owner

[screenshot of train.py] In the file train.py, you can set whether the projector needs to be updated.
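
For readers following the thread: LLaVA-style training scripts typically gate projector updates by toggling requires_grad on its parameters. The sketch below illustrates the idea; the flag name and the attribute path are assumptions, not confirmed names from this repo:

```python
# Hypothetical sketch only: the flag name tune_mm_projector and the
# attribute path model.get_model().mm_projector are assumptions about
# what train.py exposes, not confirmed names from this repo.
def set_projector_trainable(model, tune_mm_projector: bool) -> None:
    for param in model.get_model().mm_projector.parameters():
        param.requires_grad = tune_mm_projector
```

Note that toggling requires_grad alone is not sufficient: if the forward pass runs under torch.no_grad, no autograd graph is recorded and the flag has no effect, which is what the next comment points out.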

@TonyXuQAQ
Author

But the projector calls are wrapped inside torch.no_grad, so the gradient cannot pass through the projector, i.e., the projector is not trained. And this layer is not used anywhere else, so I wonder how this projector was trained.
Screenshot from 2023-09-21 10-50-57

@feymanpriv
Collaborator

But the projector calls are wrapped inside torch.no_grad, so the gradient cannot pass through the projector, i.e., the projector is not trained. And this layer is not used anywhere else, so I wonder how this projector was trained. Screenshot from 2023-09-21 10-50-57

@TonyXuQAQ I find that the projector is not wrapped inside torch.no_grad in the original code of this repo, as shown here:
[screenshot] in
https://github.com/RupertLuo/Valley/blob/8da73a9551cd9ce520c47f7c3f508fdfc387f4f8/valley/model/valley.py.
I guess the "bug" was introduced while reorganizing the code, and the projector should sit outside the torch.no_grad block, since the released models were trained with the projector being tuned.
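
A rough sketch of the intended placement, per the linked original commit; the names vision_tower and mm_projector follow the repo's conventions, but the surrounding code is heavily simplified:

```python
import torch
import torch.nn as nn

# Sketch of the intended placement: only the frozen vision encoder runs
# under torch.no_grad, and the projector is called outside it so
# gradients can reach its weights. vision_tower here is a trivial
# stand-in for the frozen CLIP encoder.
vision_tower = nn.Linear(3 * 224 * 224, 1024).requires_grad_(False)
mm_projector = nn.Linear(1024, 4096)

images = torch.randn(2, 3 * 224 * 224)

with torch.no_grad():
    image_features = vision_tower(images)       # frozen encoder, no graph

image_features = mm_projector(image_features)   # trainable: graph starts here

image_features.sum().backward()
print(mm_projector.weight.grad is not None)     # True: projector gets gradients
```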

@TonyXuQAQ
Author

Thanks for the information.

During finetuning, I also noticed that the current version of the code cannot load VideoChat-instruct-11K correctly: LLaVA-instruct-150K's labels are organized as {"human": ..., "gpt": ...}, while VideoChat-instruct-11K's labels are organized as {"q": ..., "a": ...}. The two datasets have different label formats, but the code does no format conversion. I guess the label pre-processing code is missing.

I also don't know why, but starting from your llama-2-pretrain weights, I finetuned Valley on the above two datasets and the results are very bad. I will refer to the early commits of this repo for debugging.
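
For reference, the two label layouts described above look roughly like this; the field names come from this thread, and all other keys and values are illustrative:

```python
# LLaVA-instruct-150K style: a list of conversation turns.
llava_sample = {
    "conversations": [
        {"from": "human", "value": "What is happening in the video?"},
        {"from": "gpt", "value": "A person is skateboarding in a park."},
    ]
}

# VideoChat-instruct-11K style: flat question/answer fields.
videochat_sample = {
    "q": "What is happening in the video?",
    "a": "A person is skateboarding in a park.",
}
```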

@TonyXuQAQ
Author

So may I know which commit was used to train the provided valley-2-7b? I just want to reproduce the performance of the provided checkpoints.

@RupertLuo
Owner

Thanks for the information.

During finetuning, I also noticed that the current version of the code cannot load VideoChat-instruct-11K correctly: LLaVA-instruct-150K's labels are organized as {"human": ..., "gpt": ...}, while VideoChat-instruct-11K's labels are organized as {"q": ..., "a": ...}. The two datasets have different label formats, but the code does no format conversion. I guess the label pre-processing code is missing.

I also don't know why, but starting from your llama-2-pretrain weights, I finetuned Valley on the above two datasets and the results are very bad. I will refer to the early commits of this repo for debugging.

LLaVA-instruct-150K should load as-is. For VideoChat-instruct-11K, you need to convert its format to the LLaVA-instruct-150K format.
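
A minimal sketch of such a conversion, assuming the layouts described earlier in the thread; the helper name is hypothetical and the exact conversation schema is an assumption:

```python
# Hedged sketch of the conversion; the helper name is hypothetical and
# the target schema assumes LLaVA-style conversation turns.
def videochat_to_llava(sample: dict) -> dict:
    """Map a flat {'q': ..., 'a': ...} record onto the LLaVA
    conversation layout; carry over any extra fields (id, video, ...)
    the same way if they are present."""
    return {
        "conversations": [
            {"from": "human", "value": sample["q"]},
            {"from": "gpt", "value": sample["a"]},
        ]
    }
```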

@RupertLuo
Owner

So may I know which commit was used to train the provided valley-2-7b? I just want to reproduce the performance of the provided checkpoints.

Thank you for your continued attention to this project. I will sync the repository to a version of the code that trains correctly as soon as possible.
