Code bugs #18

Open
itongggg opened this issue Nov 10, 2024 · 4 comments

Comments

@itongggg

In your relora.py I found that for every ReLoRA layer, the B matrix is initialized as a zero matrix, which is the same as the standard LoRA setting. However, I also found the following:

[Screenshot 2024-11-10 09:00:28]

When you wrap a model as a ReLoRA model, the matrix A is also initialized as a zero matrix. Is this a typo?
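For context, here is a minimal sketch of the two initialization schemes under discussion (not the authors' actual code; the `keep_original_weights` flag and the `lora_A` / `lora_B` names come from this thread, everything else is assumed):

```python
import math
import torch
import torch.nn as nn
import torch.nn.functional as F

class SketchReLoRaLinear(nn.Module):
    """Illustrative sketch only: standard LoRA init vs. zeroing both factors."""

    def __init__(self, in_features, out_features, r, keep_original_weights=False):
        super().__init__()
        self.weight = nn.Parameter(torch.empty(out_features, in_features))
        nn.init.kaiming_uniform_(self.weight, a=math.sqrt(5))

        self.lora_A = nn.Linear(in_features, r, bias=False)
        self.lora_B = nn.Linear(r, out_features, bias=False)

        # Standard LoRA setting: B = 0 so that B @ A = 0 at the start,
        # while A keeps a non-zero (Kaiming) init so B can receive gradients.
        nn.init.kaiming_uniform_(self.lora_A.weight, a=math.sqrt(5))
        nn.init.zeros_(self.lora_B.weight)

        if keep_original_weights:
            # The wrapping code in the screenshot appears to also zero A,
            # which is what this issue questions.
            nn.init.zeros_(self.lora_A.weight)

    def forward(self, x):
        return F.linear(x, self.weight) + self.lora_B(self.lora_A(x))
```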

@ShuDun23

It seems they want the wrapped model to be exactly the same as the original one when keep_original_weights is set; otherwise lora_A.weight is initialized with Kaiming in ReLoRaLinear. But even so, B times A is still zero, so it seems to me to be not a typo but a redundancy?

@itongggg
Author

> It seems they want the wrapped model to be exactly the same as the original one when keep_original_weights is set; otherwise lora_A.weight is initialized with Kaiming in ReLoRaLinear. But even so, B times A is still zero, so it seems to me to be not a typo but a redundancy?

But if A and B are both initialized with zero weights, isn't the training process stuck? Since the gradient of $A$ equals $B^T \frac{\partial L}{\partial W}$ and the gradient of $B$ equals $\frac{\partial L}{\partial W} A^T$, in this case the gradients for A and B would be zero all the time.
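A quick autograd check illustrates the concern; this is only a sketch, not code from the repository:

```python
import torch

torch.manual_seed(0)
in_f, out_f, r = 8, 8, 2

W = torch.randn(out_f, in_f)                   # frozen original weight
A = torch.zeros(r, in_f, requires_grad=True)   # both LoRA factors zero-initialized
B = torch.zeros(out_f, r, requires_grad=True)

x = torch.randn(4, in_f)
loss = (x @ (W + B @ A).T).pow(2).sum()
loss.backward()

# dL/dA = B^T G and dL/dB = G A^T, so with A = B = 0 both gradients vanish.
print(A.grad.abs().max().item(), B.grad.abs().max().item())  # 0.0 0.0
```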

@ShuDun23

ShuDun23 commented Nov 15, 2024

Oh, even though both A and B are zero-initialized, as you mentioned, the updates will be slow at first due to the small gradients. However, the gradients are not zero because of the presence of the original W, so they can still be updated gradually. I think the authors might have intended this?

@itongggg
Author

> Oh, even though both A and B are zero-initialized, as you mentioned, the updates will be slow at first due to the small gradients. However, the gradients are not zero because of the presence of the original W, so they can still be updated gradually. I think the authors might have intended this?

As I mentioned before, the gradient of $A$ is $B^T G$ and the gradient of $B$ is $G A^T$, where $G$ is the gradient of $W$. So if you initialize both A and B to zero, the parameters of A and B would never be updated.
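For comparison, under the standard LoRA initialization (A Kaiming, B zero), $\frac{\partial L}{\partial B} = G A^T$ is generally non-zero, so B is updated from the first step and A starts moving once B becomes non-zero. A small variation of the check above (again just a sketch) shows this:

```python
import torch

torch.manual_seed(0)
in_f, out_f, r = 8, 8, 2

W = torch.randn(out_f, in_f)
A = torch.empty(r, in_f, requires_grad=True)
torch.nn.init.kaiming_uniform_(A, a=5 ** 0.5)   # standard LoRA: A is non-zero
B = torch.zeros(out_f, r, requires_grad=True)   # B = 0, so B @ A is still 0

x = torch.randn(4, in_f)
loss = (x @ (W + B @ A).T).pow(2).sum()
loss.backward()

# A.grad is zero (because B = 0), but B.grad is non-zero, so training can start.
print(A.grad.abs().max().item(), B.grad.abs().max().item())
```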
