Hello,

I'm working with the CLIP model and would like to train the image and text encoders using different loss functions (i.e., `image_total_loss` for the image encoder and `text_total_loss` for the text encoder, instead of the combined `total_loss`).
To achieve this, I plan to use two separate optimizers, one for each encoder, so each optimizer updates its respective encoder based on its specific loss function.
The challenge I'm facing is:
I can't find a way to distinguish between the parameters of the image encoder and the text encoder within the CLIP model. When I inspect the model's parameters, they all seem to be part of a single collection, and there's no clear separation.
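(For reference, this is roughly what the parameter layout looks like when CLIP is loaded through the Hugging Face transformers `CLIPModel`; the `vision_model` / `text_model` attribute names below are specific to that implementation, and other CLIP codebases expose different submodules such as `model.visual`.)

```python
from transformers import CLIPModel

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")

# Parameter names are prefixed by the submodule they belong to, so the two
# towers can already be told apart by name.
for name, _ in model.named_parameters():
    print(name)  # e.g. "vision_model.encoder.layers.0..." vs. "text_model.embeddings..."

# Collect each tower (plus its projection head) explicitly.
image_params = list(model.vision_model.parameters()) + list(model.visual_projection.parameters())
text_params = list(model.text_model.parameters()) + list(model.text_projection.parameters())
```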
My questions are:

1. Is there a way to separately access the parameters of the image encoder and the text encoder in the CLIP model?
2. How can I set up two optimizers so that each one tracks only the parameters of its encoder?
3. Is there another way to train the CLIP encoders with different loss functions?
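For what it's worth, here is one possible way to wire this up. It is only a sketch: it continues from the `image_params` / `text_params` groups above, assumes the losses are computed from the pooled embeddings returned by `CLIPModel`, and uses `image_loss_fn` / `text_loss_fn` as hypothetical placeholders for whatever `image_total_loss` and `text_total_loss` you have in mind. The key idea is to detach the other tower's features inside each loss, so gradients from `image_total_loss` never reach the text encoder and vice versa; with that in place, each optimizer only ever sees gradients produced by "its" loss.

```python
import torch

# One optimizer per tower, each owning only that tower's parameters.
opt_img = torch.optim.AdamW(image_params, lr=1e-5)
opt_txt = torch.optim.AdamW(text_params, lr=1e-5)

def training_step(batch):
    # `batch` is assumed to hold the usual CLIPModel inputs
    # (pixel_values, input_ids, attention_mask).
    out = model(**batch)
    img_feat, txt_feat = out.image_embeds, out.text_embeds

    # The image loss treats the text features as constants (detached), so its
    # gradient only flows into the image tower; the text loss is symmetric.
    image_total_loss = image_loss_fn(img_feat, txt_feat.detach())  # hypothetical loss
    text_total_loss = text_loss_fn(img_feat.detach(), txt_feat)    # hypothetical loss

    opt_img.zero_grad()
    opt_txt.zero_grad()
    # A single backward pass is enough: the detach() calls keep the two
    # gradient paths separated, so each optimizer steps on "its" loss only.
    (image_total_loss + text_total_loss).backward()
    opt_img.step()
    opt_txt.step()
```

Without the `detach()` calls, a contrastive-style loss on the image side would also push gradients into the text encoder, and you would have to clear those gradients manually between two separate `backward()` calls; detaching keeps the bookkeeping trivial.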
Thank you!