🐛 Bug

If you run CLIPScore between an image and a caption where the caption has more than 77 tokens (longer than the maximum text sequence length CLIP can process), the resulting CLIP score is far too low: in my test it dropped from about 26 to 16.
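For reference, a minimal sketch of the kind of call that triggers this, assuming torchmetrics' functional clip_score API and a standard CLIP checkpoint (the random image and the repeated caption are placeholders, not the actual pair that produced the 26 → 16 drop):

import torch
from torchmetrics.functional.multimodal import clip_score

image = torch.randint(255, (3, 224, 224), generator=torch.manual_seed(42))
short_caption = "a photo of a cat sitting on a wooden table"
# repeating a phrase pushes the caption well past 77 tokens
long_caption = short_caption + ", surrounded by colourful mugs and plates," * 10

print(clip_score(image, short_caption, model_name_or_path="openai/clip-vit-base-patch16"))
# with a real image-caption pair, the second call emits the truncation warning
# and returns a much lower score than the untruncated caption would justify
print(clip_score(image, long_caption, model_name_or_path="openai/clip-vit-base-patch16"))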
Expected behavior

I dug into this issue and found that the last token should be kept when a caption is truncated to 77 tokens: CLIP's text encoder pools its text embedding from the final (EOS) token position, so cutting the sequence off without preserving that token degrades the text features. Here is my modified implementation and result.
# Modified copy of _clip_score_update from torchmetrics.functional.multimodal.clip_score
from typing import List, Tuple, Union

import torch
from torch import Tensor
from torchmetrics.utilities import rank_zero_warn
from transformers import CLIPModel as _CLIPModel
from transformers import CLIPProcessor as _CLIPProcessor


def _clip_score_update(
    images: Union[Tensor, List[Tensor]],
    text: Union[str, List[str]],
    model: _CLIPModel,
    processor: _CLIPProcessor,
) -> Tuple[Tensor, int]:
    if not isinstance(images, list):
        if images.ndim == 3:
            images = [images]
    else:  # unwrap into list
        images = list(images)

    if not all(i.ndim == 3 for i in images):
        raise ValueError("Expected all images to be 3d but found image that has either more or less")

    if not isinstance(text, list):
        text = [text]

    if len(text) != len(images):
        raise ValueError(
            f"Expected the number of images and text examples to be the same but got {len(images)} and {len(text)}"
        )

    device = images[0].device
    processed_input = processor(text=text, images=[i.cpu() for i in images], return_tensors="pt", padding=True)

    img_features = model.get_image_features(processed_input["pixel_values"].to(device))
    img_features = img_features / img_features.norm(p=2, dim=-1, keepdim=True)

    max_position_embeddings = model.config.text_config.max_position_embeddings
    if processed_input["attention_mask"].shape[-1] > max_position_embeddings:
        rank_zero_warn(
            f"Encountered caption longer than {max_position_embeddings=}. Will truncate captions to this length."
            "If longer captions are needed, initialize argument `model_name_or_path` with a model that supports"
            "longer sequences",
            UserWarning,
        )
        # fix: keep the last token when truncating, so the EOS token that CLIP pools
        # the text embedding from is not cut off
        mask = torch.arange(processed_input["attention_mask"].shape[-1]) < max_position_embeddings - 1
        mask[-1] = True
        processed_input["attention_mask"] = processed_input["attention_mask"][..., mask]
        processed_input["input_ids"] = processed_input["input_ids"][..., mask]

    txt_features = model.get_text_features(
        processed_input["input_ids"].to(device), processed_input["attention_mask"].to(device)
    )
    txt_features = txt_features / txt_features.norm(p=2, dim=-1, keepdim=True)

    # cosine similarity between feature vectors
    score = 100 * (img_features * txt_features).sum(axis=-1)
    return score, len(text)
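For completeness, a small usage sketch that exercises the modified function above directly; the checkpoint name and the caption are assumptions for illustration:

from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch16")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch16")

image = torch.randint(255, (3, 224, 224), generator=torch.manual_seed(42))
caption = "a photo of a cat sitting on a wooden table, " * 10  # tokenizes to more than 77 tokens

score, n_samples = _clip_score_update(image, caption, model, processor)
print(score)  # the truncation warning is still emitted, but the EOS token is now preserved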