🐛 Bug

If you run CLIPScore between an image and a caption where the caption has more than 77 tokens (longer than the maximum text sequence length CLIP can process), the resulting CLIP score is far too low: in my test it dropped from about 26 to 16.
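For reference, a minimal sketch of the kind of call that triggers this, assuming torchmetrics' functional clip_score API and a standard CLIP checkpoint (the random image and the repeated caption are placeholders, not the actual pair that produced the 26 → 16 drop):

import torch
from torchmetrics.functional.multimodal import clip_score

image = torch.randint(255, (3, 224, 224), generator=torch.manual_seed(42))
short_caption = "a photo of a cat sitting on a wooden table"
# repeating a phrase pushes the caption well past 77 tokens
long_caption = short_caption + ", surrounded by colourful mugs and plates," * 10

print(clip_score(image, short_caption, model_name_or_path="openai/clip-vit-base-patch16"))
# with a real image-caption pair, the second call emits the truncation warning
# and returns a much lower score than the untruncated caption would justify
print(clip_score(image, long_caption, model_name_or_path="openai/clip-vit-base-patch16"))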
Expected behavior

I dug into this issue and found that the last token should be kept when a caption is truncated to 77 tokens: CLIP's text encoder pools its text embedding from the final (EOS) token position, so cutting the sequence off without preserving that token degrades the text features. Here is my modified implementation and result.
# Modified copy of _clip_score_update from torchmetrics.functional.multimodal.clip_score
from typing import List, Tuple, Union

import torch
from torch import Tensor
from torchmetrics.utilities import rank_zero_warn
from transformers import CLIPModel as _CLIPModel
from transformers import CLIPProcessor as _CLIPProcessor


def _clip_score_update(
    images: Union[Tensor, List[Tensor]],
    text: Union[str, List[str]],
    model: _CLIPModel,
    processor: _CLIPProcessor,
) -> Tuple[Tensor, int]:
    if not isinstance(images, list):
        if images.ndim == 3:
            images = [images]
    else:  # unwrap into list
        images = list(images)

    if not all(i.ndim == 3 for i in images):
        raise ValueError("Expected all images to be 3d but found image that has either more or less")

    if not isinstance(text, list):
        text = [text]

    if len(text) != len(images):
        raise ValueError(
            f"Expected the number of images and text examples to be the same but got {len(images)} and {len(text)}"
        )

    device = images[0].device
    processed_input = processor(text=text, images=[i.cpu() for i in images], return_tensors="pt", padding=True)

    img_features = model.get_image_features(processed_input["pixel_values"].to(device))
    img_features = img_features / img_features.norm(p=2, dim=-1, keepdim=True)

    max_position_embeddings = model.config.text_config.max_position_embeddings
    if processed_input["attention_mask"].shape[-1] > max_position_embeddings:
        rank_zero_warn(
            f"Encountered caption longer than {max_position_embeddings=}. Will truncate captions to this length."
            "If longer captions are needed, initialize argument `model_name_or_path` with a model that supports"
            "longer sequences",
            UserWarning,
        )
        # fix: keep the last token when truncating, so the EOS token that CLIP pools
        # the text embedding from is not cut off
        mask = torch.arange(processed_input["attention_mask"].shape[-1]) < max_position_embeddings - 1
        mask[-1] = True
        processed_input["attention_mask"] = processed_input["attention_mask"][..., mask]
        processed_input["input_ids"] = processed_input["input_ids"][..., mask]

    txt_features = model.get_text_features(
        processed_input["input_ids"].to(device), processed_input["attention_mask"].to(device)
    )
    txt_features = txt_features / txt_features.norm(p=2, dim=-1, keepdim=True)

    # cosine similarity between feature vectors
    score = 100 * (img_features * txt_features).sum(axis=-1)
    return score, len(text)
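For completeness, a small usage sketch that exercises the modified function above directly; the checkpoint name and the caption are assumptions for illustration:

from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch16")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch16")

image = torch.randint(255, (3, 224, 224), generator=torch.manual_seed(42))
caption = "a photo of a cat sitting on a wooden table, " * 10  # tokenizes to more than 77 tokens

score, n_samples = _clip_score_update(image, caption, model, processor)
print(score)  # the truncation warning is still emitted, but the EOS token is now preserved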