
CLIPScore is incorrect on captions with more than 77 tokens #2883

Open
chenjy2003 opened this issue Dec 25, 2024 · 2 comments
Labels
bug / fix Something isn't working help wanted Extra attention is needed v1.6.x

Comments

@chenjy2003

🐛 Bug

If you run CLIPScore between an image and a caption where the caption has more than 77 tokens (longer than the maximum sequence length CLIP can process), the CLIP score comes out much too low (dropping from about 26 to 16 in the example below).

To Reproduce

Code sample
import torch
from torch import randint
from torchmetrics.multimodal.clip_score import CLIPScore
torch.manual_seed(0)
image = randint(255, (3, 224, 224))
metric = CLIPScore(model_name_or_path="openai/clip-vit-base-patch16")
print(metric(image, "x " * 74).item()) # 27.398828506469727
print(metric(image, "x " * 75).item()) # 26.158306121826172
print(metric(image, "x " * 76).item()) # 16.974281311035156

Expected behavior

I dug into this issue and observed that when a caption longer than 77 tokens is truncated, the last token should be kept rather than simply cutting the sequence off: CLIP's text encoder pools its output from the EOS token position, so dropping the EOS token distorts the text embedding. Here is my modified implementation and result.

# Modified copy of torchmetrics.functional.multimodal.clip_score._clip_score_update
from typing import List, Tuple, Union

import torch
from torch import Tensor
from torchmetrics.utilities import rank_zero_warn
from transformers import CLIPModel as _CLIPModel, CLIPProcessor as _CLIPProcessor

def _clip_score_update(
    images: Union[Tensor, List[Tensor]],
    text: Union[str, List[str]],
    model: _CLIPModel,
    processor: _CLIPProcessor,
) -> Tuple[Tensor, int]:
    if not isinstance(images, list):
        if images.ndim == 3:
            images = [images]
    else:  # unwrap into list
        images = list(images)

    if not all(i.ndim == 3 for i in images):
        raise ValueError("Expected all images to be 3d but found image that has either more or less")

    if not isinstance(text, list):
        text = [text]

    if len(text) != len(images):
        raise ValueError(
            f"Expected the number of images and text examples to be the same but got {len(images)} and {len(text)}"
        )
    device = images[0].device
    processed_input = processor(text=text, images=[i.cpu() for i in images], return_tensors="pt", padding=True)

    img_features = model.get_image_features(processed_input["pixel_values"].to(device))
    img_features = img_features / img_features.norm(p=2, dim=-1, keepdim=True)

    max_position_embeddings = model.config.text_config.max_position_embeddings
    if processed_input["attention_mask"].shape[-1] > max_position_embeddings:
        rank_zero_warn(
            f"Encountered caption longer than {max_position_embeddings=}. Will truncate captions to this length."
            "If longer captions are needed, initialize argument `model_name_or_path` with a model that supports"
            "longer sequences",
            UserWarning,
        )
        # fix: keep the last token
        mask = torch.arange(processed_input["attention_mask"].shape[-1]) < max_position_embeddings - 1
        mask[-1] = True
        processed_input["attention_mask"] = processed_input["attention_mask"][..., mask]
        processed_input["input_ids"] = processed_input["input_ids"][..., mask]

    txt_features = model.get_text_features(
        processed_input["input_ids"].to(device), processed_input["attention_mask"].to(device)
    )
    txt_features = txt_features / txt_features.norm(p=2, dim=-1, keepdim=True)

    # cosine similarity between feature vectors
    score = 100 * (img_features * txt_features).sum(axis=-1)
    return score, len(text)
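For completeness, CustomCLIPScore below is just a thin wrapper I use to run the patched update; a minimal sketch (assuming CLIPScore's existing score / n_samples states, which is how the stock metric accumulates):

class CustomCLIPScore(CLIPScore):
    # Same metric, but update() routes through the patched _clip_score_update above.
    def update(self, images, text):
        score, n_samples = _clip_score_update(images, text, self.model, self.processor)
        self.score += score.sum(0)
        self.n_samples += n_samples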
custom_metric = CustomCLIPScore(model_name_or_path="openai/clip-vit-base-patch16")
print(custom_metric(image, "x " * 74).item()) # 27.398828506469727
print(custom_metric(image, "x " * 75).item()) # 26.158306121826172
print(custom_metric(image, "x " * 76).item()) # 26.158306121826172

Environment

  • TorchMetrics version (if build from source, add commit SHA): 1.6.0
  • Python & PyTorch Version (e.g., 1.0): Python 3.10.14, PyTorch 2.4.1+cu121
  • Any other relevant information such as OS (e.g., Linux): Linux

Additional context

@chenjy2003 chenjy2003 added bug / fix Something isn't working help wanted Extra attention is needed labels Dec 25, 2024

Hi! Thanks for your contribution, great first issue!

@rittik9
Contributor

rittik9 commented Jan 5, 2025

I think this problem is present for all models except model_name_or_path = "openai/clip-vit-base-patch32"
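If it helps to verify this across checkpoints, the text-side limit can be read straight from the model config; a small sketch (assumes transformers is installed, and only prints the configured limit rather than the scores):

from transformers import CLIPConfig
for name in ("openai/clip-vit-base-patch16", "openai/clip-vit-base-patch32"):
    cfg = CLIPConfig.from_pretrained(name)
    print(name, cfg.text_config.max_position_embeddings)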

@Borda Borda added the v1.6.x label Jan 7, 2025