clip-eval

clip-eval is a tool for evaluating CLIP models on various image classification and image-text retrieval tasks in Japanese.

Installation

rye sync
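This project is managed with Rye; rye sync creates the project-local virtual environment and installs the dependencies. The evaluation scripts can then be run inside that environment with rye run, for example (assuming the scripts expose the standard argparse --help flag):

rye run python src/clip_eval/eval.py --help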

Usage

Zero-shot image classification tasks

Evaluate CLIP on the imagenet-1k dataset:

python src/clip_eval/eval.py --model openai/clip-vit-base-patch16 --dataset imagenet-1k

The output JSON file (results/imagenet-1k/openai-clip-vit-base-patch16.json) looks like this:

{
    "top1": 4.1579999999999995,
    "top5": 8.816,
    "top10": 11.584,
    "top100": 30.296
}
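For intuition, here is a minimal sketch of how zero-shot CLIP classification generally works with the Hugging Face transformers API; the class names, prompt template, and image path below are placeholders, not necessarily what eval.py uses:

import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch16")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch16")

class_names = ["cat", "dog", "car"]  # placeholder classes, not the real label set
prompts = [f"a photo of a {name}" for name in class_names]
image = Image.open("example.jpg")  # placeholder image path

inputs = processor(text=prompts, images=image, return_tensors="pt", padding=True)
with torch.no_grad():
    logits = model(**inputs).logits_per_image  # (1, num_classes)
pred = logits.argmax(dim=-1).item()
print(class_names[pred])

The reported top-k accuracy is then the fraction of images whose true class appears among the k highest-scoring prompts.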

When evaluating on the recruit dataset, you can use the --subcategory option to evaluate on a specific subcategory:

python src/clip_eval/eval.py --model openai/clip-vit-base-patch16 --dataset recruit --subcategory "jafacility20"

Zero-shot image-to-text and text-to-image retrieval tasks

Evaluate CLIP on the crossmodal3600 dataset:

python src/clip_eval/eval.py --model openai/clip-vit-base-patch16 --dataset crossmodal3600
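The standard retrieval metric is recall@k: the fraction of queries whose matching counterpart appears among the top-k most similar candidates. A minimal sketch (not necessarily how eval.py implements it):

import torch

def recall_at_k(image_emb, text_emb, k):
    # image_emb, text_emb: (n, d); row i on both sides forms a matching pair.
    image_emb = torch.nn.functional.normalize(image_emb, dim=1)
    text_emb = torch.nn.functional.normalize(text_emb, dim=1)
    sim = image_emb @ text_emb.T           # (n, n) cosine similarities
    topk = sim.topk(k, dim=1).indices      # image-to-text: top-k texts per image
    targets = torch.arange(sim.size(0)).unsqueeze(1)
    return (topk == targets).any(dim=1).float().mean().item()

Text-to-image retrieval is the same computation with sim transposed.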

Embedding Analysis

You can calculate the similarity matrix of the embeddings (only the first (image, text) pair per class is used):

python src/clip_eval/embedding_analysis.py --model line-corporation/clip-japanese-base --dataset recruit --batch_size 16

The generated image looks like this:

[Figure: similarity matrix of embeddings]

This matrix is calculated as follows:

import torch

text_embeddings   # (num_classes, embedding_dim)
image_embeddings  # (num_classes, embedding_dim)
embeddings = torch.cat([text_embeddings, image_embeddings], dim=0)  # (2*num_classes, embedding_dim)
normalized_embeddings = torch.nn.functional.normalize(embeddings, dim=1)  # (2*num_classes, embedding_dim)
similarity_matrix = normalized_embeddings @ normalized_embeddings.T  # (2*num_classes, 2*num_classes)

So, the top-left submatrix is the similarity matrix of the text embeddings, the bottom-right submatrix is the similarity matrix of the image embeddings, and the top-right and bottom-left submatrices contain the similarities between text and image embeddings.
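Concretely, the four quadrants can be sliced out of the matrix like this (continuing the snippet above):

n = text_embeddings.size(0)              # num_classes
text_text = similarity_matrix[:n, :n]    # top-left: text vs. text
image_image = similarity_matrix[n:, n:]  # bottom-right: image vs. image
text_image = similarity_matrix[:n, n:]   # top-right: text vs. image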

You can also visualize the embeddings using t-SNE (only the first 10 classes are used):

python src/clip_eval/tsne_plot.py --model line-corporation/clip-japanese-base --dataset_name cifar10 --batch_size 16

The generated image looks like this:

[Figure: t-SNE plot of image embeddings]
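For reference, a minimal sketch of such a plot with scikit-learn's TSNE; the random embeddings and labels below are placeholders for the real image embeddings, and tsne_plot.py may use different t-SNE settings:

import matplotlib.pyplot as plt
import numpy as np
from sklearn.manifold import TSNE

rng = np.random.default_rng(0)
embeddings = rng.standard_normal((100, 512))  # placeholder image embeddings
labels = np.arange(100) % 10                  # placeholder labels for 10 classes

points = TSNE(n_components=2, perplexity=30).fit_transform(embeddings)  # (100, 2)
plt.scatter(points[:, 0], points[:, 1], c=labels, cmap="tab10")
plt.savefig("tsne.png")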

Supported Models

Supported Datasets

Image Classification

  • imagenet-1k: ImageNet-1k image classification dataset
  • recruit: Japanese-culture related image classification dataset
  • cifar100: CIFAR-100 image classification dataset
  • cifar10: CIFAR-10 image classification dataset
  • food101: Food-101 image classification dataset
  • caltech101: Caltech-101 image classification dataset

Image-Text Retrieval

  • crossmodal3600: Crossmodal-3600 image-text retrieval dataset

Acknowledgement

We would like to acknowledge the following codebases that served as the foundation for our work:

We would also like to express our gratitude to the dataset and model developers.
