clip-eval is a tool for evaluating CLIP models on various image classification and image-text retrieval tasks in Japanese.
Install the dependencies with Rye:

rye sync
Evaluate CLIP on the imagenet-1k dataset:
python src/clip_eval/eval.py --model openai/clip-vit-base-patch16 --dataset imagenet-1k
The output JSON file (results/imagenet-1k/openai-clip-vit-base-patch16.json) looks like this:
{
"top1": 4.1579999999999995,
"top5": 8.816,
"top10": 11.584,
"top100": 30.296
}
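For reference, the per-model result files can be compared programmatically. The following is a minimal sketch that assumes the results/&lt;dataset&gt;/&lt;model&gt;.json layout shown above:

```python
import json
from pathlib import Path

# Print top-1 and top-5 accuracy for every model evaluated on imagenet-1k.
# Assumes the results/<dataset>/<model>.json layout shown above.
results_dir = Path("results/imagenet-1k")
for path in sorted(results_dir.glob("*.json")):
    with path.open() as f:
        metrics = json.load(f)
    print(f"{path.stem}: top1={metrics['top1']:.2f}, top5={metrics['top5']:.2f}")
```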
When evaluating on the recruit dataset, you can specify the --subcategory option to evaluate on a specific subcategory:
python src/clip_eval/eval.py --model openai/clip-vit-base-patch16 --dataset recruit --subcategory "jafacility20"
Evaluate CLIP on the crossmodal3600 dataset:
python src/clip_eval/eval.py --model openai/clip-vit-base-patch16 --dataset crossmodal3600
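For intuition, image-text retrieval scores of this kind can be computed from paired image and text embeddings roughly as in the sketch below. This uses random dummy tensors and text-to-image recall@k; it is not eval.py's actual implementation:

```python
import torch

# Dummy paired embeddings: row i of image_embeddings is the image paired with text i.
num_pairs, dim = 100, 512
image_embeddings = torch.nn.functional.normalize(torch.randn(num_pairs, dim), dim=1)
text_embeddings = torch.nn.functional.normalize(torch.randn(num_pairs, dim), dim=1)

# Text-to-image retrieval: rank all images by cosine similarity to each text query.
similarity = text_embeddings @ image_embeddings.T    # (num_pairs, num_pairs)
ranks = similarity.argsort(dim=1, descending=True)   # image indices, best match first

for k in (1, 5, 10):
    # A query counts as a hit if its paired image (same index) appears in the top-k results.
    hits = (ranks[:, :k] == torch.arange(num_pairs).unsqueeze(1)).any(dim=1)
    print(f"recall@{k}: {hits.float().mean().item():.3f}")
```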
You can calculate the similarity matrix of the embeddings (only the first (image, text) pair per class is used):
python src/clip_eval/embedding_analysis.py --model line-corporation/clip-japanese-base --dataset recruit --batch_size 16
The generated image shows a similarity matrix of the embeddings. This matrix is calculated by:
import torch

# text_embeddings:  (num_classes, embedding_dim) text embedding of each class
# image_embeddings: (num_classes, embedding_dim) image embedding of each class
embeddings = torch.cat([text_embeddings, image_embeddings], dim=0)  # (2*num_classes, embedding_dim)
normalized_embeddings = torch.nn.functional.normalize(embeddings, dim=1)  # (2*num_classes, embedding_dim)
similarity_matrix = normalized_embeddings @ normalized_embeddings.T  # (2*num_classes, 2*num_classes)
The top-left submatrix is therefore the similarity matrix of the text embeddings, the bottom-right submatrix is the similarity matrix of the image embeddings, and the top-right and bottom-left submatrices contain the similarities between text and image embeddings.
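A heatmap of this matrix can be rendered with matplotlib, for example. The following is a minimal sketch using random dummy embeddings in place of real model outputs; embedding_analysis.py may draw the figure differently:

```python
import torch
import matplotlib.pyplot as plt

# Dummy embeddings standing in for the model outputs (one text and one image embedding per class).
num_classes, embedding_dim = 25, 512
text_embeddings = torch.randn(num_classes, embedding_dim)
image_embeddings = torch.randn(num_classes, embedding_dim)

# Same computation as above: stack, normalize, and take pairwise cosine similarities.
embeddings = torch.cat([text_embeddings, image_embeddings], dim=0)
normalized = torch.nn.functional.normalize(embeddings, dim=1)
similarity_matrix = normalized @ normalized.T  # (2*num_classes, 2*num_classes)

plt.imshow(similarity_matrix.numpy(), cmap="viridis")
plt.colorbar(label="cosine similarity")
plt.title("Similarity matrix of text and image embeddings")
plt.savefig("similarity_matrix.png")
```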
You can also visualize the embeddings using t-SNE (only the first 10 classes are used):
python src/clip_eval/tsne_plot.py --model line-corporation/clip-japanese-base --dataset_name cifar10 --batch_size 16
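Conceptually, such a plot projects the image embeddings down to two dimensions, for example with scikit-learn's TSNE. The following is a minimal sketch using random dummy embeddings; it is not tsne_plot.py's actual implementation:

```python
import torch
from sklearn.manifold import TSNE
import matplotlib.pyplot as plt

# Dummy image embeddings: 10 classes with 50 samples each,
# standing in for the embeddings produced by the CLIP image encoder.
num_classes, samples_per_class, embedding_dim = 10, 50, 512
embeddings = torch.randn(num_classes * samples_per_class, embedding_dim)
labels = torch.arange(num_classes).repeat_interleave(samples_per_class)

# Project the high-dimensional embeddings down to 2D and color points by class.
projected = TSNE(n_components=2, random_state=0).fit_transform(embeddings.numpy())
plt.scatter(projected[:, 0], projected[:, 1], c=labels.numpy(), cmap="tab10", s=5)
plt.title("t-SNE plot of image embeddings")
plt.savefig("tsne_plot.png")
```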
The generated image is a t-SNE plot of the image embeddings.

The following models can be specified with the --model option:

- line-corporation/clip-japanese-base
- rinna/japanese-cloob-vit-b-16
- rinna/japanese-clip-vit-b-16
- hf-hub:laion/CLIP-ViT-H-14-frozen-xlm-roberta-large-laion5B-s13B-b90k
- stabilityai/japanese-stable-clip-vit-l-16
- openai/clip-vit-base-patch16
- openai/clip-vit-large-patch14
- jinaai/jina-clip-v2
- google/siglip-base-patch16-256-multilingual
The following datasets can be specified with the --dataset option:

- imagenet-1k: ImageNet-1k image classification dataset
- recruit: Japanese-culture-related image classification dataset
- cifar100: CIFAR-100 image classification dataset
- cifar10: CIFAR-10 image classification dataset
- food101: Food-101 image classification dataset
- caltech101: Caltech-101 image classification dataset
- crossmodal3600: Cross-modal image-text retrieval dataset
We would like to acknowledge the following codebases that served as the foundation for our work:
We would also like to express our gratitude to the dataset and model developers.