clip-eval is a tool for evaluating CLIP models on various image classification and image-text retrieval tasks in Japanese.
Install the dependencies with Rye:

rye sync
Evaluate CLIP on the imagenet-1k dataset:
python src/clip_eval/eval.py --model openai/clip-vit-base-patch16 --dataset imagenet-1k
The output JSON file (results/imagenet-1k/openai-clip-vit-base-patch16.json) looks like this:
{
"top1": 4.1579999999999995,
"top5": 8.816,
"top10": 11.584,
"top100": 30.296
}
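For reference, the per-model result files can be compared programmatically. The following is a minimal sketch that assumes the results/&lt;dataset&gt;/&lt;model&gt;.json layout shown above:

```python
import json
from pathlib import Path

# Print top-1 and top-5 accuracy for every model evaluated on imagenet-1k.
# Assumes the results/<dataset>/<model>.json layout shown above.
results_dir = Path("results/imagenet-1k")
for path in sorted(results_dir.glob("*.json")):
    with path.open() as f:
        metrics = json.load(f)
    print(f"{path.stem}: top1={metrics['top1']:.2f}, top5={metrics['top5']:.2f}")
```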
When evaluating on the recruit dataset, you can specify the --subcategory option to evaluate on a specific subcategory:
python src/clip_eval/eval.py --model openai/clip-vit-base-patch16 --dataset recruit --subcategory "jafacility20"
Evaluate CLIP on the crossmodal3600 dataset:
python src/clip_eval/eval.py --model openai/clip-vit-base-patch16 --dataset crossmodal3600
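For intuition, image-text retrieval scores of this kind can be computed from paired image and text embeddings roughly as in the sketch below. This uses random dummy tensors and text-to-image recall@k; it is not eval.py's actual implementation:

```python
import torch

# Dummy paired embeddings: row i of image_embeddings is the image paired with text i.
num_pairs, dim = 100, 512
image_embeddings = torch.nn.functional.normalize(torch.randn(num_pairs, dim), dim=1)
text_embeddings = torch.nn.functional.normalize(torch.randn(num_pairs, dim), dim=1)

# Text-to-image retrieval: rank all images by cosine similarity to each text query.
similarity = text_embeddings @ image_embeddings.T    # (num_pairs, num_pairs)
ranks = similarity.argsort(dim=1, descending=True)   # image indices, best match first

for k in (1, 5, 10):
    # A query counts as a hit if its paired image (same index) appears in the top-k results.
    hits = (ranks[:, :k] == torch.arange(num_pairs).unsqueeze(1)).any(dim=1)
    print(f"recall@{k}: {hits.float().mean().item():.3f}")
```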
You can calculate the similarity matrix of the embeddings (only the first (image, text) pair per class is used):
python src/clip_eval/embedding_analysis.py --model line-corporation/clip-japanese-base --dataset recruit --batch_size 16
The generated image shows a similarity matrix of the embeddings. This matrix is calculated by:
import torch

# text_embeddings:  (num_classes, embedding_dim) text embedding of each class
# image_embeddings: (num_classes, embedding_dim) image embedding of each class
embeddings = torch.cat([text_embeddings, image_embeddings], dim=0)  # (2*num_classes, embedding_dim)
normalized_embeddings = torch.nn.functional.normalize(embeddings, dim=1)  # (2*num_classes, embedding_dim)
similarity_matrix = normalized_embeddings @ normalized_embeddings.T  # (2*num_classes, 2*num_classes)
The top-left submatrix is therefore the similarity matrix of the text embeddings, the bottom-right submatrix is the similarity matrix of the image embeddings, and the top-right and bottom-left submatrices contain the similarities between text and image embeddings.
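A heatmap of this matrix can be rendered with matplotlib, for example. The following is a minimal sketch using random dummy embeddings in place of real model outputs; embedding_analysis.py may draw the figure differently:

```python
import torch
import matplotlib.pyplot as plt

# Dummy embeddings standing in for the model outputs (one text and one image embedding per class).
num_classes, embedding_dim = 25, 512
text_embeddings = torch.randn(num_classes, embedding_dim)
image_embeddings = torch.randn(num_classes, embedding_dim)

# Same computation as above: stack, normalize, and take pairwise cosine similarities.
embeddings = torch.cat([text_embeddings, image_embeddings], dim=0)
normalized = torch.nn.functional.normalize(embeddings, dim=1)
similarity_matrix = normalized @ normalized.T  # (2*num_classes, 2*num_classes)

plt.imshow(similarity_matrix.numpy(), cmap="viridis")
plt.colorbar(label="cosine similarity")
plt.title("Similarity matrix of text and image embeddings")
plt.savefig("similarity_matrix.png")
```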
You can also visualize the embeddings using t-SNE (only the first 10 classes are used):
python src/clip_eval/tsne_plot.py --model line-corporation/clip-japanese-base --dataset_name cifar10 --batch_size 16
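Conceptually, such a plot projects the image embeddings down to two dimensions, for example with scikit-learn's TSNE. The following is a minimal sketch using random dummy embeddings; it is not tsne_plot.py's actual implementation:

```python
import torch
from sklearn.manifold import TSNE
import matplotlib.pyplot as plt

# Dummy image embeddings: 10 classes with 50 samples each,
# standing in for the embeddings produced by the CLIP image encoder.
num_classes, samples_per_class, embedding_dim = 10, 50, 512
embeddings = torch.randn(num_classes * samples_per_class, embedding_dim)
labels = torch.arange(num_classes).repeat_interleave(samples_per_class)

# Project the high-dimensional embeddings down to 2D and color points by class.
projected = TSNE(n_components=2, random_state=0).fit_transform(embeddings.numpy())
plt.scatter(projected[:, 0], projected[:, 1], c=labels.numpy(), cmap="tab10", s=5)
plt.title("t-SNE plot of image embeddings")
plt.savefig("tsne_plot.png")
```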
The generated image is a t-SNE plot of the image embeddings.

The following models can be specified with the --model option:

- line-corporation/clip-japanese-base
- rinna/japanese-cloob-vit-b-16
- rinna/japanese-clip-vit-b-16
- hf-hub:laion/CLIP-ViT-H-14-frozen-xlm-roberta-large-laion5B-s13B-b90k
- stabilityai/japanese-stable-clip-vit-l-16
- openai/clip-vit-base-patch16
- openai/clip-vit-large-patch14
- jinaai/jina-clip-v2
- google/siglip-base-patch16-256-multilingual
The following datasets can be specified with the --dataset option:

- imagenet-1k: ImageNet-1k image classification dataset
- recruit: Japanese-culture-related image classification dataset
- cifar100: CIFAR-100 image classification dataset
- cifar10: CIFAR-10 image classification dataset
- food101: Food-101 image classification dataset
- caltech101: Caltech-101 image classification dataset
- crossmodal3600: Cross-modal image-text retrieval dataset
We would like to acknowledge the following codebases that served as the foundation for our work:
We would also like to express our gratitude to the dataset and model developers.