This project implements the Vision Transformer (ViT) architecture from the paper *An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale* by Alexey Dosovitskiy et al. ViT applies the transformer architecture, originally developed for natural language processing, to image recognition.
- Transformer Architecture: The ViT model uses a standard transformer encoder, treating an image as a sequence of patches and encoding those patches with the transformer encoder.
- Image Patch Embedding: The input image is split into fixed-size patches, which are linearly embedded and serve as the input sequence for the transformer encoder.
- Position Embeddings: Learnable position embeddings are added to the patch embeddings to retain positional information (see the sketch after this list).
- Pre-training on Large Datasets: The ViT model can be pre-trained on large datasets such as ImageNet and then fine-tuned on downstream tasks. (I'm using the CIFAR-10 dataset for training.)
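
The following is a minimal sketch of how patch embedding, the learnable position embeddings, and the transformer encoder fit together, written with PyTorch. The class names, layer sizes, and hyperparameters below are illustrative assumptions and do not necessarily match the modules defined in this repository:

```python
import torch
import torch.nn as nn

class PatchEmbedding(nn.Module):
    """Split an image into fixed-size patches and linearly embed each patch."""
    def __init__(self, img_size=32, patch_size=4, in_channels=3, embed_dim=192):
        super().__init__()
        self.num_patches = (img_size // patch_size) ** 2
        # A strided convolution is equivalent to cutting patches and applying a linear projection.
        self.proj = nn.Conv2d(in_channels, embed_dim, kernel_size=patch_size, stride=patch_size)

    def forward(self, x):
        x = self.proj(x)                      # (B, embed_dim, H/P, W/P)
        return x.flatten(2).transpose(1, 2)   # (B, num_patches, embed_dim)

class SimpleViT(nn.Module):
    def __init__(self, img_size=32, patch_size=4, embed_dim=192,
                 depth=6, num_heads=3, num_classes=10):
        super().__init__()
        self.patch_embed = PatchEmbedding(img_size, patch_size, embed_dim=embed_dim)
        self.cls_token = nn.Parameter(torch.zeros(1, 1, embed_dim))
        # Learnable position embeddings: one per patch plus one for the [CLS] token.
        self.pos_embed = nn.Parameter(torch.zeros(1, self.patch_embed.num_patches + 1, embed_dim))
        encoder_layer = nn.TransformerEncoderLayer(
            d_model=embed_dim, nhead=num_heads, dim_feedforward=4 * embed_dim,
            batch_first=True, norm_first=True)  # ViT uses pre-norm transformer blocks
        self.encoder = nn.TransformerEncoder(encoder_layer, num_layers=depth)
        self.head = nn.Linear(embed_dim, num_classes)

    def forward(self, x):
        x = self.patch_embed(x)
        cls = self.cls_token.expand(x.size(0), -1, -1)
        x = torch.cat([cls, x], dim=1) + self.pos_embed
        x = self.encoder(x)
        return self.head(x[:, 0])  # classify from the [CLS] token

model = SimpleViT()
logits = model(torch.randn(2, 3, 32, 32))   # (2, 10) class logits
```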
To install the necessary dependencies, run:
pip install -r requirements.txt
To train the model, execute the following command:
python train.py
This implementation uses the CIFAR-10 dataset, which consists of 60,000 32x32 colour images in 10 classes, with 6,000 images per class.
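
For reference, here is a minimal sketch of loading CIFAR-10 with torchvision and running one training pass, reusing the `SimpleViT` sketch above. The transforms, batch size, and optimizer settings are illustrative assumptions and may differ from what `train.py` actually does:

```python
import torch
import torchvision
import torchvision.transforms as T
from torch.utils.data import DataLoader

transform = T.Compose([
    T.ToTensor(),
    # Commonly used per-channel CIFAR-10 statistics for normalization.
    T.Normalize((0.4914, 0.4822, 0.4465), (0.2470, 0.2435, 0.2616)),
])

train_set = torchvision.datasets.CIFAR10(root="./data", train=True,
                                         download=True, transform=transform)
train_loader = DataLoader(train_set, batch_size=128, shuffle=True, num_workers=2)

model = SimpleViT()                       # the illustrative model sketched above
optimizer = torch.optim.AdamW(model.parameters(), lr=3e-4)
criterion = torch.nn.CrossEntropyLoss()

model.train()
for images, labels in train_loader:       # one epoch over CIFAR-10
    optimizer.zero_grad()
    loss = criterion(model(images), labels)
    loss.backward()
    optimizer.step()
```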