This project aims to compare the embedding spaces of three distinct modalities:
- Images (using a Google Vision Transformer)
- Text (image labels, using the BERT language model)
- EEG recordings (from human subjects viewing the images, using EEGNet)
We explore how different modalities represent the same entities. Specifically, we investigate:
- Whether modalities have inherent biases toward certain entities.
- How brain representations (EEG) differ from those generated by computers (image/text embeddings).
- Image Embeddings: Extracted using the pre-trained Google Vision Transformer (a minimal extraction sketch follows this list).
- Text Embeddings: Derived from the BERT language model applied to the corresponding object labels (see the same sketch below).
- EEG Embeddings: Extracted with EEGNet from EEG recordings of subjects viewing the images. EEGNet is a compact CNN originally designed for EEG-based brain-computer interfaces; we fine-tuned it and removed its classification layer so that it serves as an embedding extractor.
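The image and text embeddings can be obtained with the Hugging Face `transformers` library. The sketch below is a minimal version of this step; the specific checkpoints (`google/vit-base-patch16-224-in21k`, `bert-base-uncased`) and the use of the [CLS] token as the pooled embedding are assumptions for illustration, not necessarily the exact configuration used in the project.

```python
# Minimal sketch: one fixed-size embedding per image (ViT) and per label (BERT).
# Checkpoint names and [CLS] pooling are assumptions, not the project's verified setup.
import torch
from PIL import Image
from transformers import ViTImageProcessor, ViTModel, BertTokenizer, BertModel

vit_processor = ViTImageProcessor.from_pretrained("google/vit-base-patch16-224-in21k")
vit = ViTModel.from_pretrained("google/vit-base-patch16-224-in21k").eval()

bert_tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
bert = BertModel.from_pretrained("bert-base-uncased").eval()

@torch.no_grad()
def image_embedding(path: str) -> torch.Tensor:
    """Return the ViT [CLS] token as the image embedding."""
    inputs = vit_processor(images=Image.open(path).convert("RGB"), return_tensors="pt")
    return vit(**inputs).last_hidden_state[:, 0, :].squeeze(0)   # shape: (768,)

@torch.no_grad()
def text_embedding(label: str) -> torch.Tensor:
    """Return the BERT [CLS] token as the label embedding."""
    inputs = bert_tokenizer(label, return_tensors="pt")
    return bert(**inputs).last_hidden_state[:, 0, :].squeeze(0)  # shape: (768,)
```

Pooling the [CLS] token gives one fixed-size vector per image or label; mean-pooling over tokens is a common alternative.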
Extracting the EEG embeddings was the primary challenge, given our limited expertise in EEG signal processing and the limited time available. We initially planned to use NeuroGPT, but due to its complexity we switched to EEGNet and fine-tuned it to suit our needs; a sketch of the resulting embedding extractor follows.
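To illustrate how EEGNet can serve as an embedding extractor, here is a self-contained PyTorch sketch of an EEGNet-style network whose classification head is used only during fine-tuning and bypassed when reading out embeddings. The channel count, window length, filter sizes, and class count are assumptions chosen for illustration, not the project's actual configuration.

```python
# EEGNet-style embedding extractor (sketch). Hyperparameters are illustrative
# (64 channels, 128 samples, F1=8, D=2, F2=16), not the project's exact setup.
import torch
import torch.nn as nn

class EEGNetEmbedder(nn.Module):
    def __init__(self, n_channels=64, n_samples=128, F1=8, D=2, F2=16, n_classes=40):
        super().__init__()
        self.features = nn.Sequential(
            # Block 1: temporal convolution, then depthwise spatial convolution
            nn.Conv2d(1, F1, (1, 64), padding=(0, 32), bias=False),
            nn.BatchNorm2d(F1),
            nn.Conv2d(F1, F1 * D, (n_channels, 1), groups=F1, bias=False),
            nn.BatchNorm2d(F1 * D),
            nn.ELU(),
            nn.AvgPool2d((1, 4)),
            nn.Dropout(0.5),
            # Block 2: separable convolution (depthwise + pointwise)
            nn.Conv2d(F1 * D, F1 * D, (1, 16), padding=(0, 8), groups=F1 * D, bias=False),
            nn.Conv2d(F1 * D, F2, (1, 1), bias=False),
            nn.BatchNorm2d(F2),
            nn.ELU(),
            nn.AvgPool2d((1, 8)),
            nn.Dropout(0.5),
            nn.Flatten(),
        )
        # Classification head: used for fine-tuning, dropped when extracting embeddings.
        embed_dim = self._embedding_dim(n_channels, n_samples)
        self.classifier = nn.Linear(embed_dim, n_classes)

    def _embedding_dim(self, n_channels, n_samples):
        with torch.no_grad():
            return self.features(torch.zeros(1, 1, n_channels, n_samples)).shape[1]

    def forward(self, x, return_embedding=False):
        z = self.features(x)                  # x: (batch, 1, channels, samples)
        return z if return_embedding else self.classifier(z)

# After fine-tuning on the image-viewing EEG data, read out embeddings
# by skipping the classifier:
model = EEGNetEmbedder()
eeg_batch = torch.randn(8, 1, 64, 128)        # dummy batch of EEG epochs
embeddings = model(eeg_batch, return_embedding=True)
print(embeddings.shape)                        # e.g. torch.Size([8, 64])
```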
We used three methods to compare the embedding spaces (illustrative sketches follow this list):
- Clustering: To explore the hierarchical representation of entities across modalities.
- Distance Comparison: To assess the similarity between representations across different modalities.
- Cross-Composition: To determine if a linear mapping exists between the embedding spaces.
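To make the first two methods concrete, the sketch below clusters entities hierarchically within each modality and compares modalities through their pairwise-distance structure (a representational-similarity-style comparison). The array shapes, random placeholder data, and the choice of cosine distance with Spearman correlation are assumptions for illustration only.

```python
# Clustering and distance comparison (sketch). Assumes one embedding per entity
# per modality, with rows aligned across modalities; data here is random placeholder.
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import pdist, squareform
from scipy.stats import spearmanr

rng = np.random.default_rng(0)
image_emb = rng.normal(size=(40, 768))   # placeholder for ViT embeddings
text_emb = rng.normal(size=(40, 768))    # placeholder for BERT embeddings
eeg_emb = rng.normal(size=(40, 64))      # placeholder for EEGNet embeddings

def hierarchy(embeddings, n_clusters=5):
    """Agglomerative clustering of entities on cosine distances."""
    Z = linkage(embeddings, method="average", metric="cosine")
    return fcluster(Z, t=n_clusters, criterion="maxclust")

def rdm(embeddings):
    """Representational dissimilarity matrix: pairwise cosine distances."""
    return squareform(pdist(embeddings, metric="cosine"))

def rdm_similarity(a, b):
    """Spearman correlation between the upper triangles of two RDMs."""
    iu = np.triu_indices_from(a, k=1)
    rho, _ = spearmanr(a[iu], b[iu])
    return rho

print("image clusters:", hierarchy(image_emb))
print("image vs text RDM corr:", rdm_similarity(rdm(image_emb), rdm(text_emb)))
print("image vs EEG  RDM corr:", rdm_similarity(rdm(image_emb), rdm(eeg_emb)))
```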
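For cross-composition, one simple reading is a least-squares linear map from one embedding space to another, evaluated by how well mapped vectors retrieve their true counterparts. The sketch below assumes that interpretation; the train/test split, placeholder data, and retrieval metric are illustrative, not the project's exact protocol.

```python
# Cross-composition (sketch): fit target ~= source @ W by least squares, then
# score top-1 retrieval of held-out entities. Data and split are placeholders.
import numpy as np

rng = np.random.default_rng(1)
source = rng.normal(size=(40, 768))   # e.g. ViT image embeddings
target = rng.normal(size=(40, 64))    # e.g. EEGNet EEG embeddings

train, test = np.arange(30), np.arange(30, 40)

# Least-squares fit of the linear map on the training split.
W, *_ = np.linalg.lstsq(source[train], target[train], rcond=None)

def normalize(x):
    return x / np.linalg.norm(x, axis=1, keepdims=True)

# Does each mapped test vector land nearest (cosine) to its own target?
mapped = source[test] @ W
sims = normalize(mapped) @ normalize(target[test]).T
accuracy = np.mean(np.argmax(sims, axis=1) == np.arange(len(test)))
print(f"top-1 retrieval accuracy: {accuracy:.2f}")
```

A high retrieval accuracy would suggest that a linear mapping between the two spaces exists; chance level here is 1 / number of held-out entities.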