This repository contains the winning solution for Sensorium 2023, part of the NeurIPS 2023 competition track. The competition aimed to find the best model for predicting the activity of neurons in the primary visual cortex of mice in response to videos. It introduced a temporal component by using dynamic stimuli (videos) instead of the static stimuli (images) used in Sensorium 2022, making the task more challenging.
The primary metric of the competition was single-trial correlation. You can read about the metric, the data, and the task in the competition paper [^1]. It is important to note that additional data for five mice was introduced during the competition, which doubled the dataset's size (old and new data).
Key points:
- DwiseNeuro - a novel model architecture for predicting neural activity in the mouse primary visual cortex.
- Solid cross-validation strategy with splitting folds by perceptual video hash.
- Training on all mice with an option to fill unlabeled samples via distillation.
During the competition, I dedicated most of my time to designing the model architecture since it significantly impacted the solution's outcome compared to other components. I iteratively tested various computer vision and deep learning techniques, integrating them into the architecture as the correlation metric improved.
The diagram below illustrates the final architecture, which I named DwiseNeuro:
DwiseNeuro consists of three main parts: core, cortex, and readouts. The core consumes sequences of video frames and mouse behavior activity in separate channels, processing temporal and spatial features. The produced features are aggregated by global pooling over the spatial dimensions. The cortex processes the pooled features independently for each timestep, significantly increasing the number of channels. Finally, each readout predicts the activation of neurons for the corresponding mouse.
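The following sketch (not the repository code) illustrates how the three parts could be wired together; `Core`, `Cortex`, and the per-mouse readouts are placeholder modules here, and the `(batch, channels, time, height, width)` tensor layout is an assumption:

```python
import torch
from torch import nn


class DwiseNeuroSketch(nn.Module):
    def __init__(self, core: nn.Module, cortex: nn.Module, readouts: nn.ModuleDict):
        super().__init__()
        self.core = core          # shared 3D conv core over video + behavior channels
        self.cortex = cortex      # shared 1D conv stack over (channels, time)
        self.readouts = readouts  # one readout per mouse

    def forward(self, x: torch.Tensor, mouse: str) -> torch.Tensor:
        features = self.core(x)                 # (B, C, T, H, W)
        features = features.mean(dim=(-2, -1))  # global pooling over spatial dims -> (B, C, T)
        features = self.cortex(features)        # widen channels per timestep
        return self.readouts[mouse](features)   # (B, num_neurons, T)
```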
In the following sections, we will delve deeper into each part of the architecture.
The first layer of the core is the stem: a point-wise 3D convolution that increases the number of channels, followed by batch normalization.
The rest of the core consists of inverted residual blocks [^2][^3] with a narrow -> wide -> narrow channel structure.
Several methods were added to the inverted residual block, which was rewritten with 3D layers:
- Absolute Position Encoding [^4] - adding the encoding to the input of each block allows convolutions to accumulate position information. This is quite important because of the subsequent spatial pooling after the core.
- Factorized (2+1)D convolution [^5] - the 3D depth-wise convolution was replaced with a spatial 2D depth-wise convolution followed by a temporal 1D depth-wise convolution. Some blocks use spatial convolutions with stride two to compress the output size.
- Shortcut Connections - completely parameter-free residual shortcuts with three operations:
  - Identity mapping if input and output dimensions are equal. It's the same as the connection proposed in ResNet [^6].
  - Nearest interpolation in case of different spatial sizes.
  - Cyclic repetition of channels if they don't match.
- Squeeze-and-Excitation [^7] - dynamic channel-wise feature recalibration.
- DropPath (Stochastic Depth) [^8][^9] - regularization that randomly drops the block's main path for each sample in the batch.
Batch normalization is applied after each layer, including the shortcut. SiLU activation is used after expansion and depth-wise convolutions.
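As an illustration, here is a minimal sketch of the parameter-free shortcut described above, assuming tensors shaped `(batch, channels, time, height, width)`; the repository implementation may differ in details. The result is then added to the output of the block's main path.

```python
import torch
import torch.nn.functional as F


def parameter_free_shortcut(x: torch.Tensor, out_channels: int,
                            out_spatial: tuple[int, int]) -> torch.Tensor:
    # Identity mapping when dimensions already match (nothing below changes the tensor)
    # Nearest interpolation in case of different spatial sizes
    if x.shape[-2:] != out_spatial:
        x = F.interpolate(x, size=(x.shape[2], *out_spatial), mode="nearest")
    # Cyclic repetition of channels if they don't match
    if x.shape[1] != out_channels:
        repeats = -(-out_channels // x.shape[1])  # ceil division
        x = x.repeat(1, repeats, 1, 1, 1)[:, :out_channels]
    return x
```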
I found that the number of core blocks and their parameters dramatically affect the outcome. It's possible to tune the channels, strides, expansion ratio, and spatial/temporal kernel sizes. Obviously, it is almost impossible to start experiments with optimal values. This problem is mentioned in the EfficientNet [^3] paper, which concluded that it is essential to carefully balance model width, depth, and resolution.
After conducting a lot of experiments, I chose the following parameters (summarized in code after the list):
- Four blocks with 64 output channels, three with 128, and two with 256.
- Three blocks have stride two. They are the first blocks in each subgroup from the point above.
- The expansion ratio of the inverted residual block is six.
- The kernel of the spatial depth-wise convolution is (1, 3, 3).
- The kernel of the temporal depth-wise convolution is (5, 1, 1).
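An illustrative summary of this configuration in code form (not the actual config format used in the repository):

```python
# (output_channels, spatial_stride) for each inverted residual block of the core;
# the first block of each subgroup has stride two
CORE_BLOCKS = (
    [(64, 2)] + [(64, 1)] * 3
    + [(128, 2)] + [(128, 1)] * 2
    + [(256, 2)] + [(256, 1)]
)
EXPANSION_RATIO = 6
SPATIAL_KERNEL = (1, 3, 3)   # depth-wise spatial convolution
TEMPORAL_KERNEL = (5, 1, 1)  # depth-wise temporal convolution
```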
Compared with related works [^10][^11], the model architecture includes a new part called the cortex. Like the core, it is shared across all mice. The cortex receives features with channel and temporal dimensions only: spatial information was accumulated through the position encoding applied earlier in the core and compressed by average pooling after the core. The primary purpose of the cortex is to smoothly increase the number of channels, which the readouts will further use.
The building element of the module is a grouped 1D convolution followed by the channel shuffle operation [^12]. Batch normalization, SiLU activation, and shortcut connections with stochastic depth were applied similarly to the core.
Hyperparameters of the cortex were also important:
- Convolution with two groups and kernel size one (a bigger kernel size over the temporal dimension did not lead to better results).
- Three layers with 1024, 2048, and 4096 channels.
As you can see, the number of channels is quite large. Groups help optimize computation and memory efficiency. The channel shuffle operation allows information to be shared between the groups of different layers.
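A minimal sketch of one cortex layer as described above, assuming features shaped `(batch, channels, time)`; batch normalization, SiLU, and the shortcut with stochastic depth are omitted for brevity:

```python
import torch
from torch import nn


def channel_shuffle(x: torch.Tensor, groups: int) -> torch.Tensor:
    # Interleave channels so that the next grouped convolution mixes information across groups
    batch, channels, time = x.shape
    x = x.view(batch, groups, channels // groups, time)
    return x.transpose(1, 2).reshape(batch, channels, time)


class CortexLayerSketch(nn.Module):
    def __init__(self, in_channels: int, out_channels: int, groups: int = 2):
        super().__init__()
        self.conv = nn.Conv1d(in_channels, out_channels, kernel_size=1, groups=groups)
        self.groups = groups

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return channel_shuffle(self.conv(x), self.groups)
```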
The readout is a single 1D convolution with two groups and kernel size one, followed by Softplus activation. Each of the ten mice has its own readout with the number of output channels equal to the number of neurons (7863, 7908, 8202, 7939, 8122, 7440, 7928, 8285, 7671, and 7495, respectively).
Keeping the response positive by using Softplus [^10] was essential in my pipeline. It works much better than ELU + 1 [^11], especially when the Softplus beta parameter is tuned. In my case, the optimal beta value was about 0.07, which resulted in a 0.018 increase in the correlation metric. You can see a comparison of ELU + 1 and Softplus in the plot below:
I also conducted an experiment where the beta parameter was trainable. Interestingly, the trained value converged approximately to the optimal, which I found by grid search. I omitted the learnable Softplus from the solution because it resulted in a slightly worse score. But this may be an excellent way to quickly and automatically find a good beta.
Here's a numerically stable implementation of learnable Softplus in PyTorch:
```python
import torch
from torch import nn


class LearnableSoftplus(nn.Module):
    def __init__(self, beta: float):
        super().__init__()
        self.beta = nn.Parameter(torch.tensor(float(beta)))

    def forward(self, x):
        # softplus(x) = log(1 + exp(beta * x)) / beta, computed stably via
        # log(1 + exp(z)) = max(z, 0) + log1p(exp(-|z|))
        xb = x * self.beta
        return (torch.clamp(xb, 0) + torch.minimum(xb, -xb).exp().log1p()) / self.beta
```
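For example, it can replace the final activation of the readout (the shapes here are illustrative):

```python
softplus = LearnableSoftplus(beta=0.07)
responses = softplus(torch.randn(32, 8122, 16))  # (batch, neurons, time), non-negative output
```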
At the end of the competition, I employed 7-fold cross-validation to check hypotheses and tune hyperparameters more precisely. I used all available labeled data to make folds. Random splitting gave an overly optimistic metric estimation because some videos were duplicated (e.g., in the original validation split or between old and new datasets). To solve this issue, I created group folds with non-overlapping videos. Similar videos were found using perceptual hashes of several frames fetched deterministically.
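For illustration, the grouping could look roughly like this; the `imagehash` library and the frame-sampling policy here are assumptions rather than the exact competition code:

```python
import numpy as np
import imagehash
from PIL import Image


def video_group_hash(video: np.ndarray, num_frames: int = 4) -> str:
    # video: (T, H, W) uint8 grayscale frames; pick a few frames deterministically
    indices = np.linspace(0, len(video) - 1, num_frames).astype(int)
    hashes = [str(imagehash.phash(Image.fromarray(video[i]))) for i in indices]
    return "".join(hashes)
```

Videos that share a group hash are then kept in the same fold, e.g., with scikit-learn's `GroupKFold`.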
Basic training (config)
The training was performed in two stages. The first stage is basic training with the following pipeline parameters:
- Learning rate warmup from 0 to 2.4e-03 over the first three epochs, then cosine annealing down to 2.4e-05 over the last 18 epochs (sketched in code after this list)
- Batch size 32; one training epoch comprises 72000 samples
- Optimizer AdamW [^13] with weight decay 0.05
- Poisson loss
- Model EMA with decay 0.999
- CutMix [^14] with alpha 1.0 and usage probability 0.5
- Mice are sampled uniformly at random within each batch
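As referenced in the list above, here is a sketch of the optimizer, schedule, and loss using standard PyTorch components; the per-iteration stepping and exact wiring are assumptions, and the authoritative values live in the training configs:

```python
import torch
from torch import nn, optim
from torch.optim.lr_scheduler import CosineAnnealingLR, LinearLR, SequentialLR

model = nn.Linear(16, 16)  # placeholder for DwiseNeuro
optimizer = optim.AdamW(model.parameters(), lr=2.4e-3, weight_decay=0.05)

steps_per_epoch = 72000 // 32  # samples per epoch / batch size
scheduler = SequentialLR(
    optimizer,
    schedulers=[
        # Warmup from (almost) zero to the base learning rate over three epochs
        LinearLR(optimizer, start_factor=1e-8, end_factor=1.0, total_iters=3 * steps_per_epoch),
        # Cosine annealing down to 2.4e-05 over the remaining 18 epochs
        CosineAnnealingLR(optimizer, T_max=18 * steps_per_epoch, eta_min=2.4e-5),
    ],
    milestones=[3 * steps_per_epoch],
)

# Poisson loss on the non-negative predicted responses
criterion = nn.PoissonNLLLoss(log_input=False)
```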
Each dataset sample consists of a grayscale video, behavior activity measurements (pupil center, pupil dilation, and running speed), and the neuron responses of a single mouse. All data is provided at 30 FPS. During training, the model consumes 16 frames, skipping every second frame (equivalent to 16 neighboring frames at 15 FPS). The video frames were zero-padded to 64x64 pixels. The behavior activities were added as separate channels: each behavior measurement fills an entire channel with its value. No normalization is applied to the target and input tensors during training.
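A sketch of how an input clip could be assembled, assuming a `(channels, time, height, width)` layout with the grayscale video in the first channel and each behavior measurement broadcast over its own constant-valued channel; the exact channel count and padding placement are assumptions:

```python
import numpy as np


def make_input(video: np.ndarray, behavior: np.ndarray, size: int = 64) -> np.ndarray:
    # video: (T, H, W) grayscale frames with H, W <= size; behavior: (T, num_measurements)
    t, h, w = video.shape
    frames = np.zeros((t, size, size), dtype=np.float32)
    frames[:, :h, :w] = video  # zero-pad the frames to size x size
    channels = [frames]
    for i in range(behavior.shape[1]):
        # Fill an entire channel with the measurement value at each timestep
        channels.append(np.broadcast_to(behavior[:, i, None, None], (t, size, size)))
    return np.stack(channels, axis=0).astype(np.float32)  # (1 + num_measurements, T, H, W)
```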
The ensemble of models from all folds achieves a 0.2905 single-trial correlation on the main track and 0.2207 on the bonus track in the final phase of the competition. This result alone would have been enough to take first place in both the main and the bonus (out-of-distribution) tracks.
Knowledge Distillation (config)
For an individual sample in the batch, the loss was calculated on the responses of only one mouse because the input tensor is associated with a single mouse trial and there is no neural activity data for the other mice. However, the model can predict responses for all mice from any input tensor. In the second stage of training, I used a method similar to knowledge distillation [^15]: I created a pipeline where models from the first stage predict the unlabeled responses during training. As a result, the second-stage models trained all their readouts on every batch sample. The loss on the distilled predictions was weighted to be 0.36% of the overall loss.
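A rough sketch of how the second-stage loss could combine the labeled mouse with distilled targets from a frozen first-stage teacher; the dictionary interface and the weighting scheme here are illustrative assumptions rather than the exact pipeline:

```python
import torch
from torch import nn

poisson = nn.PoissonNLLLoss(log_input=False)


def second_stage_loss(student_preds: dict[str, torch.Tensor],
                      teacher_preds: dict[str, torch.Tensor],
                      labeled_mouse: str, target: torch.Tensor,
                      distill_weight: float) -> torch.Tensor:
    # Supervised term for the single mouse that has recorded responses in this sample
    loss = poisson(student_preds[labeled_mouse], target)
    # Distillation terms: match the teacher's predictions for the remaining readouts
    for mouse, teacher_pred in teacher_preds.items():
        if mouse != labeled_mouse:
            loss = loss + distill_weight * poisson(student_preds[mouse], teacher_pred)
    return loss
```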
The hyperparameters were identical, except for the expansion ratio in inverted residual blocks: seven in the first stage and six in the second.
In the second stage, the ensemble of models achieves nearly the same single-trial correlation as the ensemble from the first stage. However, what is fascinating is that each fold model outperforms the corresponding first-stage model by 0.007 on average. Thus, the distilled model works like an ensemble of undistilled models. According to [^16], an individual model is forced to learn the ensemble's performance during knowledge distillation, and an ensemble of distilled models offers no further performance boost. I observed the same behavior in my solution.
Distillation can be great practice if you need a single good model, but in ensembles it leads to only minor changes in performance.
The ensembles were produced by taking the arithmetic mean of predictions from multiple steps (the first step is sketched after this list):
- Overlapping a sliding window over each possible sequence of frames.
- Models from cross-validations of one training stage.
- Training stages (optional).
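The first averaging step could look roughly like this; a sketch assuming the model maps a window of frames to per-frame responses, not the actual inference code:

```python
import torch


def sliding_window_predict(model, clip: torch.Tensor, window: int = 16) -> torch.Tensor:
    # clip: (C, T, H, W); returns responses averaged over all overlapping windows, shape (neurons, T)
    num_frames = clip.shape[1]
    total, counts = None, torch.zeros(num_frames)
    for start in range(num_frames - window + 1):
        pred = model(clip[:, start:start + window].unsqueeze(0))[0]  # (neurons, window)
        if total is None:
            total = torch.zeros(pred.shape[0], num_frames)
        total[:, start:start + window] += pred
        counts[start:start + window] += 1
    return total / counts
```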
I used the same model weights for both competition tracks. The competition only evaluated the responses of five mice from the new dataset, so I only predicted those.
You can see the progress of solution development during the competition in spreadsheets (the document consists of multiple sheets). Unfortunately, the document contains less than half of the conducted experiments because sometimes I was too lazy to fill it :) However, if you need a complete chronology of valuable changes, you can see it in git history.
To summarize, an early model with depth-wise 3D convolution blocks achieved a score of around 0.19 on the main track during the live phase of the competition. Subsequently, implementing the techniques from the core section, tuning hyperparameters, and training on all available data boosted the score to 0.25. Applying the non-standard normalization expected by the evaluation servers during postprocessing improved the score to 0.27. The cortex and CutMix increased the score to 0.276. Then, tuning the beta value of Softplus resulted in a score of 0.294. Lastly, adjusting the drop rate and batch size parameters helped to achieve a score of 0.3 on the main track during the live phase.
The ensemble of the basic and distillation training stages achieved a single-trial correlation of 0.2913 on the main track and 0.2215 on the bonus track in the final phase (0.3005 and 0.2173 in the live phase, respectively). This result is only slightly better than basic training alone, but I report it because it was the best submission in the competition. In addition, it was interesting to study the relationship between ensembling and distillation, as discussed above.
Thanks to the Sensorium organizers and participants for the excellent competition. Thanks to my family and friends who supported me during the competition!
- Linux (tested on Ubuntu 20.04 and 22.04)
- NVIDIA GPU (models trained on RTX A6000)
- NVIDIA Drivers >= 535, CUDA >= 12.2
- Docker
- NVIDIA Container Toolkit
The pipeline is tuned for training on a single RTX A6000 with 48 GB of memory. On a GPU with less memory, you can use gradient accumulation by increasing the `iter_size` parameter in the training configs. It will slightly worsen the result (by about 0.002 score for `"iter_size": 2`), but less than reducing the batch size would.
Clone the repo and enter the folder.
```bash
git clone git@github.com:lRomul/sensorium.git
cd sensorium
```
Build a Docker image and run a container.
Here is a small guide on how to use the provided `Makefile`:
```bash
make  # stop, build, run

# do the same
make stop
make build
make run

make                          # by default all GPUs passed
make GPUS=all                 # do the same
make GPUS=none                # without GPUs
make run GPUS=2               # pass the first two GPUs
make run GPUS='"device=1,2"'  # pass GPUs numbered 1 and 2

make logs
make exec                     # run a new command in a running container
make exec COMMAND="bash"      # do the same
make stop

make
```
From now on, you should run all commands inside the docker container.
If you already have the Sensorium 2023 dataset (148 GB), copy it to the folder `./data/sensorium_all_2023/`. Otherwise, use the script for downloading:
```bash
python scripts/download_data.py
```
You can now reproduce the final results of the solution using the following commands:
```bash
# Train
# The training time is 3.5 days (12 hours per fold) for each experiment on a single A6000
# You can speed up the process by using the --folds argument to train folds in parallel
# Or just download trained weights in the section below
python scripts/train.py -e true_batch_001
python scripts/train.py -e distillation_001

# Predict
# Any GPU with more than 6 GB memory will be enough
python scripts/predict.py -e true_batch_001 -s live_test_main
python scripts/predict.py -e true_batch_001 -s live_test_bonus
python scripts/predict.py -e true_batch_001 -s final_test_main
python scripts/predict.py -e true_batch_001 -s final_test_bonus
python scripts/predict.py -e distillation_001 -s live_test_main
python scripts/predict.py -e distillation_001 -s live_test_bonus
python scripts/predict.py -e distillation_001 -s final_test_main
python scripts/predict.py -e distillation_001 -s final_test_bonus

# Ensemble predictions of two experiments
python scripts/ensemble.py -e distillation_001,true_batch_001 -s live_test_main
python scripts/ensemble.py -e distillation_001,true_batch_001 -s live_test_bonus
python scripts/ensemble.py -e distillation_001,true_batch_001 -s final_test_main
python scripts/ensemble.py -e distillation_001,true_batch_001 -s final_test_bonus

# Final predictions will be there
cd data/predictions/distillation_001,true_batch_001
```
You can skip the training step by downloading the model weights (9.5 GB) using the torrent file.
Place the files in the data directory so that the folder structure is as follows:
```
data
├── experiments
│   ├── distillation_001
│   └── true_batch_001
└── sensorium_all_2023
    ├── dynamic29156-11-10-Video-8744edeac3b4d1ce16b680916b5267ce
    ├── dynamic29228-2-10-Video-8744edeac3b4d1ce16b680916b5267ce
    ├── dynamic29234-6-9-Video-8744edeac3b4d1ce16b680916b5267ce
    ├── dynamic29513-3-5-Video-8744edeac3b4d1ce16b680916b5267ce
    ├── dynamic29514-2-9-Video-8744edeac3b4d1ce16b680916b5267ce
    ├── dynamic29515-10-12-Video-9b4f6a1a067fe51e15306b9628efea20
    ├── dynamic29623-4-9-Video-9b4f6a1a067fe51e15306b9628efea20
    ├── dynamic29647-19-8-Video-9b4f6a1a067fe51e15306b9628efea20
    ├── dynamic29712-5-9-Video-9b4f6a1a067fe51e15306b9628efea20
    └── dynamic29755-2-8-Video-9b4f6a1a067fe51e15306b9628efea20
```
[^1]: Turishcheva, Polina, et al. (2023). The Dynamic Sensorium competition for predicting large-scale mouse visual cortex activity from videos. https://arxiv.org/abs/2305.19654
[^2]: Sandler, Mark, et al. (2018). MobileNetV2: Inverted residuals and linear bottlenecks. https://arxiv.org/abs/1801.04381
[^3]: Tan, Mingxing, and Quoc Le. (2019). EfficientNet: Rethinking model scaling for convolutional neural networks. https://arxiv.org/abs/1905.11946
[^4]: Vaswani, Ashish, et al. (2017). Attention is all you need. https://arxiv.org/abs/1706.03762
[^5]: Tran, Du, et al. (2018). A closer look at spatiotemporal convolutions for action recognition. https://arxiv.org/abs/1711.11248
[^6]: He, Kaiming, et al. (2016). Deep residual learning for image recognition. https://arxiv.org/abs/1512.03385
[^7]: Hu, Jie, Li Shen, and Gang Sun. (2018). Squeeze-and-excitation networks. https://arxiv.org/abs/1709.01507
[^8]: Larsson, Gustav, Michael Maire, and Gregory Shakhnarovich. (2016). FractalNet: Ultra-deep neural networks without residuals. https://arxiv.org/abs/1605.07648
[^9]: Huang, Gao, et al. (2016). Deep networks with stochastic depth. https://arxiv.org/abs/1603.09382
[^10]: Höfling, Larissa, et al. (2022). A chromatic feature detector in the retina signals visual context changes. https://www.biorxiv.org/content/10.1101/2022.11.30.518492
[^11]: Lurz, Konstantin-Klemens, et al. (2020). Generalization in data-driven models of primary visual cortex. https://www.biorxiv.org/content/10.1101/2020.10.05.326256
[^12]: Zhang, Xiangyu, et al. (2018). ShuffleNet: An extremely efficient convolutional neural network for mobile devices. https://arxiv.org/abs/1707.01083
[^13]: Loshchilov, Ilya, and Frank Hutter. (2017). Decoupled weight decay regularization. https://arxiv.org/abs/1711.05101
[^14]: Yun, Sangdoo, et al. (2019). CutMix: Regularization strategy to train strong classifiers with localizable features. https://arxiv.org/abs/1905.04899
[^15]: Hinton, Geoffrey, Oriol Vinyals, and Jeff Dean. (2015). Distilling the knowledge in a neural network. https://arxiv.org/abs/1503.02531
[^16]: Allen-Zhu, Zeyuan, and Yuanzhi Li. (2020). Towards understanding ensemble, knowledge distillation and self-distillation in deep learning. https://arxiv.org/abs/2012.09816