Welcome to DataCV Challenge 2023!
This is the development kit repository for the 1st DataCV Challenge. This repository includes details on how to download the datasets, run the baseline models, and organize your results as an answer.zip file. The final evaluation will take place on the CodaLab evaluation server, where all competition information, rules, and dates can be found.
The competition task is label-free model evaluation (AutoEval). Unlike standard evaluation, which calculates model accuracy from model outputs and the corresponding test labels, AutoEval has no access to test labels. In this competition, participants need to design a method that estimates a model's accuracy on test sets without ground-truth labels.
In total, the test set comprises 100 datasets. As a result, each model's accuracy should be predicted 100 times. Given that there are two models to be evaluated, the expected number of lines in the "answer.txt" file is 200.
The first 100 lines are the accuracy predictions of the ResNet model, and the second 100 lines are those of the RepVGG model. Within each 100-line block, the i-th line is the accuracy prediction for that model on the xxx.npy dataset, where xxx runs from 001 to 100.
To prepare your submission, you need to write your predicted accuracies into a plain text file named "answer.txt", with one prediction (e.g., 0.876543) per line. For example,
0.100000
0.100000
.
.
.
0.100000
0.100000
0.100000
0.100000
Then, zip the text file and submit it to the competition website.
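As an illustration of the expected format, here is a minimal sketch that writes 200 placeholder predictions (100 per model) to "answer.txt" and zips it. The prediction values and file names are examples only; the repository's own store_ans helper (described below) can be used instead.

```python
import zipfile

# Placeholder predictions: 100 per model, 200 lines in total (illustrative values).
resnet_preds = [0.1] * 100   # replace with your ResNet-56 accuracy estimates
repvgg_preds = [0.1] * 100   # replace with your RepVGG-A0 accuracy estimates

with open("answer.txt", "w") as f:
    for acc in resnet_preds + repvgg_preds:
        f.write(f"{acc:.6f}\n")   # one prediction per line, e.g. 0.876543

# Zip the text file for submission.
with zipfile.ZipFile("answer.zip", "w") as zf:
    zf.write("answer.txt")
```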
How to organize an answer.txt file for validation evaluation?
Please refer to Organize Results for Submission.
In the competition, you are only required to submit the zipped prediction file named "answer.txt". We give an example of this txt file at answer.txt demo.
The training set consists of 1,000 datasets transformed from the original CIFAR-10 test set, using the transformation strategy proposed by Deng et al. (2021). The validation set is composed of CIFAR-10.1, CIFAR-10.1-C (CIFAR-10.1 with corruptions (Hendrycks et al., 2019) added), and CIFAR-10-F (real-world images collected from Flickr).
The CIFAR-10.1 dataset is a single dataset, whereas CIFAR-10.1-C and CIFAR-10-F contain 19 and 20 datasets, respectively. Therefore, the validation set contains 40 datasets in total.
The training datasets share a common label file named labels.npy, and the image files are named new_data_xxx.npy, where xxx is a number from 000 to 999. For every dataset in the validation set, the images and their labels are stored as two separate NumPy array files named "data.npy" and "labels.npy". The PyTorch implementation of the Dataset class for loading the data can be found in utils.py.
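For reference, below is a minimal sketch of how such a dataset could be loaded with NumPy and wrapped in a PyTorch Dataset. The actual implementation used by the baselines is in utils.py; the class name, shapes, and paths here are only assumptions.

```python
import numpy as np
from torch.utils.data import Dataset

class NpyImageDataset(Dataset):
    """Hypothetical loader for a validation dataset stored as data.npy / labels.npy."""

    def __init__(self, data_path, label_path, transform=None):
        self.images = np.load(data_path)    # e.g. shape (N, 32, 32, 3), uint8
        self.labels = np.load(label_path)   # shape (N,)
        self.transform = transform

    def __len__(self):
        return len(self.images)

    def __getitem__(self, idx):
        image = self.images[idx]
        if self.transform is not None:
            image = self.transform(image)
        return image, int(self.labels[idx])

# Example usage (paths are placeholders):
# dataset = NpyImageDataset("CIFAR-10.1/data.npy", "CIFAR-10.1/labels.npy")
```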
Download the training datasets: link
Download the validation datasets: link
Download the training datasets' accuracies on the ResNet-56 model: link
Download the training datasets' accuracies on the RepVGG-A0 model: link
NOTE: To access the test datasets and participate in the competition, please fill in the Datasets Request Form and send the signed form to the competition organiser. Failure to provide the form will lead to the revocation of your CodaLab account for the competition.
In this competition, the classifiers being evaluated are ResNet-56 and RepVGG-A0. Both implementations are available in the public repository at https://github.com/chenyaofo/pytorch-cifar-models. To load the models with their pretrained weights, use the code below.
import torch
# Load the ResNet-56 classifier pretrained on CIFAR-10:
model = torch.hub.load("chenyaofo/pytorch-cifar-models", "cifar10_resnet56", pretrained=True)
# Load the RepVGG-A0 classifier pretrained on CIFAR-10:
model = torch.hub.load("chenyaofo/pytorch-cifar-models", "cifar10_repvgg_a0", pretrained=True)
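For development, you can check a classifier's true accuracy on one of the validation datasets (which come with labels). The sketch below is only an illustration: the file paths and the normalisation statistics are assumptions, so check the model repository for the exact preprocessing.

```python
import numpy as np
import torch

# Load one of the evaluated classifiers (ResNet-56 here).
model = torch.hub.load("chenyaofo/pytorch-cifar-models", "cifar10_resnet56", pretrained=True)
device = "cuda" if torch.cuda.is_available() else "cpu"
model = model.to(device).eval()

# Assumed paths to one validation dataset; adjust to where you unpacked the data.
images = np.load("CIFAR-10.1/data.npy")    # e.g. (N, 32, 32, 3), uint8
labels = np.load("CIFAR-10.1/labels.npy")  # (N,)

# Commonly used CIFAR-10 normalisation statistics (an assumption here).
mean = np.array([0.4914, 0.4822, 0.4465], dtype=np.float32)
std = np.array([0.2023, 0.1994, 0.2010], dtype=np.float32)

x = (images.astype(np.float32) / 255.0 - mean) / std
x = torch.from_numpy(x).permute(0, 3, 1, 2).to(device)  # NHWC -> NCHW

correct = 0
with torch.no_grad():
    for i in range(0, len(labels), 256):
        preds = model(x[i:i + 256]).argmax(dim=1).cpu().numpy()
        correct += int((preds == labels[i:i + 256]).sum())

print("True accuracy on this validation dataset:", correct / len(labels))
```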
As we use automated evaluation scripts for submissions, the format of the submitted files is very important. A function named store_ans, provided in code/utils.py and results_format/get_answer_txt.py, writes the accuracy predictions in the required format. Read the results_format/get_answer_txt.py file to understand how the function is used, and execute the command below to see the resulting file.
python3 results_format/get_answer_txt.py
The necessary dependencies are specified in the requirements.txt file, and the experiments were conducted using Python 3.10.8 with a single GeForce RTX 2080 Ti GPU.
The tables below (one per evaluated classifier) report the baseline results in terms of root-mean-square error (RMSE) between predicted and true accuracies. Accuracies are converted into percentages before the RMSE is computed.
| Method | CIFAR-10.1 | CIFAR-10.1-C | CIFAR-10-F | Overall |
|---|---|---|---|---|
| Rotation | 7.285 | 6.386 | 7.763 | 7.129 |
| ConfScore | 2.190 | 9.743 | 2.676 | 6.985 |
| Entropy | 2.424 | 10.300 | 2.913 | 7.402 |
| ATC | 11.428 | 5.964 | 8.960 | 7.766 |
| FID | 7.517 | 5.145 | 4.662 | 4.985 |

| Method | CIFAR-10.1 | CIFAR-10.1-C | CIFAR-10-F | Overall |
|---|---|---|---|---|
| Rotation | 16.726 | 17.137 | 8.105 | 13.391 |
| ConfScore | 5.470 | 12.004 | 3.709 | 8.722 |
| Entropy | 5.997 | 12.645 | 3.419 | 9.093 |
| ATC | 15.168 | 8.050 | 7.694 | 8.132 |
| FID | 10.718 | 6.318 | 5.245 | 5.966 |
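You can score your own method on the validation sets in the same way. The sketch below computes RMSE in percentage points, assuming you have lists of predicted and true accuracies (the variable names and values are illustrative).

```python
import numpy as np

def rmse_percent(predicted_accs, true_accs):
    """RMSE between predicted and true accuracies, in percentage points."""
    pred = np.asarray(predicted_accs, dtype=float) * 100.0  # fractions -> percent
    true = np.asarray(true_accs, dtype=float) * 100.0
    return float(np.sqrt(np.mean((pred - true) ** 2)))

# Illustrative values only:
print(rmse_percent([0.80, 0.70], [0.82, 0.65]))  # -> ~3.81
```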
To install required Python libraries, execute the code below.
pip3 install -r requirements.txt
The above results can be replicated by executing the code provided below in the terminal.
cd code/
chmod u+x run_baselines.sh && ./run_baselines.sh
To run one specific baseline, use the code below.
cd code/
python3 get_accuracy.py --model <resnet/repvgg> --dataset_path DATASET_PATH
python3 baselines/BASELINE.py --model <resnet/repvgg> --dataset_path DATASET_PATH
The following briefly outlines each baseline method, following the appendix of the paper "Predicting Out-of-Distribution Error with the Projection Norm" (Yu et al., 2022).
Rotation. The Rotation Prediction (Rotation) (Deng et al., 2021) metric is defined as

$$\text{Rotation} = \frac{1}{|\tilde{D}|} \sum_{\tilde{x} \in \tilde{D}} \frac{1}{4} \sum_{r \in \{0^\circ, 90^\circ, 180^\circ, 270^\circ\}} \mathbb{1}\left\{ \hat{y}_r(\tilde{x}) \neq y_r \right\},$$

where $\tilde{D}$ denotes the unlabeled test set, $y_r$ is the label corresponding to rotation degree $r$, and $\hat{y}_r(\tilde{x})$ is the rotation predicted by the self-supervised rotation head for image $\tilde{x}$.
ConfScore. The Averaged Confidence (ConfScore) (Hendrycks & Gimpel, 2016) is defined as

$$\text{ConfScore} = \frac{1}{|\tilde{D}|} \sum_{\tilde{x} \in \tilde{D}} \max_{j}\, \mathrm{Softmax}\big(f(\tilde{x})\big)_j,$$

where $f(\tilde{x})$ denotes the classifier's logits for test image $\tilde{x}$.
Entropy. The Entropy (Guillory et al., 2021) metric is defined as

$$\text{Entropy} = \frac{1}{|\tilde{D}|} \sum_{\tilde{x} \in \tilde{D}} \mathrm{Ent}\Big(\mathrm{Softmax}\big(f(\tilde{x})\big)\Big), \qquad \mathrm{Ent}(p) = -\sum_{j} p_j \log p_j .$$
ATC. The Averaged Threshold Confidence (ATC) (Garg et al., 2022) is defined as

$$\text{ATC} = \frac{1}{|\tilde{D}|} \sum_{\tilde{x} \in \tilde{D}} \mathbb{1}\left\{ s\Big(\mathrm{Softmax}\big(f(\tilde{x})\big)\Big) < t \right\},$$

where $s(p) = \sum_{j} p_j \log p_j$ is the negative-entropy score and the threshold $t$ is chosen on labeled in-distribution validation data such that the fraction of validation samples with score below $t$ matches the model's validation error.
FID. The Fréchet Distance (FD) between datasets (Deng et al., 2020) is defined as

$$\mathrm{FD}\big(D_{\mathrm{ori}}, \tilde{D}\big) = \big\lVert \mu_{\mathrm{ori}} - \mu_{\tilde{D}} \big\rVert_2^2 + \mathrm{Tr}\Big( \Sigma_{\mathrm{ori}} + \Sigma_{\tilde{D}} - 2\big(\Sigma_{\mathrm{ori}} \Sigma_{\tilde{D}}\big)^{\frac{1}{2}} \Big),$$

where $\mu_{\mathrm{ori}}, \Sigma_{\mathrm{ori}}$ and $\mu_{\tilde{D}}, \Sigma_{\tilde{D}}$ are the mean vectors and covariance matrices of the image features extracted from the original (training) distribution $D_{\mathrm{ori}}$ and the test set $\tilde{D}$, respectively.
The Fréchet Distance calculation functions used in this analysis were sourced from a publicly available repository by Weijian Deng.
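To make the softmax-based scores concrete, here is a minimal sketch (not the repository's implementation) computing ConfScore, Entropy, and the ATC score from a matrix of logits. The ATC threshold t is assumed to have been fitted separately on labeled validation data.

```python
import numpy as np

def softmax(logits):
    z = logits - logits.max(axis=1, keepdims=True)  # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def conf_score(logits):
    """Average maximum softmax confidence over the dataset."""
    return softmax(logits).max(axis=1).mean()

def entropy_score(logits):
    """Average prediction entropy over the dataset."""
    p = softmax(logits)
    return (-(p * np.log(p + 1e-12)).sum(axis=1)).mean()

def atc_score(logits, t):
    """Fraction of samples whose negative-entropy score falls below threshold t."""
    p = softmax(logits)
    s = (p * np.log(p + 1e-12)).sum(axis=1)  # negative entropy
    return (s < t).mean()

# logits: an (N, 10) array of classifier outputs on one test dataset (illustrative).
```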