Macro F1 score calculation for binary classification not working as expected #2549

96francesco · 2024-05-20T09:42:10Z

96francesco
May 20, 2024

Hi all, I am encountering a discrepancy in the macro F1 score calculation when using torchmetrics for binary classification. The macro F1 score calculated using torchmetrics is identical to the F1 score of class 1, rather than the average of the F1 scores for both classes. I am not sure if this is expected behavior or a potential issue. I was not able to calculate the F1 score for separate classes with BinaryF1Score, so I opted to use the multiclass setting.
I am encountering this error with a CNN for semantic segmentation on satellite images, but I get the same discrepancy with a mock dataset and model.

Here is the code:

import pytorch_lightning as pl
import torch
import torch.nn as nn
import torchmetrics

from torch.utils.data import DataLoader, Dataset
from pytorch_lightning import seed_everything

seed = 42
torch.manual_seed(seed)
seed_everything(seed, workers=True)


class RandomDataset(Dataset):
      def __init__(self, size, length):
            self.len = length
            self.data = torch.randn(length, 1, size, size)
            self.labels = torch.randint(0, 2, (length, 1, size, size)).float()

      def __getitem__(self, index):
            return self.data[index], self.data[index], self.labels[index]

      def __len__(self):
            return self.len

class DummyModel(pl.LightningModule):
      def __init__(self, lr=1e-3, threshold=0.5):
            super(DummyModel, self).__init__()
            self.lr = lr
            self.threshold = threshold
            self.criterion = nn.BCEWithLogitsLoss()

            # dummy encoder and decoder
            self.encoder = nn.Sequential(
                  nn.Conv2d(1, 16, kernel_size=3, padding=1),
                  nn.ReLU(),
                  nn.Conv2d(16, 32, kernel_size=3, padding=1),
                  nn.ReLU()
            )
            self.decoder = nn.Sequential(
                  nn.Conv2d(32, 16, kernel_size=3, padding=1),
                  nn.ReLU(),
                  nn.Conv2d(16, 1, kernel_size=3, padding=1)
            )

            # initialize F1 score calculation
            self.f1_score_macro = torchmetrics.F1Score(task='binary', 
                                                       average='macro', 
                                                       threshold=threshold)
            self.f1_score_class = torchmetrics.F1Score(task='multiclass',
                                                        average='none', 
                                                        threshold=threshold, 
                                                        num_classes=2)

      def forward(self, x):
            x = self.encoder(x)
            x = self.decoder(x)
            return x

      def training_step(self, batch, batch_idx):
            x, labels = batch
            logits = self(x)
            loss = self.criterion(logits, labels)
            return loss

      def test_step(self, batch, batch_idx):
            x1, x2, labels = batch
            logits = self(x1)
            probs = torch.sigmoid(logits)
            preds = (probs > self.threshold).float()

            # calculate class-wise F1
            f1_score_class = self.f1_score_class(preds.squeeze(1), labels.squeeze(1))

            # calculate macro F1 manually
            f1_score_macro_manual = f1_score_class.mean()

            # calculate macro F1 using torchmetrics
            f1_score_macro_torch = self.f1_score_macro(preds, labels)

            # log metrics
            print(f'test_f1score_class0: {f1_score_class[0]}')
            print(f'test_f1score_class1: {f1_score_class[1]}')
            print(f'test_f1score_macro_manual: {f1_score_macro_manual}')
            print(f'test_f1score_macro_torchmetrics: {f1_score_macro_torch}')

      def configure_optimizers(self):
            return torch.optim.Adam(self.parameters(), lr=self.lr)

# initialize mock dataset
dataset = RandomDataset(32, 100)
dataloader = DataLoader(dataset, batch_size=10)

model = DummyModel()
trainer = pl.Trainer(max_epochs=1)
trainer.test(model, dataloaders=dataloader)

I would expect to obtain a macro F1 score that is the average of the F1 scores of both classes with torchmetrics, but this is the result:

test_f1score_class0: 0.6693527698516846
test_f1score_class1: 0.0007855459698475897
test_f1score_macro_manual: 0.3350691497325897
test_f1score_macro_torchmetrics: 0.0007855459698475897

As you can see, the macro F1 score calculated with torchmetrics equals the F1 score of class 1.

Some details of my environment:

pytorch-lightning         2.2.0.post0              
torch                     2.2.0+cu118             
torchmetrics              1.3.1                    
torchvision               0.17.0+cu118

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Macro F1 score calculation for binary classification not working as expected #2549

{{title}}

{{editor}}'s edit

{{editor}}'s edit

Replies: 0 comments

Select a reply

Macro F1 score calculation for binary classification not working as expected #2549

96francesco May 20, 2024

Replies: 0 comments

96francesco
May 20, 2024