This repository has been archived by the owner on Sep 28, 2021. It is now read-only.

classification loss - negative sample labels #6

Open
DishaDRao opened this issue Mar 14, 2021 · 14 comments

Comments

@DishaDRao

DishaDRao commented Mar 14, 2021

Hi,

In the following snippet from the loss.py file:

classify_loss = 0.5 * self.classify_loss(
    pos_prob, pos_labels[:, 0]) + 0.5 * self.classify_loss(
    neg_prob, neg_labels + 1)

Why is the target label for the sigmoid loss of the negative samples given as 'neg_labels + 1'?
Shouldn't it be just 'neg_labels', since 'neg_labels' is already initialized to 0?
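
To make the question concrete, here is a minimal sketch in plain Python (not the repository's loss.py) of binary cross-entropy for a single negative anchor. The helper name `bce` and the probability 0.01 are illustrative assumptions.

```python
import math

def bce(prob, target):
    """Binary cross-entropy for one predicted probability and a 0/1 target."""
    eps = 1e-12  # guards against log(0)
    return -(target * math.log(prob + eps) + (1 - target) * math.log(1 - prob + eps))

# A negative anchor that the network scores as very unlikely to be a nodule.
neg_prob = 0.01
neg_label = 0  # negatives are already stored as 0 in this repository

loss_as_is = bce(neg_prob, neg_label)         # target 0: small loss
loss_plus_one = bce(neg_prob, neg_label + 1)  # target 1: large loss, pushes prob toward 1

print(f"target 0: {loss_as_is:.4f}, target 1: {loss_plus_one:.4f}")
```

If the negatives are already 0, adding 1 would reward the network for calling background an object, which is the concern raised here.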

@naoe1999

@DishaDRao
I have the same concern about this.

I trained fully with the existing code and got strange output that doesn't make sense at all.
So I suspect the loss function for the same reason you mentioned.

Did you get any results with it?
I will try it anyway.

@DishaDRao
Author

@naoe1999

Well, I did not try this code. However, I went through the code base of the original implementation (the winning team's) and understood where this labelling came from.

Basically, in the original implementation, the target labels for negative anchor boxes ('neg_labels') are set to -1. Hence it makes sense to write 'neg_labels + 1' during the loss computation to turn them into 0 (0 stands for no object and 1 stands for an object).

However, in the current code base, the target labels for negative anchor boxes are already 0, so it doesn't make sense to write 'neg_labels + 1' during the loss computation.

In short, I think it's a mistake here, and I suggest running this code without adding 1 to neg_labels:

classify_loss = 0.5 * self.classify_loss(
    pos_prob, pos_labels[:, 0]) + 0.5 * self.classify_loss(
    neg_prob, neg_labels)

Hope this works. If not, then the issue is in some other part of the code!
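
A small sketch of the labelling convention described above, using NumPy. The array values and the 0.5 / -0.5 thresholds are assumptions for illustration, as is the reading of 0 as "ignored" (it matches the LabelMapping code quoted later in this thread); none of this is copied from either repository's loss.

```python
import numpy as np

# Hypothetical classification labels for six anchors:
#  1 = positive, -1 = sampled negative, 0 = ignored (assumed convention).
labels = np.array([-1.0, 0.0, 1.0, -1.0, 0.0, -1.0])

pos_labels = labels[labels > 0.5]    # targets for positive anchors, all 1
neg_labels = labels[labels < -0.5]   # targets for negative anchors, all -1

neg_targets = neg_labels + 1         # shift -1 -> 0 so the loss sees "no object"

print(pos_labels, neg_targets)
```

Under this convention the +1 is needed. If negatives were stored as 0 instead, the same shift would turn them into 1.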

@MHansy

MHansy commented May 27, 2021

Hello, could you kindly help me with the testing code (how to test the trained model) so as to get the predicted nodules?

@MHansy

MHansy commented May 27, 2021

Testing code, please.

@naoe1999

@DishaDRao
Thank you for your advice.

However, even after I changed the loss function, I couldn't get any meaningful result.
After training for, say, 50 epochs, the output values become identical for every cell of the 3D grid.

Yes, I guess some other part has an issue. Let me point out a possible one.

The shape of the output tensor is (32, 32, 32, 3, 5),
which is (# of cells along the x-axis, # of cells along the y-axis, # of cells along the z-axis, # of anchors in each cell, # of values, i.e. x, y, z, r, c).

And the values inside this output tensor are repeated in every single cell.
For example, output[0, 0, 0, :, :] == output[l, m, n, :, :] for all l, m, and n. Identical!!
This shouldn't happen if the model had been trained properly.
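
One way to check for this symptom is sketched below, assuming the output has been converted to a NumPy array of shape (32, 32, 32, 3, 5). The function name `cells_are_identical` and the random test arrays are hypothetical.

```python
import numpy as np

def cells_are_identical(output, atol=1e-6):
    """Return True if every grid cell holds the same (anchors, values) block."""
    reference = output[0, 0, 0]  # shape (3, 5)
    # The (3, 5) reference broadcasts against the full (32, 32, 32, 3, 5) grid.
    return bool(np.allclose(output, reference, atol=atol))

rng = np.random.default_rng(0)
collapsed = np.tile(rng.normal(size=(3, 5)), (32, 32, 32, 1, 1))  # same block in every cell
healthy = rng.normal(size=(32, 32, 32, 3, 5))                      # values differ per cell

print(cells_are_identical(collapsed), cells_are_identical(healthy))
```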

I think this is due to the very heavily imbalanced positive vs. negative ratio inside the target tensor.
If you have one nodule in a given 3D patch and look inside the "target" tensor, only one of the 3 x 32 x 32 x 32 = 98304 anchor entries has a positive label. This makes it a 1 : 98303 (3 x 32 x 32 x 32 - 1) imbalanced classification problem!!
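
The arithmetic behind that ratio, as a short sketch (the variable names are illustrative only):

```python
grid = (32, 32, 32)      # cells along each axis
anchors_per_cell = 3

total_anchors = anchors_per_cell * grid[0] * grid[1] * grid[2]
positives = 1            # one nodule assigned to one anchor
negatives = total_anchors - positives

print(total_anchors, negatives)
print(f"positive fraction: {positives / total_anchors:.2e}")
```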

After enough training iterations, it ends up predicting every value as negative.
That is my theory. Well, I'm not sure it's the only issue, but I am quite sure it is at least one of the major ones.

To solve this, assigning multiple anchors to each GT nodule and randomly sampling the negative target cells may be necessary.
I'm just not sure whether it would be smart to keep working on this code base instead of looking for another one and moving on.

Could you tell me if you have any suggestions, or any other code base you would recommend?

@naoe1999

@MHansy

I didn't write test code for this model.
I ran into this problem when I finished training with this code base, which made me stop at that point.

Without solving this, testing is meaningless.
The test result would be 0% in every detection score (FROC, recall, and so on), because the model would predict every input as negative!

Anyway, this is the test scheme I was going to follow once the model gives meaningful output:

  1. Get patches that cover the whole lung volume of each validation CT scan.

  2. Get the output predictions from the trained model.

  3. Store all the positive predictions as a .csv file (same format as LUNA16's sampleSubmission.csv).
    NMS (non-maximum suppression) would be necessary for this step.

  4. Use the noduleCADEvaluationLUNA16.py file to get the test score (FROC and so on).

You can download sampleSubmission.csv and noduleCADEvaluationLUNA16.py from LUNA16's official site.
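
The NMS part of step 3 could look roughly like the sketch below. It is a generic greedy NMS, not code from any of the repositories discussed here. It assumes each candidate is a row of (probability, x, y, z, diameter) and treats nodules as axis-aligned cubes; the function names and the 0.1 IoU threshold are assumptions.

```python
import numpy as np

def iou_3d(a, b):
    """IoU of two axis-aligned cubes given as (x, y, z, diameter)."""
    a_min, a_max = a[:3] - a[3] / 2, a[:3] + a[3] / 2
    b_min, b_max = b[:3] - b[3] / 2, b[:3] + b[3] / 2
    overlap = np.clip(np.minimum(a_max, b_max) - np.maximum(a_min, b_min), 0, None)
    inter = overlap.prod()
    union = a[3] ** 3 + b[3] ** 3 - inter
    return inter / union

def nms(candidates, iou_threshold=0.1):
    """Greedy NMS over rows of (probability, x, y, z, diameter)."""
    order = np.argsort(-candidates[:, 0])  # highest probability first
    kept = []
    for idx in order:
        box = candidates[idx, 1:]
        # Keep a candidate only if it does not overlap anything already kept.
        if all(iou_3d(box, candidates[k, 1:]) < iou_threshold for k in kept):
            kept.append(idx)
    return candidates[kept]

candidates = np.array([
    [0.90, 50.0, 50.0, 50.0, 10.0],
    [0.80, 51.0, 50.0, 50.0, 10.0],  # overlaps heavily with the first row
    [0.70, 90.0, 20.0, 30.0, 8.0],   # far away from both
])
print(nms(candidates))
```

The surviving rows would then be written out in the column layout of sampleSubmission.csv before running the evaluation script.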

@DishaDRao
Author

DishaDRao commented May 28, 2021

@naoe1999 @MHansy

The problem of class imbalance is actually taken care of in the loss function. Even though the target labels may contain the positive-to-negative ratio that you mentioned, the loss function handles this by employing 'hard negative mining' (similar to your idea of randomly sampling the negatives), which restricts the number of negative anchor boxes to 2 (depending on the batch size) per mini-batch. That means the network sees an equal (or 1:2) ratio of positive and negative anchor boxes during the loss computation.
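
As a rough illustration of the idea, here is a generic hard-negative-mining sketch in NumPy. The function name and the scores are hypothetical, and the actual loss in these repositories may differ in detail.

```python
import numpy as np

def hard_negative_mining(neg_prob, num_hard):
    """Return indices of the num_hard negatives with the highest predicted probability."""
    num_hard = min(num_hard, len(neg_prob))
    # Negatives scored closest to "object" are the hardest ones.
    return np.argsort(-neg_prob)[:num_hard]

neg_prob = np.array([0.02, 0.91, 0.10, 0.65, 0.01])  # hypothetical scores for negative anchors
idcs = hard_negative_mining(neg_prob, num_hard=2)
print(idcs, neg_prob[idcs])
```

Only the selected negatives contribute to the classification loss, so the easy background anchors do not swamp the single positive.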

I strongly believe the problem in this code is how the rest of the targets are labelled. The anchor boxes for bounding-box regression should be labelled based on their IoU with a ground-truth box and a center-to-center parameterization relative to it (as in standard Faster R-CNN). I don't see how that is done in this code.

If the target itself doesn't have the right (position) labels, then I wouldn't expect to get any meaningful results after training. (Giving it the benefit of the doubt, even if the targets are labelled correctly, testing requires de-parameterization of the predictions, which can be done only once the target computation has been deciphered.)
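
A minimal sketch of such a parameterization and its inverse. The formulas follow the same form as the dz/dh/dw/dd lines of the LabelMapping code quoted later in this thread; the function names and the numbers are hypothetical.

```python
import math

def encode(target, cell_center, anchor):
    """Map a ground-truth (z, h, w, d) to regression targets relative to one anchor."""
    dz, dh, dw = [(t - c) / anchor for t, c in zip(target[:3], cell_center)]
    dd = math.log(target[3] / anchor)
    return [dz, dh, dw, dd]

def decode(offsets, cell_center, anchor):
    """Inverse of encode: recover (z, h, w, d) from predicted offsets."""
    z, h, w = [c + o * anchor for o, c in zip(offsets[:3], cell_center)]
    d = anchor * math.exp(offsets[3])
    return [z, h, w, d]

target = [41.0, 60.5, 33.0, 12.0]  # hypothetical nodule: centre and diameter in voxels
cell_center = [41.5, 61.5, 33.5]   # hypothetical centre of the assigned grid cell
anchor = 10.0                      # hypothetical anchor size

offsets = encode(target, cell_center, anchor)
recovered = decode(offsets, cell_center, anchor)
print(offsets, recovered)
```

At test time only `decode` is needed, which is why the target computation has to be understood before predictions can be turned back into coordinates.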

In short, I wouldn't use this code for training. This repository is nice for getting an understanding of the preprocessing and augmentation part, but for an actual implementation I would recommend checking out the original code bases (lfz/DSB2017 or 'wentaozhu/DeepLung'). They are extremely similar; however, the latter repo is simpler, and it worked for me!

(PS: this repo has a Google Colab notebook provided at the end. However, I neither used it nor checked it out. I wanted a deeper understanding, hence skipped it entirely ;) )

@naoe1999

@DishaDRao

Thank you very much for your advice.
I've followed your recommendation, and it finally works for me too.

I'm using the 'wentaozhu/DeepLung' repository for training and evaluation with the LUNA16 dataset, and I have started getting meaningful FROC results.
For the segmentation of new CT scan data (which, unlike LUNA16, does not come with segmentation data), I also found this repository helpful.

Many thanks! :-)

@SirMwan

SirMwan commented Jul 17, 2021

Hello @DishaDRao and @naoe1999

Kindly help, please.

I tried to follow your conversation and advice, and I went through the wentaozhu/DeepLung repository. Unfortunately, in the loss code I find the same thing in the labels (+1).

But when training with that code, I found that the loss does not decrease. I am not sure whether I have to remove the (+1) on the labels in the code.

@SirMwan

SirMwan commented Jul 17, 2021

@DishaDRao were you referring to the part below? This is from the data.py file in the wentaozhu repository.

import random

import numpy as np

# Note: select_samples is defined elsewhere in data.py.


class LabelMapping(object):
    def __init__(self, config, phase):
        self.stride = np.array(config['stride'])
        self.num_neg = int(config['num_neg'])
        self.th_neg = config['th_neg']
        self.anchors = np.asarray(config['anchors'])
        self.phase = phase
        if phase == 'train':
            self.th_pos = config['th_pos_train']
        elif phase == 'val':
            self.th_pos = config['th_pos_val']

    def __call__(self, input_size, target, bboxes, filename):
        stride = self.stride
        num_neg = self.num_neg
        th_neg = self.th_neg
        anchors = self.anchors
        th_pos = self.th_pos

        output_size = []
        for i in range(3):
            if input_size[i] % stride != 0:
                print(filename)
            # assert(input_size[i] % stride == 0)
            output_size.append(int(input_size[i] / stride))  # user's edit: int() cast

        # Every anchor starts as a negative candidate (-1).
        label = -1 * np.ones(output_size + [len(anchors), 5], np.float32)  # user's note: change from np.float32
        offset = ((stride.astype('float')) - 1) / 2
        oz = np.arange(offset, offset + stride * (output_size[0] - 1) + 1, stride)
        oh = np.arange(offset, offset + stride * (output_size[1] - 1) + 1, stride)
        ow = np.arange(offset, offset + stride * (output_size[2] - 1) + 1, stride)

        # Anchors selected by select_samples with th_neg are set to 0 (neither positive nor negative).
        for bbox in bboxes:
            for i, anchor in enumerate(anchors):
                iz, ih, iw = select_samples(bbox, anchor, th_neg, oz, oh, ow)
                label[iz, ih, iw, i, 0] = 0

        # During training, keep at most num_neg randomly sampled negatives at -1.
        if self.phase == 'train' and self.num_neg > 0:
            neg_z, neg_h, neg_w, neg_a = np.where(label[:, :, :, :, 0] == -1)
            neg_idcs = random.sample(range(len(neg_z)), min(num_neg, len(neg_z)))
            neg_z, neg_h, neg_w, neg_a = neg_z[neg_idcs], neg_h[neg_idcs], neg_w[neg_idcs], neg_a[neg_idcs]
            label[:, :, :, :, 0] = 0
            label[neg_z, neg_h, neg_w, neg_a, 0] = -1

        if np.isnan(target[0]):
            return label
        iz, ih, iw, ia = [], [], [], []
        for i, anchor in enumerate(anchors):
            iiz, iih, iiw = select_samples(target, anchor, th_pos, oz, oh, ow)
            iz.append(iiz)
            ih.append(iih)
            iw.append(iiw)
            ia.append(i * np.ones((len(iiz),), np.int64))
        iz = np.concatenate(iz, 0)
        ih = np.concatenate(ih, 0)
        iw = np.concatenate(iw, 0)
        ia = np.concatenate(ia, 0)
        flag = True
        if len(iz) == 0:
            # No anchor passes th_pos: fall back to the nearest cell and the closest anchor size.
            pos = []
            for i in range(3):
                pos.append(max(0, int(np.round((target[i] - offset) / stride))))
            idx = np.argmin(np.abs(np.log(target[3] / anchors)))
            pos.append(idx)
            flag = False
        else:
            idx = random.sample(range(len(iz)), 1)[0]
            pos = [iz[idx], ih[idx], iw[idx], ia[idx]]
        # Regression targets relative to the chosen cell centre and anchor size.
        dz = (target[0] - oz[pos[0]]) / anchors[pos[3]]
        dh = (target[1] - oh[pos[1]]) / anchors[pos[3]]
        dw = (target[2] - ow[pos[2]]) / anchors[pos[3]]
        dd = np.log(target[3] / anchors[pos[3]])
        label[pos[0], pos[1], pos[2], pos[3], :] = [1, dz, dh, dw, dd]
        return label
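
Two small pieces of the code above can be tried in isolation: the cell-centre grid (`offset`, `oz`) and the fallback anchor choice (`np.argmin(np.abs(np.log(...)))`). The sketch below uses a deliberately tiny grid; the stride, grid size, anchor sizes and nodule diameter are hypothetical values, not the repository's configuration.

```python
import numpy as np

stride = 4
output_size = 4                        # cells along one axis (hypothetical small grid)
anchors = np.array([5.0, 10.0, 20.0])  # hypothetical anchor sizes

# Centre coordinate of each cell along one axis, as in the oz/oh/ow lines.
offset = (stride - 1) / 2
centers = np.arange(offset, offset + stride * (output_size - 1) + 1, stride)
print(centers)

# Fallback anchor choice: the anchor whose size is closest to the diameter in log scale.
diameter = 12.0                        # hypothetical nodule diameter
best_anchor = int(np.argmin(np.abs(np.log(diameter / anchors))))
print(best_anchor, anchors[best_anchor])
```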

@DishaDRao
Author

DishaDRao commented Jul 17, 2021

Hi,

If you're following the wentaozhu/DeepLung repository, you need not change anything in the loss function or in data.py. The negative samples are labelled correctly. As mentioned in my previous comment, the +1 in the loss function is there to turn the negative labels into 0. So it serves a purpose!
Whereas in this repo (mostafa/Luna16) that +1 would be a mistake, as the negative samples are not labelled the way data.py labels them in the other repo!

So, with Wentao's code, the loss problem that you're facing must be due to something else, probably your dataset or training method. Maybe you should look into their issues section.

@SirMwan

SirMwan commented Jul 17, 2021

@DishaDRao I am training on Google Colab. What I have done is reduce the batch size, and I am not using DataParallel in training because I use a single GPU.

Furthermore, changes in the PyTorch version seem to cause some issues, such as int casts that need to be added in some places. I have been at this for almost two months now and I am going crazy. I have started the process again and again, but with no success.

If you don't mind, please share your data.py, main.py and layers.py files with me.
My email is [email protected]
Thanks in advance.

@SirMwan

SirMwan commented Jul 17, 2021

This is the change in training I have done

import os
import time

import numpy as np
import torch
from torch.autograd import Variable

# Note: `args` is assumed to be a module-level global, as it is not passed in.


def train(data_loader, net, loss, epoch, optimizer, get_lr, save_freq, save_dir):
    start_time = time.time()

    net.train()
    lr = get_lr(epoch)
    for param_group in optimizer.param_groups:
        param_group['lr'] = lr

    metrics = []

    for i, (data, target, coord) in enumerate(data_loader):
        if torch.cuda.is_available():
            data = Variable(data.cuda())
            target = Variable(target.cuda())
            coord = Variable(coord.cuda())
        data = data.float()
        target = target.float()
        coord = coord.float()

        optimizer.zero_grad()
        output = net(data, coord)
        loss_output = loss(output, target)
        loss_output[0].backward()
        optimizer.step()

        # Changed: .item() converts the loss tensor to a Python float.
        loss_output[0] = loss_output[0].item()
        metrics.append(loss_output)

    if epoch % args.save_freq == 0:
        state_dict = net.state_dict()
        for key in state_dict.keys():
            state_dict[key] = state_dict[key].cpu()

        torch.save({
            'epoch': epoch,
            'save_dir': save_dir,
            'state_dict': state_dict,
            'args': args},
            os.path.join(save_dir, '%03d.ckpt' % epoch))

    end_time = time.time()
    metrics = np.asarray(metrics, np.float32)
    print('Epoch %03d (lr %.5f)' % (epoch, lr))
    print('Train:      tpr %3.2f, tnr %3.2f, total pos %d, total neg %d, time %3.2f' % (
        100.0 * np.sum(metrics[:, 6]) / np.sum(metrics[:, 7]),
        100.0 * np.sum(metrics[:, 8]) / np.sum(metrics[:, 9]),
        np.sum(metrics[:, 7]),
        np.sum(metrics[:, 9]),
        end_time - start_time))
    print('loss %2.4f, classify loss %2.4f, regress loss %2.4f, %2.4f, %2.4f, %2.4f' % (
        np.mean(metrics[:, 0]),
        np.mean(metrics[:, 1]),
        np.mean(metrics[:, 2]),
        np.mean(metrics[:, 3]),
        np.mean(metrics[:, 4]),
        np.mean(metrics[:, 5])))
    print()

@SirMwan

SirMwan commented Jul 17, 2021

In the data file also,

...
else:
    imgs = np.load(self.filenames[idx])
    bboxes = self.sample_bboxes[idx]
    nz, nh, nw = imgs.shape[1:]
    pz = int(np.ceil(float(nz) / self.stride)) * self.stride
    ph = int(np.ceil(float(nh) / self.stride)) * self.stride
    pw = int(np.ceil(float(nw) / self.stride)) * self.stride
    imgs = np.pad(imgs, [[0, 0], [0, pz - nz], [0, ph - nh], [0, pw - nw]], 'constant', constant_values=self.pad_value)

    xx, yy, zz = np.meshgrid(np.linspace(-0.5, 0.5, int(imgs.shape[1] / self.stride)),  # added int
                             np.linspace(-0.5, 0.5, int(imgs.shape[2] / self.stride)),  # added int
                             np.linspace(-0.5, 0.5, int(imgs.shape[3] / self.stride)), indexing='ij')  # added int
    coord = np.concatenate([xx[np.newaxis, ...], yy[np.newaxis, ...], zz[np.newaxis, :]], 0).astype('float32')
    imgs, nzhw = self.split_comber.split(imgs)
    coord2, nzhw2 = self.split_comber.split(coord,
                                            side_len=int(self.split_comber.side_len / self.stride),
                                            max_stride=int(self.split_comber.max_stride / self.stride),
                                            margin=int(self.split_comber.margin / self.stride))
    assert np.all(nzhw == nzhw2)
    imgs = (imgs.astype(np.float32) - 128) / 128
    return torch.from_numpy(imgs), bboxes, torch.from_numpy(coord2), np.array(nzhw)
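
For context on why those int casts are needed: in Python 3 the `/` operator always returns a float, and a float cannot be used where an integer count is expected. A minimal sketch in plain Python, with illustrative values:

```python
side, stride = 128, 4

as_float = side / stride  # true division: a float in Python 3
as_int = side // stride   # floor division: an int

try:
    range(as_float)       # a float is not accepted as a count
    float_ok = True
except TypeError:
    float_ok = False

print(type(as_float).__name__, type(as_int).__name__, float_ok)
```

When the size is an exact multiple of the stride, `int(a / b)` and `a // b` give the same value here, so either form works for these sizes.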


4 participants