-
I have implemented something like the following example in the docs, where I train by providing context inputs (x) and either true or fake targets (y). The BCE decreases for both the training and holdout sets. However, if I add a Dense layer using the decoder params and calculate the SoftmaxCE, this value increases over time for both the training and holdout sets. How can it be that the network learns to discriminate between real and noise (context, target) pairs, but using the NCEDense params in a Dense layer to feed a softmax gives a loss that gets worse?
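The check that gets worse looks roughly like this (a minimal sketch, assuming the NCE decoder exposes a `(vocab, dim)` weight matrix and a `(vocab,)` bias; the function name is made up):

```python
import mxnet as mx
from mxnet import gluon

def full_softmax_ce(hidden, labels, decoder_weight, decoder_bias):
    """Score the hidden state against *every* class using the weights learned
    by the NCE objective, then measure softmax cross entropy on the true label.
    hidden: (batch, dim), decoder_weight: (vocab, dim), decoder_bias: (vocab,)."""
    logits = mx.nd.dot(hidden, decoder_weight, transpose_b=True) + decoder_bias
    loss_fn = gluon.loss.SoftmaxCrossEntropyLoss()
    return loss_fn(logits, labels).mean()
```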
-
@zjost thanks for posting the code sample. Are you saying that in this example you wrote your own candidate sampler to produce the negatives?
-
@eric-haibin-lin that's exactly right: I had to implement my own sampler because I couldn't find any pre-made implementations that generated the tuple of three tensors needed for the NCE block.
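The custom sampler is roughly shaped like this (a sketch rather than the actual code; the unigram distribution and the ordering of the returned tensors are assumptions):

```python
import numpy as np
import mxnet as mx

class SimpleUnigramSampler:
    """Draws `num_sampled` negatives per call and returns the three tensors an
    NCE-style block consumes: sampled ids, the expected count of each sampled
    id, and the expected count of the true class (sampling with replacement)."""

    def __init__(self, class_probs, num_sampled):
        self.probs = np.asarray(class_probs, dtype='float64')
        self.probs /= self.probs.sum()
        self.num_sampled = num_sampled

    def __call__(self, true_classes):
        sampled = np.random.choice(len(self.probs), size=self.num_sampled,
                                   replace=True, p=self.probs)
        # expected number of occurrences over `num_sampled` draws is p * k
        exp_count_sampled = self.probs[sampled] * self.num_sampled
        true = (true_classes.asnumpy() if hasattr(true_classes, 'asnumpy')
                else np.asarray(true_classes)).astype('int64')
        exp_count_true = self.probs[true] * self.num_sampled
        return (mx.nd.array(sampled),
                mx.nd.array(exp_count_sampled),
                mx.nd.array(exp_count_true))
```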
-
@zjost Got it. Are you returning the probability of candidates, or the expected count of candidates? The NCEBlock expects the expected count as inputs. Can you try returning the expected count instead? You could also try one of the pre-made candidate samplers rather than the custom one.
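Concretely, under sampling with replacement the two differ only by a scale factor: with `num_sampled` draws, a class with sampling probability `p` has expected count roughly `p * num_sampled`. A tiny sketch (the value 64 is arbitrary):

```python
import numpy as np

num_sampled = 64                          # assumed number of negatives per step
prob = np.array([0.1, 0.01, 0.001])       # per-class sampling probabilities
expected_count = prob * num_sampled       # what the NCE block expects as input
```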
-
@eric-haibin-lin I have re-implemented everything on a new, simpler problem and am getting the same behavior. I have made the correction to return the expected count. As a comparison, I also trained a network with a Dense layer decoder and directly minimized the softmax loss without any negative sampling. On the same dataset, that loss decreases as expected, which at least shows the learning task is possible on this data. Let me share some code.
NCE network:
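Something along these lines (a simplified sketch; the real network uses different sizes and the custom sampler above, so treat the names here as placeholders):

```python
import mxnet as mx
from mxnet import gluon

class NCENet(gluon.Block):
    """Sketch: an embedding encoder plus output embeddings that play the role
    of the NCE decoder weight/bias. Only the true target and the sampled
    negatives are scored on each step."""

    def __init__(self, vocab_size, embed_dim, **kwargs):
        super().__init__(**kwargs)
        self.encoder = gluon.nn.Embedding(vocab_size, embed_dim)
        self.decoder_w = gluon.nn.Embedding(vocab_size, embed_dim)  # (vocab, dim)
        self.decoder_b = gluon.nn.Embedding(vocab_size, 1)          # (vocab, 1)

    def forward(self, context, candidates):
        # context: (batch,) ids; candidates: (batch, k) with the true class first
        h = self.encoder(context)                                   # (batch, dim)
        w = self.decoder_w(candidates)                               # (batch, k, dim)
        b = self.decoder_b(candidates).squeeze(axis=2)               # (batch, k)
        logits = mx.nd.batch_dot(w, h.expand_dims(axis=2)).squeeze(axis=2)
        return logits + b                                            # for sigmoid BCE
```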
Training:
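And the loop, roughly (toy data and uniform negatives just to keep the snippet self-contained and runnable; the real run uses the unigram sampler and real (context, target) pairs):

```python
import numpy as np
import mxnet as mx
from mxnet import autograd, gluon

vocab_size, embed_dim, num_sampled, batch_size = 1000, 64, 16, 32

# toy (context, target) pairs purely so the snippet runs end to end
X = mx.nd.array(np.random.randint(0, vocab_size, size=2048))
Y = mx.nd.array(np.random.randint(0, vocab_size, size=2048))
train_data = gluon.data.DataLoader(gluon.data.ArrayDataset(X, Y),
                                   batch_size=batch_size, last_batch='discard')

net = NCENet(vocab_size, embed_dim)
net.initialize(mx.init.Xavier())
trainer = gluon.Trainer(net.collect_params(), 'adam', {'learning_rate': 1e-3})
bce = gluon.loss.SigmoidBinaryCrossEntropyLoss(from_sigmoid=False)

for epoch in range(5):
    total = 0.0
    for context, target in train_data:
        # uniform negatives here; the real setup draws them from the sampler
        negatives = mx.nd.array(np.random.randint(0, vocab_size,
                                                  size=(batch_size, num_sampled)))
        candidates = mx.nd.concat(target.reshape((-1, 1)), negatives, dim=1)
        labels = mx.nd.concat(mx.nd.ones((batch_size, 1)),
                              mx.nd.zeros((batch_size, num_sampled)), dim=1)
        with autograd.record():
            logits = net(context, candidates)
            loss = bce(logits, labels)
        loss.backward()
        trainer.step(batch_size)
        total += loss.mean().asscalar()
    print('epoch %d, mean BCE %.4f' % (epoch, total / len(train_data)))
```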
Am I using the right parameterization here?
-
I have worked on this more and discovered that if I allow training to keep running, the SoftmaxCE eventually starts to improve. It seems that the loss curve always increases for the first several epochs, but eventually starts decreasing.

I'm still curious to understand why this happens, and whether there are better ways to, e.g., schedule the learning rate to get improved convergence. However, I don't think there's an issue with the code/implementation, so this issue can be closed. I apologize for the false alarm.
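For anyone hitting the same behavior: one easy knob is a decaying learning rate via MXNet's built-in schedulers, e.g. (the step size and decay factor below are arbitrary; `net` is the model from the earlier sketch):

```python
import mxnet as mx
from mxnet import gluon

# halve the learning rate every 10k updates (values are illustrative only)
schedule = mx.lr_scheduler.FactorScheduler(step=10000, factor=0.5)
trainer = gluon.Trainer(net.collect_params(), 'adam',
                        {'learning_rate': 1e-3, 'lr_scheduler': schedule})
```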