Getting NaN loss after some time with DeepAR and NegativeBinomial #833

fernandocamargoai · 2020-05-18T14:40:22Z

Description

I have a dataset with 129k time series of demand for products within some branches of a retailer. I'm having trouble with NaN loss ever since I started using GluonTS. At first, I discovered some NaN and Infinity values within my FEAT_DYNAMIC_REAL. But even after solving those issues, I'm still getting NaN loss after some time. With the new release (which included some improvements on the NegativeBinomial, I'm getting a much lower loss value and it's taking longer for the problem to occur, but it still occurs.

I decided to put a conditional breakpoint in the log_prob() to stop whenever a NaN value is generated and used hybridize=False. It's super strange, but the error didn't happen yet (It's far past the point that it usually happens). I'll wait some more time, but it might be the case that it only happens with hybridize=True? If that's the case, it'd be very odd and problematic, since it runs much slower with hybridize=False.

To Reproduce

Unfortunately, I can't provide the dataset to reproduce the problem. But here are some information:

Number of Time Series: 128.976
TARGET: Count values, with a lot of zeros
FEAT_STATIC_CAT: I'm using categorical features that identify the product (its id, its category, subcategory, and so on) and the branch (its id, its city, state, and so on).
FEAT_DYNAMIC_REAL: I'm using holidays in one-hot encoding, a binary identifying if there's a special price and the relative discount (calculated using the discount and the price).
batch_size: 32
num_batches_per_epoch: 10000
learning_rate: 1e-4
clip_gradient: 1.0
patience: 10
minimum_learning_rate: 5e-5
weight_decay: 1e-8
num_layers: 2
num_cells: 100
embedding_dimension: [2, 7, 20, 50, 10, 30, 60, 80, 200]
context_length: 90
scaling: True
freq: 1D

put code here

Error message or code output

Here is one example where it happened in the first epoch. But sometimes it happened later.

Traceback (most recent call last):
  File "/opt/anaconda3/envs/gazin/lib/python3.6/site-packages/luigi/worker.py", line 199, in run
    new_deps = self._run_get_new_deps()
  File "/opt/anaconda3/envs/gazin/lib/python3.6/site-packages/luigi/worker.py", line 141, in _run_get_new_deps
    task_gen = self.task.run()
  File "/root/work/log_inteligente_ai/gazin/task/model/base.py", line 164, in run
    dict(fit=self.fit, cross_validation=self.cross_validate)[self.training_type]()
  File "/root/work/log_inteligente_ai/gazin/task/model/gluonts.py", line 425, in fit
    predictor = self.train_model()
  File "/root/work/log_inteligente_ai/gazin/task/model/gluonts.py", line 416, in train_model
    train_iter=training_data_loader,
  File "/opt/anaconda3/envs/gazin/lib/python3.6/site-packages/gluonts/trainer/_base.py", line 328, in __call__
    "Got NaN in first epoch. Try reducing initial learning rate."
gluonts.core.exception.GluonTSUserError: Got NaN in first epoch. Try reducing initial learning rate.

Environment

Operating system: Linux
Python version: 3.6
GluonTS version: 0.5.0

The text was updated successfully, but these errors were encountered:

samosun · 2020-05-19T03:02:12Z

maybe lower num_batches_per_epoch? I guess learning_rate is already low but the num_batches_per_epoch is high, so that during the first epoch training, arguments of distribution had numerical problems in calculating log prob.

fernandocamargoai · 2020-05-19T10:54:37Z

@samosun, my logic in increasing the num_batches_per_epoch is that I think each time-series should have a more or less equal chance of having a slice of it being part of each epoch. I mean, if I have a batch size of 32, 128976 / 32 = 4030,5. I'm actually using 4000 of num_batches_per_epoch right now (that number was because I had 300k+ time-series before and now I'm filtering some of them). Doesn't it make sense? Because I imagine if I was using a small num_batches_per_epoch, a given time-series could appear once in the first epoch and appear again several epochs later, which doesn't make sense to me.

And I don't understand it would create numerical problems. Isn't the loss computed per batch, anyway? In my understanding, the gradient descendent doesn't even "know" about the concept of epochs. Only we know because it's when we take the avg_loss and is a possible point of stopping the training.

samosun · 2020-05-19T13:09:09Z

In my understanding, the gradient descendent doesn't even "know" about the concept of epochs.

yes, I think you are right. num_batches_per_epoch is not the root cause.
I thought if arguments mu,alpha in NegativeBinomial was projected into "boundary values" like 0, it will be failed in calculating log prob, but I noticed that they had solved this problem by adding epsilon, see
https://github.com/awslabs/gluon-ts/blob/master/src/gluonts/distribution/neg_binomial.py#L114

so there may be other mysteries to be found if it's reproducible

import mxnet as mx
from gluonts.distribution import NegativeBinomialOutput

distr_output = NegativeBinomialOutput()
distr = distr_output.distribution(
    (mx.nd.array([0]), # if mu=0
     mx.nd.array([1]))
)
x = mx.nd.array([0])
ll = distr.log_prob(x)
print(ll) # output nan

fernandocamargoai · 2020-05-19T15:53:29Z

@samosun, I think I can confirm that it only happens with hybridize=True. I left it training with hybridize=False and it's in the 22th epoch now without NaN. Also, the model converges much faster with hybridize=False (0.147 at first epoch vs 0.28 at 4th epoch and then the NaN happened. There shouldn't be any difference besides speed with hybridize, but it seems that's something else happening.

lostella · 2020-05-20T06:40:13Z

@fernandocamargoti are you using mxnet 1.6.0? And the DeepAREstimator?

It’s very weird indeed that this shows up with hybridize=True only. One place where things could go wrong is this https://github.com/awslabs/gluon-ts/blob/0d963f7dc55ef866d86e33633a28d57dfab33adb/src/gluonts/distribution/neg_binomial.py#L114 but still this should happen with and without hybridization, unless some subtle bug in mxnet is there

lostella · 2020-05-20T06:45:57Z

TARGET: Count values, with a lot of zeros

What’s the fraction of zero values, roughly? It would be nice if we could to reproduce the issue

fernandocamargoai · 2020-05-20T10:38:07Z

@lostella, yes, I'm using DeepAREstimator with mxnet 1.6.0. I'll get the fraction and get back to you today.

Look at how faster the training with hybridize=False converges:

100%|###################################################################################################################################################################| 4000/4000 [58:16<00:00,  1.14it/s, epoch=1/4, avg_epoch_loss=0.219]
2020-05-19 21:53:41,841 : INFO : Epoch[0] Elapsed time 3496.287 seconds
2020-05-19 21:53:41,842 : INFO : Epoch[0] Evaluation metric 'epoch_loss'=0.218906
2020-05-19 21:53:41,853 : INFO : Epoch[1] Learning rate is 0.0001
100%|###################################################################################################################################################################| 4000/4000 [59:31<00:00,  1.12it/s, epoch=2/4, avg_epoch_loss=0.164]
2020-05-19 22:53:13,262 : INFO : Epoch[1] Elapsed time 3571.409 seconds
2020-05-19 22:53:13,262 : INFO : Epoch[1] Evaluation metric 'epoch_loss'=0.164086
2020-05-19 22:53:13,273 : INFO : Epoch[2] Learning rate is 0.0001
100%|###################################################################################################################################################################| 4000/4000 [59:35<00:00,  1.12it/s, epoch=3/4, avg_epoch_loss=0.155]
2020-05-19 23:52:48,752 : INFO : Epoch[2] Elapsed time 3575.479 seconds
2020-05-19 23:52:48,752 : INFO : Epoch[2] Evaluation metric 'epoch_loss'=0.154580
2020-05-19 23:52:48,761 : INFO : Epoch[3] Learning rate is 0.0001
100%|###################################################################################################################################################################| 4000/4000 [59:29<00:00,  1.12it/s, epoch=4/4, avg_epoch_loss=0.151]
2020-05-20 00:52:18,165 : INFO : Epoch[3] Elapsed time 3569.404 seconds
2020-05-20 00:52:18,166 : INFO : Epoch[3] Evaluation metric 'epoch_loss'=0.150852
2020-05-20 00:52:18,180 : INFO : Loading parameters from best epoch (3)
2020-05-20 00:52:18,189 : INFO : Final loss: 0.15085175694245845 (occurred at epoch 3)

Compared to the same training (every single hyperparameter is equal and I have seed in the random, numpy.random and mxnet) with hybridize=True:

100%|██████████| 4000/4000 [22:33<00:00,  2.95it/s, epoch=1/100, avg_epoch_loss=0.348]
2020-05-19 18:29:27,221 : INFO : Epoch[0] Elapsed time 1353.823 seconds
2020-05-19 18:29:27,223 : INFO : Epoch[0] Evaluation metric 'epoch_loss'=0.347671
2020-05-19 18:29:27,245 : INFO : Epoch[1] Learning rate is 0.0001
100%|██████████| 4000/4000 [34:43<00:00,  1.92it/s, epoch=2/100, avg_epoch_loss=0.289]
2020-05-19 19:04:10,699 : INFO : Epoch[1] Elapsed time 2083.454 seconds
2020-05-19 19:04:10,699 : INFO : Epoch[1] Evaluation metric 'epoch_loss'=0.289498
2020-05-19 19:04:10,706 : INFO : Epoch[2] Learning rate is 0.0001
100%|██████████| 4000/4000 [34:49<00:00,  1.91it/s, epoch=3/100, avg_epoch_loss=0.281]
2020-05-19 19:39:00,636 : INFO : Epoch[2] Elapsed time 2089.929 seconds
2020-05-19 19:39:00,636 : INFO : Epoch[2] Evaluation metric 'epoch_loss'=0.280888
2020-05-19 19:39:00,643 : INFO : Epoch[3] Learning rate is 0.0001
100%|██████████| 4000/4000 [44:04<00:00,  1.51it/s, epoch=4/100, avg_epoch_loss=0.275]
2020-05-19 20:23:05,000 : INFO : Epoch[3] Elapsed time 2644.357 seconds
2020-05-19 20:23:05,000 : INFO : Epoch[3] Evaluation metric 'epoch_loss'=0.274745
2020-05-19 20:23:05,007 : INFO : Epoch[4] Learning rate is 0.0001
100%|██████████| 4000/4000 [45:46<00:00,  1.46it/s, epoch=5/100, avg_epoch_loss=0.273]
2020-05-19 21:08:51,684 : INFO : Epoch[4] Elapsed time 2746.677 seconds
2020-05-19 21:08:51,684 : INFO : Epoch[4] Evaluation metric 'epoch_loss'=0.273488
2020-05-19 21:08:51,690 : INFO : Epoch[5] Learning rate is 0.0001
100%|██████████| 4000/4000 [45:50<00:00,  1.45it/s, epoch=6/100, avg_epoch_loss=0.267]
2020-05-19 21:54:41,773 : INFO : Epoch[5] Elapsed time 2750.083 seconds
2020-05-19 21:54:41,773 : INFO : Epoch[5] Evaluation metric 'epoch_loss'=0.266905
2020-05-19 21:54:41,779 : INFO : Epoch[6] Learning rate is 0.0001
100%|██████████| 4000/4000 [46:00<00:00,  1.45it/s, epoch=7/100, avg_epoch_loss=0.269]
2020-05-19 22:40:42,483 : INFO : Epoch[6] Elapsed time 2760.703 seconds
2020-05-19 22:40:42,483 : INFO : Epoch[6] Evaluation metric 'epoch_loss'=0.268658
2020-05-19 22:40:42,483 : INFO : Epoch[7] Learning rate is 0.0001
100%|██████████| 4000/4000 [45:41<00:00,  1.46it/s, epoch=8/100, avg_epoch_loss=0.263]
2020-05-19 23:26:24,179 : INFO : Epoch[7] Elapsed time 2741.696 seconds
2020-05-19 23:26:24,179 : INFO : Epoch[7] Evaluation metric 'epoch_loss'=0.263290
2020-05-19 23:26:24,185 : INFO : Epoch[8] Learning rate is 0.0001
100%|██████████| 4000/4000 [45:52<00:00,  1.45it/s, epoch=9/100, avg_epoch_loss=0.266]
2020-05-20 00:12:16,630 : INFO : Epoch[8] Elapsed time 2752.445 seconds
2020-05-20 00:12:16,630 : INFO : Epoch[8] Evaluation metric 'epoch_loss'=0.265814
2020-05-20 00:12:16,630 : INFO : Epoch[9] Learning rate is 0.0001
100%|██████████| 4000/4000 [46:12<00:00,  1.44it/s, epoch=10/100, avg_epoch_loss=0.262]
2020-05-20 00:58:29,082 : INFO : Epoch[9] Elapsed time 2772.452 seconds
2020-05-20 00:58:29,082 : INFO : Epoch[9] Evaluation metric 'epoch_loss'=0.262476
2020-05-20 00:58:29,088 : INFO : Epoch[10] Learning rate is 0.0001
100%|██████████| 4000/4000 [45:49<00:00,  1.45it/s, epoch=11/100, avg_epoch_loss=0.263]
2020-05-20 01:44:18,254 : INFO : Epoch[10] Elapsed time 2749.166 seconds
2020-05-20 01:44:18,254 : INFO : Epoch[10] Evaluation metric 'epoch_loss'=0.262678
2020-05-20 01:44:18,254 : INFO : Epoch[11] Learning rate is 0.0001
100%|██████████| 4000/4000 [45:42<00:00,  1.46it/s, epoch=12/100, avg_epoch_loss=0.264]
2020-05-20 02:30:00,600 : INFO : Epoch[11] Elapsed time 2742.346 seconds
2020-05-20 02:30:00,600 : INFO : Epoch[11] Evaluation metric 'epoch_loss'=0.264322
2020-05-20 02:30:00,600 : INFO : Epoch[12] Learning rate is 0.0001
100%|██████████| 4000/4000 [46:06<00:00,  1.45it/s, epoch=13/100, avg_epoch_loss=0.261]
2020-05-20 03:16:06,984 : INFO : Epoch[12] Elapsed time 2766.384 seconds
2020-05-20 03:16:06,984 : INFO : Epoch[12] Evaluation metric 'epoch_loss'=0.261147
2020-05-20 03:16:06,990 : INFO : Epoch[13] Learning rate is 0.0001
100%|██████████| 4000/4000 [45:48<00:00,  1.46it/s, epoch=14/100, avg_epoch_loss=0.267]
2020-05-20 04:01:55,199 : INFO : Epoch[13] Elapsed time 2748.209 seconds
2020-05-20 04:01:55,200 : INFO : Epoch[13] Evaluation metric 'epoch_loss'=0.266566
2020-05-20 04:01:55,200 : INFO : Epoch[14] Learning rate is 0.0001
100%|██████████| 4000/4000 [45:56<00:00,  1.45it/s, epoch=15/100, avg_epoch_loss=0.262]
2020-05-20 04:47:51,212 : INFO : Epoch[14] Elapsed time 2756.012 seconds
2020-05-20 04:47:51,212 : INFO : Epoch[14] Evaluation metric 'epoch_loss'=0.261856
2020-05-20 04:47:51,212 : INFO : Epoch[15] Learning rate is 0.0001
100%|██████████| 4000/4000 [46:09<00:00,  1.44it/s, epoch=16/100, avg_epoch_loss=0.26]
2020-05-20 05:34:00,381 : INFO : Epoch[15] Elapsed time 2769.169 seconds
2020-05-20 05:34:00,381 : INFO : Epoch[15] Evaluation metric 'epoch_loss'=0.260374
2020-05-20 05:34:00,387 : INFO : Epoch[16] Learning rate is 0.0001
100%|██████████| 4000/4000 [46:02<00:00,  1.45it/s, epoch=17/100, avg_epoch_loss=0.262]
2020-05-20 06:20:02,540 : INFO : Epoch[16] Elapsed time 2762.153 seconds
2020-05-20 06:20:02,540 : INFO : Epoch[16] Evaluation metric 'epoch_loss'=0.262312
2020-05-20 06:20:02,540 : INFO : Epoch[17] Learning rate is 0.0001
100%|██████████| 4000/4000 [46:05<00:00,  1.45it/s, epoch=18/100, avg_epoch_loss=0.26]
2020-05-20 07:06:08,361 : INFO : Epoch[17] Elapsed time 2765.821 seconds
2020-05-20 07:06:08,361 : INFO : Epoch[17] Evaluation metric 'epoch_loss'=0.260257
2020-05-20 07:06:08,369 : INFO : Epoch[18] Learning rate is 0.0001
  0%|          | 13/4000 [00:10<54:16,  1.22it/s, epoch=19/100, avg_epoch_loss=0.269]2020-05-20 07:06:24,829 : WARNING : Epoch[18] gave nan loss
  0%|          | 13/4000 [00:16<1:24:07,  1.27s/it, epoch=19/100, avg_epoch_loss=0.271]
2020-05-20 07:06:24,829 : INFO : Loading parameters from best epoch (17)
2020-05-20 07:06:24,841 : INFO : Epoch[19] Learning rate is 5e-05
  0%|          | 0/4000 [00:00<?, ?it/s]2020-05-20 07:06:26,563 : WARNING : Epoch[19] gave nan loss
  0%|          | 0/4000 [00:01<?, ?it/s, epoch=20/100, avg_epoch_loss=0.232]
2020-05-20 07:06:26,564 : INFO : Stopping training
2020-05-20 07:06:26,564 : INFO : Loading parameters from best epoch (17)
2020-05-20 07:06:26,570 : INFO : Final loss: 0.2602571310577914 (occurred at epoch 17)
2020-05-20 07:06:26,576 : INFO : End model training

fernandocamargoai · 2020-05-22T17:41:27Z

Hello, @lostella. Sorry for taking this long.

I have 132131585 out of 143037878 (92.37%) of zeros.

To give you more information, this dataset is from a retailer with many branches and many products. I have all the sales registered for each item in each branch. Some of these items might be discontinued (like smartphones) after some time. So, each time-series may start on a different date, but I've normalized all of them to finish on the same date. So, despite the fact that it's really normal for time-series to have must of its values equal to zero, there's also those that will never sell anything again, that contribute to the zeroes.

matthiasanderer · 2020-05-23T12:20:52Z

Did you try switching back to mxnet-cu101mkl==1.4.1 - I think with that mxnet version you can leave out the hybridize False option.

I just stumbled on this on my quest to get DeepAR converging somewhere in the M5 Kaggle challenge ... maybe it helps you @fernandocamargoti ?

In case this works for you too something between mxnet 1.4.1 and 1.6 broke ....

lostella · 2020-05-23T12:37:39Z

So, each time-series may start on a different date, but I've normalized all of them to finish on the same date.

@fernandocamargoti as a side note: that that's not strictly needed, and may bias your predictions towards zero. Are you somehow indicating which values you have manually padded by adding some binary feature that indicates it?

lostella · 2020-05-23T12:59:41Z

@fernandocamargoti also a minor note: the logs you pasted for hybridized=True suggest that you're using num_batches_per_epoch=4000, but you mentioned 10_000 (still, it seems to converge too slow even assuming that you used 10_000 for hybridized=False).

fernandocamargoai · 2020-05-28T18:22:20Z

@matthiasanderer, I'm running on IBM Power 9. If I'm not wrong, MKL is for Intel Processors.

fernandocamargoai · 2020-05-28T18:26:06Z

@lostella, thanks for your side note. But I also my predictions to indicate whether a time series will "die" (the product will lose all its demand). I haven't set up any binary feature to indicate it. If I had, how would I fill this feature in production? I mean, a product time series start dying, and at some point, no more sales are made. It would be hard to decide when to put set this flag to True in production.

fernandocamargoai · 2020-05-28T18:28:28Z

@fernandocamargoti also a minor note: the logs you pasted for hybridized=True suggest that you're using num_batches_per_epoch=4000, but you mentioned 10_000 (still, it seems to converge too slow even assuming that you used 10_000 for hybridized=False).

Yeah, it's another version where I filter time series with a minimum number of days with sales. Previously, I filtered with 10 and I increase to 30 to train faster. Then, I've updated the num_batches_per_epoch accordingly. Notice, though, that both logs have 4000.

lostella · 2020-05-28T18:47:36Z

Oh I see, I hadn’t scrolled all the way to the right in the first log, sorry. I also agree on the zero padding you’re doing, since that’s a behavior you want to model 👍🏻

matthiasanderer · 2020-05-29T07:05:46Z

@matthiasanderer, I'm running on IBM Power 9. If I'm not wrong, MKL is for Intel Processors.

Sorry that was my copy paste of course only version 1.4.1 is the relevant part ... I just used that cuda10.1/mkl version...

fernandocamargoai · 2020-05-30T12:29:42Z

@lostella, I think I've got a way for you guys to reproduce the problem. The same thing happens on m5 dataset (https://www.kaggle.com/allunia/m5-uncertainty).

To make it easier, I've already preprocessed the dataset, putting in the GluonTS format. It's available here: https://github.com/fernandocamargoti/m5_competition_gluonts_dataset

The cardinality is [5, 12, 5, 9, 3051]. It always has +2 of the number of the categories because I leave the first index for unknown and the second for None (there are some aggregated time series where some categorical variables aren't defined).

Also, I've been using my own implementation of Dataset instead of FileDataset from GluonTS. The reason is that the FileDataset was reading all the files to calculate the len() (which is called at the start of the training). It was taking a very long time. So, I wanted to start training and let the file be loaded only when necessary. To calculate the len(), I simply count the number of files, without reading them. So, here is my implementation: https://gist.github.com/fernandocamargoti/0f9c0b390ec44e0239a835cff91ae85c

To separate the train and test set, I use:

self._train_dataset = SameSizeTransformedDataset(
                self.dataset, transformations=[FilterTimeSeriesTransformation(start=0, end=split)])

self._test_dataset = SameSizeTransformedDataset(
                self.dataset, transformations=[FilterTimeSeriesTransformation(start=0, end=split + self.test_steps)])

The SameSizeTransformedDataset is also provided in the gist above. It's only used to avoid reading the whole dataset to calculate the len(), allowing me to lazy load it during the training.

fernandocamargoai · 2020-06-01T13:52:56Z

As suggested by @matthiasanderer, I've tried MXNet 1.4.1 and this weird bug didn't happen. I'll also try 1.5.1 and report back.

fernandocamargoai · 2020-06-02T19:25:22Z

The bug doesn't happen with MXNet 1.5.1. So, it's something that happened when 1.6.0 was introduced.

lostella · 2020-06-02T20:51:52Z

@fernandocamargoti thanks for looking into it, this is really cool (but also not cool that the bug is there).

Did you observe that with 1.4 and 1.5 both NaN didn’t show up and convergence speed was unaffected by hybridize True/False?

fernandocamargoai · 2020-06-02T21:35:39Z

@lostella, both converge much faster and the NaN doesn't happen at all. But I've noticed that a weird bug happens with 1.5:

Traceback (most recent call last):
  File "/home/fernandocamargo/anaconda3/envs/gazin/lib/python3.6/site-packages/luigi/worker.py", line 199, in run
    new_deps = self._run_get_new_deps()
  File "/home/fernandocamargo/anaconda3/envs/gazin/lib/python3.6/site-packages/luigi/worker.py", line 141, in _run_get_new_deps
    task_gen = self.task.run()
  File "/home/fernandocamargo/datascience_workspace/log_inteligente_ai/gazin/task/model/gluonts.py", line 685, in run
    test_forecast = cast(SampleForecast, next(iter(forecast_it)))
  File "/home/fernandocamargo/anaconda3/envs/gazin/lib/python3.6/site-packages/gluonts/model/predictor.py", line 330, in predict
    num_samples=num_samples,
  File "/home/fernandocamargo/anaconda3/envs/gazin/lib/python3.6/site-packages/gluonts/model/forecast_generator.py", line 197, in __call__
    outputs = prediction_net(*inputs).asnumpy()
  File "/home/fernandocamargo/anaconda3/envs/gazin/lib/python3.6/site-packages/mxnet/gluon/block.py", line 548, in __call__
    out = self.forward(*args)
  File "/home/fernandocamargo/anaconda3/envs/gazin/lib/python3.6/site-packages/mxnet/gluon/block.py", line 925, in forward
    return self.hybrid_forward(ndarray, x, *args, **params)
  File "/home/fernandocamargo/anaconda3/envs/gazin/lib/python3.6/site-packages/gluonts/model/deepar/_network.py", line 608, in hybrid_forward
    begin_states=state,
  File "/home/fernandocamargo/anaconda3/envs/gazin/lib/python3.6/site-packages/gluonts/model/deepar/_network.py", line 541, in sampling_decoder
    new_samples = distr.sample(dtype=self.dtype)
  File "/home/fernandocamargo/anaconda3/envs/gazin/lib/python3.6/site-packages/gluonts/distribution/neg_binomial.py", line 98, in sample
    s, mu=self.mu, alpha=self.alpha, num_samples=num_samples
  File "/home/fernandocamargo/anaconda3/envs/gazin/lib/python3.6/site-packages/gluonts/distribution/distribution.py", line 339, in _sample_multiple
    samples = sample_func(*args_expanded, **kwargs_expanded)
  File "/home/fernandocamargo/anaconda3/envs/gazin/lib/python3.6/site-packages/gluonts/distribution/neg_binomial.py", line 94, in s
    x = F.minimum(F.random.gamma(r, theta), 1e6)
  File "/home/fernandocamargo/anaconda3/envs/gazin/lib/python3.6/site-packages/mxnet/ndarray/random.py", line 383, in gamma
    [alpha, beta], shape, dtype, ctx, out, kwargs)
  File "/home/fernandocamargo/anaconda3/envs/gazin/lib/python3.6/site-packages/mxnet/ndarray/random.py", line 38, in _random_helper
    return sampler(*params, shape=shape, dtype=dtype, out=out, **kwargs)
  File "<string>", line 68, in _sample_gamma
  File "/home/fernandocamargo/anaconda3/envs/gazin/lib/python3.6/site-packages/mxnet/_ctypes/ndarray.py", line 92, in _imperative_invoke
    ctypes.byref(out_stypes)))
  File "/home/fernandocamargo/anaconda3/envs/gazin/lib/python3.6/site-packages/mxnet/base.py", line 253, in check_call
    raise MXNetError(py_str(_LIB.MXGetLastError()))
mxnet.base.MXNetError: vector::_M_range_insert

I happens both when I try to run the evaluation and if I load a model I try to use it. So, I'm sticking with 1.4.1 for now.

lostella · 2020-06-08T16:56:52Z

@fernandocamargoti the issue you’re facing with mxnet 1.5 is a known one on Linux: apache/mxnet#16135

This was fixed in 1.6 but the fix was not backported to 1.5 I believe. This is also why we skipped support for 1.5 in GluonTS, and directly jumped to 1.6

lostella · 2020-06-08T18:50:02Z

@fernandocamargoti did you mention you were running on GPU? In that case this may be related: #728

fernandocamargoai · 2020-06-15T17:47:15Z

@lostella, it could definitely be related. But #728 only fixes for binned distribution, right? I guess the same problem happens with other distributions, like Negative Binomial that I'm using.

lostella · 2021-01-12T20:37:00Z

I'm wondering whether this still shows up with MXNet 1.7? (since that fixed the bug concerning #728) @fernandocamargoti not sure you're able to verify this, otherwise we'll try to reproduce with your instructions above

fernandocamargoai · 2021-02-08T12:30:22Z

Interesting, @GabrielDeza. My dynamic features are normalized as well. They're too important for my prediction for me to simply remove them, though. I also noticed that the resulting model that was trained after the NaNs, even though the logs say that the batch will be ignored, also result in NaNs in production. I didn't want to downgrade everything again after going through the hassle of upgrading it. Especially because MXNet 1.4.1 is very old now and requires an old version of numpy. But it seems that I, unfortunately, have no choice. It's very frustrating.

fernandocamargoai · 2021-03-02T12:22:53Z

Hello, @lostella. I have an update about this issue. I've used pytorch-ts, which is a library on top of GluonTS that reimplements DeepAR and other models in PyTorch and it works nicely.

It might be good to give it a try as well, @GabrielDeza.

lostella · 2021-03-02T13:40:44Z

@fernandocamargoti thanks for the update!

The first thing that comes to mind is that pytorch-ts uses directly the distribution classes provided by PyTorch: negative binomial there uses a slightly different parametrization, which may be responsible for better numerical stability -- we should look into this.

It also looks like MXNet's negative binomial uses the same parametrization. The best thing would be to rely directly on MXNet's distributions of course, as soon as they're out (v2.0 I believe).

For reference:

fernandocamargoai · 2021-03-09T11:19:50Z

Well, I've just got a NaN using the PyTorch implementation. When trying to use the model, I've got the following error:

  File "/home/fernandocamargo/anaconda3/envs/logit/lib/python3.7/site-packages/gluonts/model/forecast_generator.py", line 174, in __call__
    outputs = predict_to_numpy(prediction_net, inputs)
  File "/home/fernandocamargo/anaconda3/envs/logit/lib/python3.7/functools.py", line 840, in wrapper
    return dispatch(args[0].__class__)(*args, **kw)
  File "/home/fernandocamargo/anaconda3/envs/logit/lib/python3.7/site-packages/gluonts/torch/model/predictor.py", line 38, in _
    return prediction_net(*inputs).cpu().numpy()
  File "/home/fernandocamargo/anaconda3/envs/logit/lib/python3.7/site-packages/torch/nn/modules/module.py", line 727, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/home/fernandocamargo/anaconda3/envs/logit/lib/python3.7/site-packages/pts/model/deepar/deepar_network.py", line 456, in forward
    begin_states=state,
  File "/home/fernandocamargo/anaconda3/envs/logit/lib/python3.7/site-packages/pts/model/deepar/deepar_network.py", line 395, in sampling_decoder
    new_samples = distr.sample()
  File "/home/fernandocamargo/anaconda3/envs/logit/lib/python3.7/site-packages/torch/distributions/negative_binomial.py", line 86, in sample
    return torch.poisson(rate)
RuntimeError: invalid Poisson rate, expected rate to be non-negative

I know that it's from a different implementation, but it might be useful somehow. I've debugged the trained model and the decoder_input contains NaN values. After looking further, I've seen that static_feat contains mostly NaN values. So, I debugger the self.unroll_encoder() and surprisingly, it was the embedded that was returning NaNs. Looking into the weights of the nn.Embeddings, there was indeed some NaN values there.

I'll try to train again with torch.auto_grad.detect_anomaly() to try to discover what causes it.

fernandocamargoai · 2021-03-09T21:31:09Z

Well, all I got was:

Traceback (most recent call last):
  File "/opt/anaconda3/envs/logit/lib/python3.7/site-packages/luigi/worker.py", line 191, in run
    new_deps = self._run_get_new_deps()
  File "/opt/anaconda3/envs/logit/lib/python3.7/site-packages/luigi/worker.py", line 133, in _run_get_new_deps
    task_gen = self.task.run()
  File "/root/work/log_inteligente_ai/logit/task/model/base.py", line 167, in run
    dict(fit=self.fit, cross_validation=self.cross_validate)[self.training_type]()
  File "/root/work/log_inteligente_ai/logit/task/model/gluonts.py", line 428, in fit
    prefetch_factor=self.num_prefetch,
  File "/opt/anaconda3/envs/logit/lib/python3.7/site-packages/pts/model/estimator.py", line 145, in train_model
    validation_iter=validation_data_loader,
  File "/opt/anaconda3/envs/logit/lib/python3.7/site-packages/pts/trainer.py", line 86, in __call__
    loss.backward()
  File "/opt/anaconda3/envs/logit/lib/python3.7/site-packages/torch/tensor.py", line 221, in backward
    torch.autograd.backward(self, gradient, retain_graph, create_graph)
  File "/opt/anaconda3/envs/logit/lib/python3.7/site-packages/torch/autograd/__init__.py", line 132, in backward
    allow_unreachable=True)  # allow_unreachable flag
RuntimeError: Function 'CudnnRnnBackward' returned nan values in its 0th output.

fernandocamargoai · 2021-03-09T21:43:25Z

I don't know if it might be useful or not, but my loss starts increasing before the error:

18095it [31:13, 9.66it/s, avg_epoch_loss=0.0714, epoch=1]
18095it [30:52, 9.77it/s, avg_epoch_loss=0.0712, epoch=2]
18095it [30:51, 9.77it/s, avg_epoch_loss=0.0717, epoch=3]
18095it [30:43, 9.81it/s, avg_epoch_loss=0.0713, epoch=4]
18095it [31:06, 9.70it/s, avg_epoch_loss=0.075, epoch=5]
18095it [30:47, 9.79it/s, avg_epoch_loss=0.0762, epoch=6]
18095it [30:48, 9.79it/s, avg_epoch_loss=0.0848, epoch=7]
18095it [31:16, 9.64it/s, avg_epoch_loss=0.115, epoch=8]
18095it [30:58, 9.74it/s, avg_epoch_loss=0.262, epoch=9]
18095it [31:07, 9.69it/s, avg_epoch_loss=0.183, epoch=10]

fernandocamargoai · 2021-03-10T10:41:26Z

Actually, the loss might be rising because the pytorch-ts uses OneCycleLR. But it seems that the problem is simply exploding gradients. I'm trying to reduce the gradient clipping value and I'll also try to reduce the LR.

fernandocamargoai · 2021-03-11T11:36:18Z

Reducing the clip_gradient value solved the issue for me with pytorch-ts. It was probably happening because of the increase of the LR with OneCycleLR and how fragile a LSTM is to exploding gradients. I don't know if it would be the same problem for the MXNet implementation of gluonts, though, since it doesn't use OneCycleLR and it didn't happen with a previous MXNet version.

vidarsumo · 2021-06-10T12:20:37Z

Same problem with MXNet implementation. I’ve reduced the learning rate and sometimes it help sometimes not. Maybe I’m not reducing it enough? Is there a way for me to know what clip_gradient to choose or is this a trial and error process?

borchero · 2021-07-08T19:32:09Z

For reference, this issue also occurs on master with the Car Parts, COVID Deaths and Hospital datasets (the latter two are rather small) from the Monash Forecasting Repository.

borchero · 2021-07-08T19:39:23Z

It might be quicker to investigate this issue with the SimpleFeedforwardEstimator though -- this seems to be the same issue and also appears with the taxi_30min and wiki-rolling_nips datasets.

superkido511 · 2021-07-09T03:20:52Z

I also encountered NaN loss when using DeepVAREstimator (mxnet version) on this toy dataset. This only happens when setting both hybridize=False and cell_type="lstm". When I try hybridize=True or change cell_type, the issue does not happen

rng = pd.date_range('2015-02-24', periods=30, freq='T')

df = pd.DataFrame({
    'Date': rng, 
    'Val': np.random.randn(len(rng)),
    'Val2': np.random.randn(len(rng)),
})

Hope this help!

Alit10 · 2022-01-04T17:04:12Z

@fernandocamargoai Did you manage to find a fix for this issue ? I have the same problem when I try to use the NegativeBinomialOutput even with a strong gradient clipping.

Thanks,

fernandocamargoai · 2022-01-04T19:09:08Z

@fernandocamargoai Did you manage to find a fix for this issue ? I have the same problem when I try to use the NegativeBinomialOutput even with a strong gradient clipping.

Thanks,

Well, I've been using the PyTorch implementation. It still happens from time to time, depending on the hyperparameters. What seems to happen is that the parameters of the distribution might fall outside of the valid values depending on the hyperparameters. I noticed that some distributions are even more sensitive. But at least the PyTorch implementation seems better overall.

lostella · 2022-02-17T13:14:32Z

Closed by mistake

@fernandocamargoai I suspect that #1796 and this issue have a common root cause, which may have been fixed in #1893. If either of the issues gets closed because of the fix, maybe the other one can get closed too

alishametkari · 2022-06-09T03:29:47Z

I am getting same exception what is the reason here? What is the meaning of this exception?

fernandocamargoai added the bug Something isn't working label May 18, 2020

lostella added this to the v0.6 milestone Jul 18, 2020

jaheba modified the milestones: v0.6, v0.7 Oct 20, 2020

lostella mentioned this issue Jan 7, 2021

Update requirement to mxnet~=1.7 #1073

Merged

lostella added the info required More info about this item are needed to proceed label Feb 9, 2021

lostella modified the milestones: v0.7, v0.8 Feb 9, 2021

pbruneau mentioned this issue Apr 19, 2021

Error with MQCNNEstimator in benchmark_m4 examples #1405

Open

lostella modified the milestones: v0.8, v0.9 Jul 1, 2021

sevstafiev mentioned this issue Nov 25, 2021

NaN losses leads to overfitting #1796

Open

This was referenced Feb 16, 2022

Change negative binomial parametrization to failure count and log-odds #1890

Closed

Fix negative binomial parameter map #1893

Merged

lostella closed this as completed in #1893 Feb 17, 2022

lostella reopened this Feb 17, 2022

lostella modified the milestones: v0.9, Soon™ Aug 12, 2022

shchur mentioned this issue Apr 7, 2023

SimpleFeedForward crashes due to numerical errors on datasets with large fraction of zeros #2790

Closed

Getting NaN loss after some time with DeepAR and NegativeBinomial #833

Getting NaN loss after some time with DeepAR and NegativeBinomial #833

Comments

fernandocamargoai commented May 18, 2020

Description

To Reproduce

Error message or code output

Environment

samosun commented May 19, 2020

fernandocamargoai commented May 19, 2020

samosun commented May 19, 2020 • edited Loading

fernandocamargoai commented May 19, 2020

lostella commented May 20, 2020

lostella commented May 20, 2020

fernandocamargoai commented May 20, 2020

fernandocamargoai commented May 22, 2020

matthiasanderer commented May 23, 2020

lostella commented May 23, 2020

lostella commented May 23, 2020

fernandocamargoai commented May 28, 2020

fernandocamargoai commented May 28, 2020

fernandocamargoai commented May 28, 2020

lostella commented May 28, 2020

matthiasanderer commented May 29, 2020

fernandocamargoai commented May 30, 2020

fernandocamargoai commented Jun 1, 2020

fernandocamargoai commented Jun 2, 2020

lostella commented Jun 2, 2020

fernandocamargoai commented Jun 2, 2020

lostella commented Jun 8, 2020

lostella commented Jun 8, 2020

fernandocamargoai commented Jun 15, 2020

lostella commented Jan 12, 2021 • edited Loading

fernandocamargoai commented Feb 8, 2021

fernandocamargoai commented Mar 2, 2021

lostella commented Mar 2, 2021

fernandocamargoai commented Mar 9, 2021 • edited Loading

fernandocamargoai commented Mar 9, 2021

fernandocamargoai commented Mar 9, 2021

fernandocamargoai commented Mar 10, 2021

fernandocamargoai commented Mar 11, 2021

vidarsumo commented Jun 10, 2021

borchero commented Jul 8, 2021

borchero commented Jul 8, 2021

superkido511 commented Jul 9, 2021 • edited Loading

Alit10 commented Jan 4, 2022

fernandocamargoai commented Jan 4, 2022

lostella commented Feb 17, 2022 • edited Loading

alishametkari commented Jun 9, 2022

samosun commented May 19, 2020 •

edited

Loading

lostella commented Jan 12, 2021 •

edited

Loading

fernandocamargoai commented Mar 9, 2021 •

edited

Loading

superkido511 commented Jul 9, 2021 •

edited

Loading

lostella commented Feb 17, 2022 •

edited

Loading