Error with MQCNNEstimator in benchmark_m4 examples #1405

Open
pbruneau opened this issue Apr 19, 2021 · 7 comments
Labels
bug Something isn't working

Comments

@pbruneau

Description

Using m4_weekly, for example, MQCNNEstimator yields a NaN loss for every batch:

WARNING:gluonts.trainer:Batch [1] of Epoch[0] gave NaN loss and it will be ignored

At the end of the epoch, the program crashes with the error:

gluonts.core.exception.GluonTSUserError: Got NaN in first epoch. Try reducing initial learning rate.

With hybridize=False, these errors no longer show up, but training exhibits oddly large loss values (in the range of a few hundred). As a reference, DeepAREstimator yields consistent loss values (below 10), regardless of whether hybridize is True or False.

On a side note, I wonder to what extent this bug is related to #833 .

To Reproduce

A simplified and adapted version of benchmark_m4 is in this gist.
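
For reference, here is a minimal sketch of the kind of script involved (hyperparameters chosen to match the log below; the exact gist contents are not reproduced here):

import mxnet as mx
from gluonts.dataset.repository.datasets import get_dataset
from gluonts.model.seq2seq import MQCNNEstimator
from gluonts.mx.trainer import Trainer

dataset = get_dataset("m4_weekly")

estimator = MQCNNEstimator(
    freq=dataset.metadata.freq,
    prediction_length=dataset.metadata.prediction_length,
    trainer=Trainer(
        ctx=mx.gpu(),  # GPU context, consistent with the cuDNN autotune message below
        epochs=100,
        num_batches_per_epoch=50,
        hybridize=True,  # NaN losses appear; with hybridize=False they do not
    ),
)

# Fails with GluonTSUserError at the end of the first epoch, as in the trace below
predictor = estimator.train(dataset.train)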

Error message or code output

evaluating MQCNNEstimator on m4_weekly
learning rate from ``lr_scheduler`` has been overwritten by ``learning_rate`` in optimizer.
  0%|          | 0/50 [00:00<?, ?it/s]
/usr/local/lib/python3.6/dist-packages/gluonts/time_feature/_base.py:121: FutureWarning: weekofyear and week have been deprecated, please use DatetimeIndex.isocalendar().week instead, which returns a Series.  To exactly reproduce the behavior of week and weekofyear and return an Index, you may call pd.Int64Index(idx.isocalendar().week)
  return index.weekofyear / 51.0 - 0.5
[06:49:54] src/operator/nn/./cudnn/./cudnn_algoreg-inl.h:97: Running performance tests to find the best convolution algorithm, this can take a while... (set the environment variable MXNET_CUDNN_AUTOTUNE_DEFAULT to 0 to disable)
WARNING:gluonts.trainer:Batch [1] of Epoch[0] gave NaN loss and it will be ignored
WARNING:gluonts.trainer:Batch [2] of Epoch[0] gave NaN loss and it will be ignored
[... identical NaN-loss warnings, interleaved with progress updates, repeated for batches 3 through 50 ...]
100%|######################################################################################################| 50/50 [00:01<00:00, 26.63it/s, epoch=1/100, avg_epoch_loss=nan]
Traceback (most recent call last):
  File "benchmark_m4_sample.py", line 89, in <module>
    evaluate("m4_weekly", "MQCNNEstimator")
  File "benchmark_m4_sample.py", line 69, in evaluate
    predictor = estimator.train(dataset.train)
  File "/usr/local/lib/python3.6/dist-packages/gluonts/model/estimator.py", line 276, in train
    **kwargs,
  File "/usr/local/lib/python3.6/dist-packages/gluonts/model/estimator.py", line 249, in train_model
    validation_iter=validation_data_loader,
  File "/usr/local/lib/python3.6/dist-packages/gluonts/mx/trainer/_base.py", line 392, in __call__
    "Got NaN in first epoch. Try reducing initial learning rate."
gluonts.core.exception.GluonTSUserError: Got NaN in first epoch. Try reducing initial learning rate.

Environment

  • Operating system: Ubuntu 18.04
  • Python version: 3.6.9
  • GluonTS version: 0.6.7
  • MXNet version: 1.7.0
  • CUDA: 10.1
  • CUDNN: 7
pbruneau added the bug label on Apr 19, 2021
@lostella
Contributor

Hey @pbruneau, thanks for submitting this!

One note:

With hybridize=False, these errors no longer show up, but training exhibits oddly large loss values (in the range of a few hundred). As a reference, DeepAREstimator yields consistent loss values (below 10), regardless of whether hybridize is True or False.

This is not necessarily an issue, since DeepAREstimator and MQCNNEstimator optimize rather different losses: the former optimizes negative log-likelihood wrt some parametric distribution family, the latter optimizes the average quantile loss over the grid of quantile levels it is configured with.
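
For a sense of scale, the per-quantile (pinball) loss that MQCNN-style models average over their quantile grid looks roughly like this generic sketch (illustrative numbers, not GluonTS's exact implementation):

def quantile_loss(y, y_hat, q):
    # Pinball loss for a single quantile level q in (0, 1)
    diff = y - y_hat
    return max(q * diff, (q - 1) * diff)

# Averaged over a grid of quantile levels, with made-up values:
quantiles = [0.1, 0.5, 0.9]
forecasts = {0.1: 90.0, 0.5: 115.0, 0.9: 140.0}
y = 120.0
avg = sum(quantile_loss(y, forecasts[q], q) for q in quantiles) / len(quantiles)

Because this loss is on the scale of the data itself, rather than a log-likelihood, values in the hundreds are plausible for unscaled series.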

@pbruneau
Author

Hi @lostella, thank you for the quick answer and explanation! After checking the final results, the metrics I eventually obtain do indeed look reasonable.

I guess the only remaining issue, then, is the NaN loss values showing up with MQCNNEstimator and hybridize=True.

lostella added this to the v0.8 milestone on May 10, 2021
@lostella
Contributor

Hey @pbruneau, I'm sorry to get back to this so late: I could reproduce the issue on gluonts master using mxnet-cu101==1.7.0, and upgrading to mxnet-cu101==1.8.0 solved it (MQCNN trains fine with hybridize=True).
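
For reference, the upgrade amounts to something like the following (using the same package index as in the install commands quoted further down):

pip install --upgrade mxnet-cu101==1.8.0 -f https://dist.mxnet.io/python/all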

Could you confirm?

@pbruneau
Author

Yes, it looks all right now, thanks for the feedback @lostella!

I'm closing the issue.

@Mayalin11

Hi, I am facing the same issue even though I am using MQCNNEstimator with mxnet-cu101==1.8.0.

Here are the trainer and estimator parameters:

!pip install git+https://github.com/awslabs/gluon-ts.git
!pip install mxnet-cu101==1.8.0 -f https://dist.mxnet.io/python/all

# Imports assumed from context (gluonts master at the time); they were not
# shown in the original snippet:
import mxnet as mx
from gluonts.model.seq2seq import MQCNNEstimator
from gluonts.mx.distribution import StudentTOutput
from gluonts.mx.trainer import Trainer

ctx = mx.gpu()  # `ctx` was not shown in the original snippet; a GPU context is assumed

trainer = Trainer(
    batch_size=32,          # also tried with 64
    clip_gradient=10.0,
    ctx=ctx,
    epochs=20,
    hybridize=False,
    init="xavier",
    learning_rate=0.0001,   # also tried with 0.001
    # learning_rate_decay_factor=0.05,
    # minimum_learning_rate=5e-05,
    num_batches_per_epoch=50,
    patience=10,
    # weight_decay=1e-08,
)

estimator = MQCNNEstimator(
    freq="M",
    prediction_length=prediction_len,    # `prediction_len` is defined elsewhere in the notebook
    context_length=prediction_len * 2,
    trainer=trainer,
    use_feat_static_cat=False,
    use_feat_dynamic_real=True,
    scaling=False,                       # also tried with True
    # cardinality=stat_cat_cardinalities,
    distr_output=StudentTOutput(),
    channels_seq=[30, 30, 30],
    dilation_seq=[1, 3, 5],
)

I also tried hybridize and scaling both True and False, and various learning rates. I am still getting this error:

/usr/local/lib/python3.7/dist-packages/gluonts/dataset/common.py:324: FutureWarning: The 'freq' argument in Timestamp is deprecated and will be removed in a future version.
  timestamp = pd.Timestamp(timestamp_input, freq=freq)
/usr/local/lib/python3.7/dist-packages/gluonts/dataset/common.py:327: FutureWarning: Timestamp.freq is deprecated and will be removed in a future version
  if isinstance(timestamp.freq, Tick):
/usr/local/lib/python3.7/dist-packages/gluonts/dataset/common.py:338: FutureWarning: Timestamp.freq is deprecated and will be removed in a future version
  return timestamp.freq.rollforward(timestamp)
/usr/local/lib/python3.7/dist-packages/gluonts/transform/split.py:36: FutureWarning: Timestamp.freq is deprecated and will be removed in a future version
  return _shift_timestamp_helper(ts, ts.freq, offset)
/usr/local/lib/python3.7/dist-packages/gluonts/transform/feature.py:352: FutureWarning: Timestamp.freq is deprecated and will be removed in a future version
  self._min_time_point, self._max_time_point, freq=start.freq
  0%|          | 0/50 [00:00<?, ?it/s]
Batch [1] of Epoch[0] gave NaN loss and it will be ignored
/usr/local/lib/python3.7/dist-packages/gluonts/transform/split.py:36: FutureWarning: Timestamp.freq is deprecated and will be removed in a future version
  return _shift_timestamp_helper(ts, ts.freq, offset)
[... identical NaN-loss warnings, interleaved with progress updates, repeated for batches 2 through 50 ...]
100%|██████████| 50/50 [00:30<00:00,  1.65it/s, epoch=1/20, avg_epoch_loss=nan]
---------------------------------------------------------------------------
GluonTSUserError                          Traceback (most recent call last)
<ipython-input-10-6a21663e1979> in <module>()
      1 predictor_2019_12_t, test_data_2019_12 = model(prediction_len = prediction_len, all_df=all_df_2019_12, distr_output = StudentTOutput(),
----> 2                                                epochs=20, num_batches_per_epoch=50, learning_rate=0.0001, scaling=False)

7 frames
/usr/local/lib/python3.7/dist-packages/gluonts/mx/trainer/learning_rate_scheduler.py in on_epoch_end(self, epoch_no, epoch_loss, training_network, trainer, best_epoch_info, ctx)
    231             if best_epoch_info["epoch_no"] == -1:
    232                 raise GluonTSUserError(
--> 233                     "Got NaN in first epoch. Try reducing initial learning rate."
    234                 )
    235 

GluonTSUserError: Got NaN in first epoch. Try reducing initial learning rate.

Thanks in advance!

@UmutAlihan

I am having the same issue; any help is appreciated!

lostella reopened this on Apr 24, 2022
@lostella
Contributor

If this happens when StudentTOutput is used, the problem may be related to #1894.
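
A quick way to test that hypothesis (an assumption, not a confirmed fix) would be to retrain without the parametric output head, letting MQCNNEstimator fall back to its default quantile output:

# Hypothetical diagnostic: same configuration as in the earlier comment,
# minus StudentTOutput. If #1894 is the culprit, the NaNs should disappear.
estimator = MQCNNEstimator(
    freq="M",
    prediction_length=prediction_len,
    context_length=prediction_len * 2,
    trainer=trainer,
    use_feat_static_cat=False,
    use_feat_dynamic_real=True,
    scaling=False,
    # distr_output=StudentTOutput(),  # removed: suspected source of the NaNs
    channels_seq=[30, 30, 30],
    dilation_seq=[1, 3, 5],
)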
