Error with MQCNNEstimator in benchmark_m4 examples #1405

Open
pbruneau opened this issue Apr 19, 2021 · 7 comments
Labels
bug Something isn't working

Comments

@pbruneau

Description

Using m4_weekly, for example, MQCNNEstimator yields a NaN loss for every batch:

WARNING:gluonts.trainer:Batch [1] of Epoch[0] gave NaN loss and it will be ignored

At the end of the epoch, the program crashes with the error:

gluonts.core.exception.GluonTSUserError: Got NaN in first epoch. Try reducing initial learning rate.

With hybridize=False, these errors no longer show up, but training exhibits oddly large loss values (in the range of a few hundred). As a reference, DeepAREstimator yields consistent loss values (below 10), regardless of whether hybridize is True or False.

On a side note, I wonder to what extent this bug is related to #833 .

To Reproduce

A simplified and adapted version of benchmark_m4 is in this gist.
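
For reference, here is a minimal sketch of the kind of script involved (hyperparameters chosen to match the log below; the exact gist contents are not reproduced here):

import mxnet as mx
from gluonts.dataset.repository.datasets import get_dataset
from gluonts.model.seq2seq import MQCNNEstimator
from gluonts.mx.trainer import Trainer

dataset = get_dataset("m4_weekly")

estimator = MQCNNEstimator(
    freq=dataset.metadata.freq,
    prediction_length=dataset.metadata.prediction_length,
    trainer=Trainer(
        ctx=mx.gpu(),  # GPU context, consistent with the cuDNN autotune message below
        epochs=100,
        num_batches_per_epoch=50,
        hybridize=True,  # NaN losses appear; with hybridize=False they do not
    ),
)

# Fails with GluonTSUserError at the end of the first epoch, as in the trace below
predictor = estimator.train(dataset.train)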

Error message or code output

evaluating MQCNNEstimator on m4_weekly
learning rate from ``lr_scheduler`` has been overwritten by ``learning_rate`` in optimizer.
  0%|          | 0/50 [00:00<?, ?it/s]
/usr/local/lib/python3.6/dist-packages/gluonts/time_feature/_base.py:121: FutureWarning: weekofyear and week have been deprecated, please use DatetimeIndex.isocalendar().week instead, which returns a Series.  To exactly reproduce the behavior of week and weekofyear and return an Index, you may call pd.Int64Index(idx.isocalendar().week)
  return index.weekofyear / 51.0 - 0.5
[06:49:54] src/operator/nn/./cudnn/./cudnn_algoreg-inl.h:97: Running performance tests to find the best convolution algorithm, this can take a while... (set the environment variable MXNET_CUDNN_AUTOTUNE_DEFAULT to 0 to disable)
WARNING:gluonts.trainer:Batch [1] of Epoch[0] gave NaN loss and it will be ignored
WARNING:gluonts.trainer:Batch [2] of Epoch[0] gave NaN loss and it will be ignored
[... identical NaN-loss warnings, interleaved with progress updates, repeated for batches 3 through 50 ...]
100%|######################################################################################################| 50/50 [00:01<00:00, 26.63it/s, epoch=1/100, avg_epoch_loss=nan]
Traceback (most recent call last):
  File "benchmark_m4_sample.py", line 89, in <module>
    evaluate("m4_weekly", "MQCNNEstimator")
  File "benchmark_m4_sample.py", line 69, in evaluate
    predictor = estimator.train(dataset.train)
  File "/usr/local/lib/python3.6/dist-packages/gluonts/model/estimator.py", line 276, in train
    **kwargs,
  File "/usr/local/lib/python3.6/dist-packages/gluonts/model/estimator.py", line 249, in train_model
    validation_iter=validation_data_loader,
  File "/usr/local/lib/python3.6/dist-packages/gluonts/mx/trainer/_base.py", line 392, in __call__
    "Got NaN in first epoch. Try reducing initial learning rate."
gluonts.core.exception.GluonTSUserError: Got NaN in first epoch. Try reducing initial learning rate.

Environment

  • Operating system: Ubuntu 18.04
  • Python version: 3.6.9
  • GluonTS version: 0.6.7
  • MXNet version: 1.7.0
  • CUDA: 10.1
  • CUDNN: 7
pbruneau added the bug label on Apr 19, 2021
@lostella
Contributor

Hey @pbruneau, thanks for submitting this!

One note:

With hybridize=False, these errors no longer show up, but training exhibits oddly large loss values (in the range of a few hundred). As a reference, DeepAREstimator yields consistent loss values (below 10), regardless of whether hybridize is True or False.

This is not necessarily an issue, since DeepAREstimator and MQCNNEstimator optimize rather different losses: the former optimizes negative log-likelihood wrt some parametric distribution family, the latter optimizes the average quantile loss over the grid of quantile levels it is configured with.
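
For a sense of scale, the per-quantile (pinball) loss that MQCNN-style models average over their quantile grid looks roughly like this generic sketch (illustrative numbers, not GluonTS's exact implementation):

def quantile_loss(y, y_hat, q):
    # Pinball loss for a single quantile level q in (0, 1)
    diff = y - y_hat
    return max(q * diff, (q - 1) * diff)

# Averaged over a grid of quantile levels, with made-up values:
quantiles = [0.1, 0.5, 0.9]
forecasts = {0.1: 90.0, 0.5: 115.0, 0.9: 140.0}
y = 120.0
avg = sum(quantile_loss(y, forecasts[q], q) for q in quantiles) / len(quantiles)

Because this loss is on the scale of the data itself, rather than a log-likelihood, values in the hundreds are plausible for unscaled series.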

@pbruneau
Author

Hi @lostella, thank you for the quick answer and explanation! After checking the final results, the metrics I eventually obtain do indeed look reasonable.

I guess the only remaining issue, then, is the NaN loss values showing up with MQCNNEstimator and hybridize=True.

lostella added this to the v0.8 milestone on May 10, 2021
@lostella
Contributor

Hey @pbruneau, I'm sorry to get back to this so late: I could reproduce the issue on gluonts master using mxnet-cu101==1.7.0, and upgrading to mxnet-cu101==1.8.0 solved it (MQCNN trains fine with hybridize=True).
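
For reference, the upgrade amounts to something like the following (using the same package index as in the install commands quoted further down):

pip install --upgrade mxnet-cu101==1.8.0 -f https://dist.mxnet.io/python/all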

Could you confirm?

@pbruneau
Author

Yes, it looks all right now, thanks for the feedback @lostella!

I'm closing the issue.

@Mayalin11

Hi, I am facing the same issue even though I am using MQCNNEstimator with mxnet-cu101==1.8.0.

Here are the trainer and estimator parameters:

!pip install git+https://github.com/awslabs/gluon-ts.git
!pip install mxnet-cu101==1.8.0 -f https://dist.mxnet.io/python/all

# Imports assumed from context (gluonts master at the time); they were not
# shown in the original snippet:
import mxnet as mx
from gluonts.model.seq2seq import MQCNNEstimator
from gluonts.mx.distribution import StudentTOutput
from gluonts.mx.trainer import Trainer

ctx = mx.gpu()  # `ctx` was not shown in the original snippet; a GPU context is assumed

trainer = Trainer(
    batch_size=32,          # also tried with 64
    clip_gradient=10.0,
    ctx=ctx,
    epochs=20,
    hybridize=False,
    init="xavier",
    learning_rate=0.0001,   # also tried with 0.001
    # learning_rate_decay_factor=0.05,
    # minimum_learning_rate=5e-05,
    num_batches_per_epoch=50,
    patience=10,
    # weight_decay=1e-08,
)

estimator = MQCNNEstimator(
    freq="M",
    prediction_length=prediction_len,    # `prediction_len` is defined elsewhere in the notebook
    context_length=prediction_len * 2,
    trainer=trainer,
    use_feat_static_cat=False,
    use_feat_dynamic_real=True,
    scaling=False,                       # also tried with True
    # cardinality=stat_cat_cardinalities,
    distr_output=StudentTOutput(),
    channels_seq=[30, 30, 30],
    dilation_seq=[1, 3, 5],
)

I also tried hybridize and scaling both True and False, and various learning rates. I am still getting this error:

/usr/local/lib/python3.7/dist-packages/gluonts/dataset/common.py:324: FutureWarning: The 'freq' argument in Timestamp is deprecated and will be removed in a future version.
  timestamp = pd.Timestamp(timestamp_input, freq=freq)
/usr/local/lib/python3.7/dist-packages/gluonts/dataset/common.py:327: FutureWarning: Timestamp.freq is deprecated and will be removed in a future version
  if isinstance(timestamp.freq, Tick):
/usr/local/lib/python3.7/dist-packages/gluonts/dataset/common.py:338: FutureWarning: Timestamp.freq is deprecated and will be removed in a future version
  return timestamp.freq.rollforward(timestamp)
/usr/local/lib/python3.7/dist-packages/gluonts/transform/split.py:36: FutureWarning: Timestamp.freq is deprecated and will be removed in a future version
  return _shift_timestamp_helper(ts, ts.freq, offset)
/usr/local/lib/python3.7/dist-packages/gluonts/transform/feature.py:352: FutureWarning: Timestamp.freq is deprecated and will be removed in a future version
  self._min_time_point, self._max_time_point, freq=start.freq
  0%|          | 0/50 [00:00<?, ?it/s]
Batch [1] of Epoch[0] gave NaN loss and it will be ignored
/usr/local/lib/python3.7/dist-packages/gluonts/transform/split.py:36: FutureWarning: Timestamp.freq is deprecated and will be removed in a future version
  return _shift_timestamp_helper(ts, ts.freq, offset)
[... identical NaN-loss warnings, interleaved with progress updates, repeated for batches 2 through 50 ...]
100%|██████████| 50/50 [00:30<00:00,  1.65it/s, epoch=1/20, avg_epoch_loss=nan]
---------------------------------------------------------------------------
GluonTSUserError                          Traceback (most recent call last)
<ipython-input-10-6a21663e1979> in <module>()
      1 predictor_2019_12_t, test_data_2019_12 = model(prediction_len = prediction_len, all_df=all_df_2019_12, distr_output = StudentTOutput(),
----> 2                                                epochs=20, num_batches_per_epoch=50, learning_rate=0.0001, scaling=False)

7 frames
/usr/local/lib/python3.7/dist-packages/gluonts/mx/trainer/learning_rate_scheduler.py in on_epoch_end(self, epoch_no, epoch_loss, training_network, trainer, best_epoch_info, ctx)
    231             if best_epoch_info["epoch_no"] == -1:
    232                 raise GluonTSUserError(
--> 233                     "Got NaN in first epoch. Try reducing initial learning rate."
    234                 )
    235 

GluonTSUserError: Got NaN in first epoch. Try reducing initial learning rate.

Thanks in advance!

@UmutAlihan

I am having the same issue; any help is appreciated!

lostella reopened this on Apr 24, 2022
@lostella
Contributor

If this happens when StudentTOutput is used, the problem may be related to #1894.
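
A quick way to test that hypothesis (an assumption, not a confirmed fix) would be to retrain without the parametric output head, letting MQCNNEstimator fall back to its default quantile output:

# Hypothetical diagnostic: same configuration as in the earlier comment,
# minus StudentTOutput. If #1894 is the culprit, the NaNs should disappear.
estimator = MQCNNEstimator(
    freq="M",
    prediction_length=prediction_len,
    context_length=prediction_len * 2,
    trainer=trainer,
    use_feat_static_cat=False,
    use_feat_dynamic_real=True,
    scaling=False,
    # distr_output=StudentTOutput(),  # removed: suspected source of the NaNs
    channels_seq=[30, 30, 30],
    dilation_seq=[1, 3, 5],
)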
