This model template, grabbed from the current model file at the point where the script hangs, is UnivariateMotif:
{"model_number": 0, "model_name": "UnivariateMotif", "model_param_dict": {"window": 10, "point_method": "weighted_mean", "distance_metric": "sqeuclidean", "k": 5, "max_windows": 10000}, "model_transform_dict": {"fillna": "rolling_mean", "transformations": {"0": "bkfilter", "1": "AlignLastDiff", "2": "AlignLastValue"}, "transformation_params": {"0": {}, "1": {"rows": 90, "displacement_rows": 4, "quantile": 0.9, "decay_span": 90}, "2": {"rows": 1, "lag": 1, "method": "additive", "strength": 0.7, "first_value_only": false}}}}
However, I'm quite sure it's GluonTS that's causing the stalling. I ran this training three times in a row, and it hung at epoch 1/40, 2/40, and 26/40 respectively, for no apparent reason. No error message is thrown, and I can't interrupt the model to make it move on to the next one either (model_interrupt=True). I'm sure it's not a memory or CPU resource issue, because I monitored that on all three runs as well. It just stops somewhere during the 40-epoch iteration; that's all I know.
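For reference, a rough sketch of how I'm invoking AutoTS (simplified; the dataset and most parameter values here are placeholders, not my exact settings):

# minimal sketch of the run; forecast_length, generations etc. are assumed values
from autots import AutoTS

model = AutoTS(
    forecast_length=28,                   # placeholder value
    frequency="infer",
    model_list="default",                 # includes the GluonTS-based models
    max_generations=5,                    # placeholder value
    num_validations=2,                    # placeholder value
    model_interrupt=True,                 # set, but the hang can't be interrupted
    current_model_file="current_model",   # where the template above was read from
    n_jobs=1,
    verbose=1,
)
# model = model.fit(df)  # wide-format DataFrame with a DatetimeIndex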
Here's the console output from the latest run:
INFO:gluonts.trainer:Epoch[149] Learning rate is 4.8828125e-07
INFO:gluonts.trainer:Epoch[149] Elapsed time 18.687 seconds
INFO:gluonts.trainer:Epoch[149] Evaluation metric 'epoch_loss'=-9.647407
INFO:root:Computing averaged parameters.
INFO:root:Loading averaged parameters.
INFO:gluonts.trainer:End model training
WARNING:gluonts.time_feature.seasonality:Multiple 5 does not divide base seasonality 1. Falling back to seasonality 1.
INFO:gluonts.mx.model.wavenet._estimator:Using dilation depth 4 and receptive field length 16
INFO:gluonts.trainer:Start model training
INFO:gluonts.trainer:Epoch[0] Learning rate is 0.001
INFO:gluonts.trainer:Number of parameters in WaveNetTraining: 74749
INFO:gluonts.trainer:Epoch[0] Elapsed time 18.185 seconds
INFO:gluonts.trainer:Epoch[0] Evaluation metric 'epoch_loss'=4.803844
INFO:gluonts.trainer:Epoch[1] Learning rate is 0.001
INFO:gluonts.trainer:Epoch[1] Elapsed time 18.023 seconds
INFO:gluonts.trainer:Epoch[1] Evaluation metric 'epoch_loss'=2.650010
INFO:gluonts.trainer:Epoch[2] Learning rate is 0.001
Sorry, this is all the information I have; I hope it helps to pinpoint the issue. I'm running version 0.6.5, by the way, on CPU only.
PyTorch is installed in the environment as well, but I didn't explicitly set GluonTS to use it, so I suppose it's running on the default MXNet backend.
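If it helps, this is the quick check I can run to confirm which backend packages are importable in the environment (just a generic environment check, nothing GluonTS-specific):

# report which deep-learning backends are importable in this environment
import importlib.util

for pkg in ("mxnet", "torch"):
    print(pkg, "available" if importlib.util.find_spec(pkg) else "missing")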
If the current model file is written before the model starts to train, then I would say it's the UnivariateMotif model template causing the stalling; however, that doesn't seem to square with the log output, does it?
Sorry I don't have more information to share. If you need me to run it again to get more debugging information, please let me know (and how to retrieve the extra debug info :) ).
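One thing I could try on the next run is dumping stack traces while it hangs, using the standard-library faulthandler. A sketch of what I have in mind (the file name and interval are arbitrary choices on my part):

# periodically dump all thread stacks so we can see where the hang occurs
import faulthandler

with open("hang_traceback.log", "w") as f:
    # every 300 seconds, write the traceback of every thread to the log file
    faulthandler.dump_traceback_later(300, repeat=True, file=f)
    # ... run the AutoTS fit here ...
    faulthandler.cancel_dump_traceback_later()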