Training crash with Seq2Seq model #494

Closed
stu1130 opened this issue Jan 7, 2021 · 4 comments
Labels: bug (Something isn't working)

stu1130 (Contributor) commented Jan 7, 2021

On Windows I get error code 0xC0000374 and on Linux I get the error message "corrupted double-linked list"; both look like memory corruption. I am training on CPU. The code is as follows:

import java.io.IOException;

import ai.djl.Model;
import ai.djl.modality.nlp.Decoder;
import ai.djl.modality.nlp.Encoder;
import ai.djl.modality.nlp.EncoderDecoder;
import ai.djl.modality.nlp.SimpleTextDecoder;
import ai.djl.modality.nlp.SimpleTextEncoder;
import ai.djl.ndarray.NDArray;
import ai.djl.ndarray.NDManager;
import ai.djl.ndarray.types.Shape;
import ai.djl.nn.recurrent.LSTM;
import ai.djl.training.DefaultTrainingConfig;
import ai.djl.training.EasyTrain;
import ai.djl.training.Trainer;
import ai.djl.training.TrainingConfig;
import ai.djl.training.dataset.ArrayDataset;
import ai.djl.training.listener.TrainingListener;
import ai.djl.training.loss.Loss;
import ai.djl.training.optimizer.Optimizer;
import ai.djl.training.tracker.Tracker;
import ai.djl.translate.TranslateException;

public class Seq2SeqRepro {

    public static void main(String[] args) throws IOException, TranslateException {
        try (Model model = Model.newInstance("time-series")) {
            NDManager nd = model.getNDManager();
            // Ten overlapping windows of a simple ascending series.
            NDArray inputs = nd.create(new float[][] {
                    {1.0f, 2.0f, 3.0f, 4.0f, 5.0f, 6.0f},
                    {2.0f, 3.0f, 4.0f, 5.0f, 6.0f, 7.0f},
                    {3.0f, 4.0f, 5.0f, 6.0f, 7.0f, 8.0f},
                    {4.0f, 5.0f, 6.0f, 7.0f, 8.0f, 9.0f},
                    {5.0f, 6.0f, 7.0f, 8.0f, 9.0f, 10.0f},
                    {6.0f, 7.0f, 8.0f, 9.0f, 10.0f, 11.0f},
                    {7.0f, 8.0f, 9.0f, 10.0f, 11.0f, 12.0f},
                    {8.0f, 9.0f, 10.0f, 11.0f, 12.0f, 13.0f},
                    {9.0f, 10.0f, 11.0f, 12.0f, 13.0f, 14.0f},
                    {10.0f, 11.0f, 12.0f, 13.0f, 14.0f, 15.0f}
            });
            Shape inputShape = inputs.getShape();
            long cnt = inputShape.get(0);
            long dur = inputShape.get(1);
            long predDur = 2L;
            long trainDur = 3L;
            long start = dur - trainDur - predDur - 1;
            // Slice each row into an encoder window and the decoder window that follows it,
            // then add a trailing channel dimension: (batch, time, 1).
            NDArray encoderInputs = inputs.get(":," + start + ":" + (start + trainDur))
                    .reshape(new Shape(cnt, trainDur, 1L));
            NDArray decoderInputs = inputs.get(":," + (start + trainDur) + ":" + (start + trainDur + predDur))
                    .reshape(new Shape(cnt, predDur, 1L));
            int batchSize = 1;
            ArrayDataset trainingDataset = new ArrayDataset.Builder()
                    .setData(encoderInputs)
                    .optLabels(decoderInputs)
                    .setSampling(batchSize, false)
                    .build();
            Encoder encoder = new SimpleTextEncoder(LSTM.builder()
                    .setNumStackedLayers(1)
                    .setStateSize(2)
                    .build());
            Decoder decoder = new SimpleTextDecoder(LSTM.builder()
                    .setNumStackedLayers(1)
                    .setStateSize(2)
                    .build(), 1);
            EncoderDecoder net = new EncoderDecoder(encoder, decoder);
            model.setBlock(net);
            Loss loss = Loss.l1Loss();
            Tracker tracker = Tracker.fixed(0.001f);
            Optimizer optimizer = Optimizer.sgd().setLearningRateTracker(tracker).build();
            TrainingListener[] listeners = TrainingListener.Defaults.logging();
            TrainingConfig config =
                    new DefaultTrainingConfig(loss).optOptimizer(optimizer).addTrainingListeners(listeners);
            int numEpochs = 10;
            try (Trainer trainer = model.newTrainer(config)) {
                trainer.initialize(encoderInputs.getShape(), decoderInputs.getShape());
                // EasyTrain.fit already loops over the epochs internally, so call it once.
                EasyTrain.fit(trainer, numEpochs, trainingDataset, null);
            }
        }
    }
}

The library versions are djl-0.9.0 and mxnet-1.7.0, and the crash point always seems to be EasyTrain.java line 83, collector.backward(lossValue). Why does the backward pass fail, and how can I solve it? Thanks!
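
For context, the failing call sits inside DJL's per-batch training loop; roughly (my paraphrase of EasyTrain in 0.9.0, not the exact source, with the device-split handling omitted):

for (Batch batch : trainer.iterateDataset(trainingDataset)) {
    try (GradientCollector collector = trainer.newGradientCollector()) {
        NDList preds = trainer.forward(batch.getData(), batch.getLabels());
        NDArray lossValue = trainer.getLoss().evaluate(batch.getLabels(), preds);
        collector.backward(lossValue); // EasyTrain.java line 83: where the crash happens
    }
    trainer.step(); // apply the SGD update
    batch.close();
}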

stu1130 added the bug label Jan 7, 2021
stu1130 changed the title from "Training crash with Seq2Seq" to "Training crash with Seq2Seq model" Jan 7, 2021

stu1130 (Contributor, Author) commented Jan 7, 2021

The diff file that reproduces the issue: tt.txt

lanking520 (Contributor) commented

Maybe there is an issue with the LSTM layer implementation.
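
A quick way to check that would be to train the bare LSTM block by itself, cutting SimpleTextEncoder/SimpleTextDecoder out of the picture. A rough sketch, not tested here: it reuses the encoderInputs windows from the repro above, and stateSize is 1 only so the L1 loss shapes line up:

// Hypothetical isolation test: does a bare LSTM forward/backward crash on CPU too?
try (Model lstmModel = Model.newInstance("lstm-only")) {
    lstmModel.setBlock(LSTM.builder().setNumStackedLayers(1).setStateSize(1).build());
    ArrayDataset ds = new ArrayDataset.Builder()
            .setData(encoderInputs)     // the (10, 3, 1) windows from the repro above
            .optLabels(encoderInputs)   // dummy labels, just enough to drive backward
            .setSampling(1, false)
            .build();
    try (Trainer trainer = lstmModel.newTrainer(new DefaultTrainingConfig(Loss.l1Loss()))) {
        trainer.initialize(encoderInputs.getShape());
        EasyTrain.fit(trainer, 1, ds, null); // if this alone crashes, the LSTM kernel is the culprit
    }
}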

stu1130 (Contributor, Author) commented Jan 13, 2021

This is an MKLDNN bug. The fix has been patched into MXNet 1.8: apache/mxnet#19022

stu1130 (Contributor, Author) commented Jan 13, 2021

I verified that the crash is gone when I use MXNet 1.8.
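
To pick up the fix from a DJL project, the build has to pull in an MXNet 1.8 native artifact. A sketch of the Gradle change, assuming the ai.djl.mxnet:mxnet-native-auto artifact; the exact version pairing is an assumption, so check the DJL release notes:

dependencies {
    implementation "ai.djl:api:0.10.0"                  // DJL release paired with MXNet 1.8 (assumed)
    runtimeOnly "ai.djl.mxnet:mxnet-engine:0.10.0"
    runtimeOnly "ai.djl.mxnet:mxnet-native-auto:1.8.0"  // native build containing the MKLDNN fix
}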

lanking520 self-assigned this Feb 17, 2021
Lokiiiiii pushed a commit to Lokiiiiii/djl that referenced this issue Oct 10, 2023
* fix the bytesio location

* update with force kill running container

* update test

* remove unused params