Training crash with Seq2Seq model #494

Closed
stu1130 opened this issue Jan 7, 2021 · 4 comments
Labels: bug (Something isn't working)

stu1130 (Contributor) commented Jan 7, 2021

On Windows I get error code 0xC0000374 and on Linux I get the error message "corrupted double-linked list"; both look like memory corruption. I am training on CPU. The code is as follows:

import java.io.IOException;

import ai.djl.Model;
import ai.djl.modality.nlp.Decoder;
import ai.djl.modality.nlp.Encoder;
import ai.djl.modality.nlp.EncoderDecoder;
import ai.djl.modality.nlp.SimpleTextDecoder;
import ai.djl.modality.nlp.SimpleTextEncoder;
import ai.djl.ndarray.NDArray;
import ai.djl.ndarray.NDManager;
import ai.djl.ndarray.types.Shape;
import ai.djl.nn.recurrent.LSTM;
import ai.djl.training.DefaultTrainingConfig;
import ai.djl.training.EasyTrain;
import ai.djl.training.Trainer;
import ai.djl.training.TrainingConfig;
import ai.djl.training.dataset.ArrayDataset;
import ai.djl.training.listener.TrainingListener;
import ai.djl.training.loss.Loss;
import ai.djl.training.optimizer.Optimizer;
import ai.djl.training.tracker.Tracker;
import ai.djl.translate.TranslateException;

public class Seq2SeqRepro {

    public static void main(String[] args) throws IOException, TranslateException {
        try (Model model = Model.newInstance("time-series")) {
            NDManager nd = model.getNDManager();
            // Ten overlapping windows of a simple ascending series.
            NDArray inputs = nd.create(new float[][] {
                    {1.0f, 2.0f, 3.0f, 4.0f, 5.0f, 6.0f},
                    {2.0f, 3.0f, 4.0f, 5.0f, 6.0f, 7.0f},
                    {3.0f, 4.0f, 5.0f, 6.0f, 7.0f, 8.0f},
                    {4.0f, 5.0f, 6.0f, 7.0f, 8.0f, 9.0f},
                    {5.0f, 6.0f, 7.0f, 8.0f, 9.0f, 10.0f},
                    {6.0f, 7.0f, 8.0f, 9.0f, 10.0f, 11.0f},
                    {7.0f, 8.0f, 9.0f, 10.0f, 11.0f, 12.0f},
                    {8.0f, 9.0f, 10.0f, 11.0f, 12.0f, 13.0f},
                    {9.0f, 10.0f, 11.0f, 12.0f, 13.0f, 14.0f},
                    {10.0f, 11.0f, 12.0f, 13.0f, 14.0f, 15.0f}
            });
            Shape inputShape = inputs.getShape();
            long cnt = inputShape.get(0);
            long dur = inputShape.get(1);
            long predDur = 2L;
            long trainDur = 3L;
            long start = dur - trainDur - predDur - 1;
            // Slice each row into an encoder window and the decoder window that follows it,
            // then add a trailing channel dimension: (batch, time, 1).
            NDArray encoderInputs = inputs.get(":," + start + ":" + (start + trainDur))
                    .reshape(new Shape(cnt, trainDur, 1L));
            NDArray decoderInputs = inputs.get(":," + (start + trainDur) + ":" + (start + trainDur + predDur))
                    .reshape(new Shape(cnt, predDur, 1L));
            int batchSize = 1;
            ArrayDataset trainingDataset = new ArrayDataset.Builder()
                    .setData(encoderInputs)
                    .optLabels(decoderInputs)
                    .setSampling(batchSize, false)
                    .build();
            Encoder encoder = new SimpleTextEncoder(LSTM.builder()
                    .setNumStackedLayers(1)
                    .setStateSize(2)
                    .build());
            Decoder decoder = new SimpleTextDecoder(LSTM.builder()
                    .setNumStackedLayers(1)
                    .setStateSize(2)
                    .build(), 1);
            EncoderDecoder net = new EncoderDecoder(encoder, decoder);
            model.setBlock(net);
            Loss loss = Loss.l1Loss();
            Tracker tracker = Tracker.fixed(0.001f);
            Optimizer optimizer = Optimizer.sgd().setLearningRateTracker(tracker).build();
            TrainingListener[] listeners = TrainingListener.Defaults.logging();
            TrainingConfig config =
                    new DefaultTrainingConfig(loss).optOptimizer(optimizer).addTrainingListeners(listeners);
            int numEpochs = 10;
            try (Trainer trainer = model.newTrainer(config)) {
                trainer.initialize(encoderInputs.getShape(), decoderInputs.getShape());
                // EasyTrain.fit already loops over the epochs internally, so call it once.
                EasyTrain.fit(trainer, numEpochs, trainingDataset, null);
            }
        }
    }
}

The library versions are djl-0.9.0 and mxnet-1.7.0, and the crash point always seems to be EasyTrain.java line 83, collector.backward(lossValue). Why does the backward pass fail, and how can I solve it? Thanks!
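
For context, the failing call sits inside DJL's per-batch training loop; roughly (my paraphrase of EasyTrain in 0.9.0, not the exact source, with the device-split handling omitted):

for (Batch batch : trainer.iterateDataset(trainingDataset)) {
    try (GradientCollector collector = trainer.newGradientCollector()) {
        NDList preds = trainer.forward(batch.getData(), batch.getLabels());
        NDArray lossValue = trainer.getLoss().evaluate(batch.getLabels(), preds);
        collector.backward(lossValue); // EasyTrain.java line 83: where the crash happens
    }
    trainer.step(); // apply the SGD update
    batch.close();
}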

stu1130 added the bug label Jan 7, 2021
stu1130 changed the title from "Training crash with Seq2Seq" to "Training crash with Seq2Seq model" Jan 7, 2021

stu1130 (Contributor, Author) commented Jan 7, 2021

The diff file that reproduces the issue: tt.txt

lanking520 (Contributor) commented

Maybe there is an issue with the LSTM layer implementation.
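
A quick way to check that would be to train the bare LSTM block by itself, cutting SimpleTextEncoder/SimpleTextDecoder out of the picture. A rough sketch, not tested here: it reuses the encoderInputs windows from the repro above, and stateSize is 1 only so the L1 loss shapes line up:

// Hypothetical isolation test: does a bare LSTM forward/backward crash on CPU too?
try (Model lstmModel = Model.newInstance("lstm-only")) {
    lstmModel.setBlock(LSTM.builder().setNumStackedLayers(1).setStateSize(1).build());
    ArrayDataset ds = new ArrayDataset.Builder()
            .setData(encoderInputs)     // the (10, 3, 1) windows from the repro above
            .optLabels(encoderInputs)   // dummy labels, just enough to drive backward
            .setSampling(1, false)
            .build();
    try (Trainer trainer = lstmModel.newTrainer(new DefaultTrainingConfig(Loss.l1Loss()))) {
        trainer.initialize(encoderInputs.getShape());
        EasyTrain.fit(trainer, 1, ds, null); // if this alone crashes, the LSTM kernel is the culprit
    }
}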

stu1130 (Contributor, Author) commented Jan 13, 2021

This is an MKLDNN bug. The fix has been patched into MXNet 1.8: apache/mxnet#19022

stu1130 (Contributor, Author) commented Jan 13, 2021

I verified that the crash is gone when I use MXNet 1.8.
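
To pick up the fix from a DJL project, the build has to pull in an MXNet 1.8 native artifact. A sketch of the Gradle change, assuming the ai.djl.mxnet:mxnet-native-auto artifact; the exact version pairing is an assumption, so check the DJL release notes:

dependencies {
    implementation "ai.djl:api:0.10.0"                  // DJL release paired with MXNet 1.8 (assumed)
    runtimeOnly "ai.djl.mxnet:mxnet-engine:0.10.0"
    runtimeOnly "ai.djl.mxnet:mxnet-native-auto:1.8.0"  // native build containing the MKLDNN fix
}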

lanking520 self-assigned this Feb 17, 2021
Lokiiiiii pushed a commit to Lokiiiiii/djl that referenced this issue Oct 10, 2023
* fix the bytesio location

* update with force kill running container

* update test

* remove unused params