When I move checkpoints from Ubuntu (Azure server) to Windows (my local system) - I get error #162

saippuakauppias · 2017-02-20T20:02:14Z

Hello. My situation is not very typical and is probably linked to Torch itself, but I'm asking for your help.

I wrote a semi-automatic script to start training the neural network on NC instances (GPU, Tesla K80) in Azure. There I am using CUDA Docker.
I was also able to run torch-rnn on Win7 and Win10 with distro-win (it was very painful, especially on Win7; Win10 is the easier of the two). If you are reading this issue because you want to run torch-rnn on Windows, you will probably wonder how I got torch-hdf5 running.

My question is this: when I trained the network on Azure (Ubuntu 16.04 x64, with GPU) and moved the checkpoint files to Windows 7 (x64, CPU only, no GPU), I got an error when running sample.lua:

C:\torch-rnn>th sample.lua -checkpoint cp\checkpoint_49800.t7 -length 1000 -gpu -1
C:\distro-win\install.\bin\luajit.exe: ...\install.\luarocks\systree/share/lua/5.1/torch\File.lua:370: table index is nil
stack traceback:
...\install.\luarocks\systree/share/lua/5.1/torch\File.lua:370: in function 'readObject'
...n\install.\luarocks\systree/share/lua/5.1/nn\Module.lua:192: in function 'read'
...\install.\luarocks\systree/share/lua/5.1/torch\File.lua:351: in function 'readObject'
...\install.\luarocks\systree/share/lua/5.1/torch\File.lua:369: in function 'readObject'
...\install.\luarocks\systree/share/lua/5.1/torch\File.lua:369: in function 'readObject'
...n\install.\luarocks\systree/share/lua/5.1/nn\Module.lua:192: in function 'read'
...\install.\luarocks\systree/share/lua/5.1/torch\File.lua:351: in function 'readObject'
...\install.\luarocks\systree/share/lua/5.1/torch\File.lua:369: in function 'readObject'
...\install.\luarocks\systree/share/lua/5.1/torch\File.lua:409: in function 'load'
sample.lua:19: in main chunk
[C]: in function 'dofile'
....\luarocks\systree\lib\luarocks\rocks\trepl\scm-1\bin\th:150: in main chunk
[C]: at 0x013f4f1eb0

It probably has to do with the Torch version and how it deserializes .t7 files, because I found similar problems: #148 and #80. If you run the test from the latter issue, it fails with the same error:

C:\torch-rnn>th
th> require 'LanguageModel'
true
[0.1003s]
th> path = 'C:/torch-rnn/cp/checkpoint_49800.t7'
[0.0001s]
th> checkpoint = torch.load(path)
...\install.\luarocks\systree/share/lua/5.1/torch\File.lua:370: table index is nil
stack traceback:
...\install.\luarocks\systree/share/lua/5.1/torch\File.lua:370: in function 'readObject'
...n\install.\luarocks\systree/share/lua/5.1/nn\Module.lua:192: in function 'read'
...\install.\luarocks\systree/share/lua/5.1/torch\File.lua:351: in function 'readObject'
...\install.\luarocks\systree/share/lua/5.1/torch\File.lua:369: in function 'readObject'
...\install.\luarocks\systree/share/lua/5.1/torch\File.lua:369: in function 'readObject'
...n\install.\luarocks\systree/share/lua/5.1/nn\Module.lua:192: in function 'read'
...\install.\luarocks\systree/share/lua/5.1/torch\File.lua:351: in function 'readObject'
...\install.\luarocks\systree/share/lua/5.1/torch\File.lua:369: in function 'readObject'
...\install.\luarocks\systree/share/lua/5.1/torch\File.lua:409: in function 'load'
[string "checkpoint = torch.load(path)"]:1: in main chunk
[C]: in function 'xpcall'
...\install.\luarocks\systree/share/lua/5.1/trepl/init.lua:679: in function 'repl'
....\luarocks\systree\lib\luarocks\rocks\trepl\scm-1\bin\th:204: in main chunk
[C]: at 0x013f1e1eb0
[0.0113s]

I also tried training a network on another Windows machine (Win10 64-bit, with GPU) and using its checkpoint files on my Win7 machine (64-bit, CPU only, no GPU), and it worked! I have read in other issues that transferring checkpoint files from GPU to CPU works fine.

I understand that this error is related to deserialization, but I can't solve it. Please help me: I want to train a network on Azure and use it on Windows.

antihutka · 2017-02-21T01:55:24Z

Try saving the checkpoint in text format, then load on Windows and convert it back to binary.

$ th
th> require 'LanguageModel'
th> cp = torch.load('checkpoint.t7')
th> torch.save('checkpoint_text', cp, 'ascii')
# on Windows
th> cp = torch.load('checkpoint_text', 'ascii')
th> torch.save('checkpoint.t7', cp)

saippuakauppias · 2017-02-21T11:51:21Z

@antihutka Wow! This is a very simple and workable solution, just what I was looking for! Thank you very much!

But what causes the problem? Might it be worth the developers releasing a patch to fix this?

antihutka · 2017-02-21T14:43:16Z

I'm not sure there's a lot that can be done here. Torch's binary serialization formats are incompatible between platforms and writing a conversion script might be the best option.
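
A minimal sketch of what such a conversion script could look like, wrapping the two REPL steps shown earlier in this thread. The file name `convert_checkpoint.lua` and the mode names `to_ascii` / `to_binary` are made up for illustration and are not part of torch-rnn; only `torch.load`, `torch.save` and the `'ascii'` format come from the thread. It assumes it is run with `th` from the torch-rnn directory so that `LanguageModel` can be required.

```lua
-- convert_checkpoint.lua (hypothetical name, not shipped with torch-rnn)
-- Usage: th convert_checkpoint.lua to_ascii|to_binary <input> <output>
require 'torch'
require 'nn'
require 'LanguageModel'  -- class definitions must be loaded before deserializing

local mode, src, dst = arg[1], arg[2], arg[3]
local valid_mode = (mode == 'to_ascii' or mode == 'to_binary')
if not (valid_mode and src and dst) then
  print('usage: th convert_checkpoint.lua to_ascii|to_binary <input> <output>')
  os.exit(1)
end

if mode == 'to_ascii' then
  -- Run on the machine that wrote the checkpoint: binary in, text out.
  local cp = torch.load(src)           -- binary is the default format
  torch.save(dst, cp, 'ascii')
else
  -- Run on the target machine: text in, native binary out.
  local cp = torch.load(src, 'ascii')
  torch.save(dst, cp)
end

print(string.format('wrote %s', dst))
```

For example, one might run `th convert_checkpoint.lua to_ascii checkpoint.t7 checkpoint_text` on the Ubuntu machine, copy `checkpoint_text` across, and then run `th convert_checkpoint.lua to_binary checkpoint_text checkpoint.t7` on Windows. The text file is likely to be noticeably larger than the binary one.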

ChrisCummins · 2017-04-26T08:02:53Z

Marking this as solved. Platform-dependent binary serialization is documented in Torch. If other users encounter this problem, it may be worth documenting in this project too.

lord-alfred mentioned this issue Feb 20, 2017

../torch/File.lua:370 table index is nil torch/torch7#744

Open

ChrisCummins closed this as completed Apr 26, 2017

dgcrouse mentioned this issue Apr 27, 2017

Trouble sampling on Raspberry Pi 2 model B #148

Closed
