When I move checkpoints from Ubuntu (Azure server) to Windows (my local system) - I get error #162

Closed
saippuakauppias opened this issue Feb 20, 2017 · 4 comments

Comments

@saippuakauppias

saippuakauppias commented Feb 20, 2017

Hello. My situation is not very typical and is probably related to Torch itself, but I'm asking for your help.

I wrote a semi-automatic script that starts neural network training on NC instances (GPU, Tesla K80) in Azure, where I use CUDA Docker.
I was also able to run torch-rnn on Win7 and Win10 with distro-win (it was very painful, especially on Win7; Win10 is the easiest way to do it). If you are reading this issue and want to run torch-rnn on Windows, you may wonder how I got torch-hdf5 running.

My question is this: when I trained the network on Azure (Ubuntu 16.04 x64, with GPU) and moved the checkpoint files to Windows 7 (x64, CPU only, no GPU), I got an error when running sample.lua:

C:\torch-rnn>th sample.lua -checkpoint cp\checkpoint_49800.t7 -length 1000 -gpu -1
C:\distro-win\install.\bin\luajit.exe: ...\install.\luarocks\systree/share/lua/5.1/torch\File.lua:370: table index is nil
stack traceback:
...\install.\luarocks\systree/share/lua/5.1/torch\File.lua:370: in function 'readObject'
...n\install.\luarocks\systree/share/lua/5.1/nn\Module.lua:192: in function 'read'
...\install.\luarocks\systree/share/lua/5.1/torch\File.lua:351: in function 'readObject'
...\install.\luarocks\systree/share/lua/5.1/torch\File.lua:369: in function 'readObject'
...\install.\luarocks\systree/share/lua/5.1/torch\File.lua:369: in function 'readObject'
...n\install.\luarocks\systree/share/lua/5.1/nn\Module.lua:192: in function 'read'
...\install.\luarocks\systree/share/lua/5.1/torch\File.lua:351: in function 'readObject'
...\install.\luarocks\systree/share/lua/5.1/torch\File.lua:369: in function 'readObject'
...\install.\luarocks\systree/share/lua/5.1/torch\File.lua:409: in function 'load'
sample.lua:19: in main chunk
[C]: in function 'dofile'
....\luarocks\systree\lib\luarocks\rocks\trepl\scm-1\bin\th:150: in main chunk
[C]: at 0x013f4f1eb0

It probably has to do with the Torch version and how it deserializes .t7 files, because I found similar problems: #148 and #80. If you run the test from the latter issue, it fails with the same error:

C:\torch-rnn>th
th> require 'LanguageModel'
true
[0.1003s]
th> path = 'C:/torch-rnn/cp/checkpoint_49800.t7'
[0.0001s]
th> checkpoint = torch.load(path)
...\install.\luarocks\systree/share/lua/5.1/torch\File.lua:370: table index is nil
stack traceback:
...\install.\luarocks\systree/share/lua/5.1/torch\File.lua:370: in function 'readObject'
...n\install.\luarocks\systree/share/lua/5.1/nn\Module.lua:192: in function 'read'
...\install.\luarocks\systree/share/lua/5.1/torch\File.lua:351: in function 'readObject'
...\install.\luarocks\systree/share/lua/5.1/torch\File.lua:369: in function 'readObject'
...\install.\luarocks\systree/share/lua/5.1/torch\File.lua:369: in function 'readObject'
...n\install.\luarocks\systree/share/lua/5.1/nn\Module.lua:192: in function 'read'
...\install.\luarocks\systree/share/lua/5.1/torch\File.lua:351: in function 'readObject'
...\install.\luarocks\systree/share/lua/5.1/torch\File.lua:369: in function 'readObject'
...\install.\luarocks\systree/share/lua/5.1/torch\File.lua:409: in function 'load'
[string "checkpoint = torch.load(path)"]:1: in main chunk
[C]: in function 'xpcall'
...\install.\luarocks\systree/share/lua/5.1/trepl/init.lua:679: in function 'repl'
....\luarocks\systree\lib\luarocks\rocks\trepl\scm-1\bin\th:204: in main chunk
[C]: at 0x013f1e1eb0
[0.0113s]

I also tried training a network on another Windows machine (Win10 64-bit, with GPU) and using its checkpoint files on my Win7 machine (64-bit, CPU only, no GPU), and it worked! I read in other issues that moving checkpoints from GPU to CPU works fine.

I understand that this error is related to deserialization, but I can't solve it. Please help me: I want to train a network on Azure and use it on Windows.

@antihutka
Contributor

Try saving the checkpoint in text (ASCII) format on Ubuntu, then load it on Windows and convert it back to binary.

$ th
th> require 'LanguageModel'
th> cp = torch.load('checkpoint.t7')
th> torch.save('checkpoint_text', cp, 'ascii')
-- on Windows (start th again; the model class must be loaded before torch.load)
th> require 'LanguageModel'
th> cp = torch.load('checkpoint_text', 'ascii')
th> torch.save('checkpoint.t7', cp)
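
The same round trip could also be written as a small non-interactive script. This is only a sketch under some assumptions: the file name `convert_checkpoint.lua`, the direction names `to_ascii` / `to_binary`, and the helper functions `formats_for` and `convert` are hypothetical and not part of torch-rnn, and the script is assumed to be run with `th` from the torch-rnn directory so that `require 'LanguageModel'` resolves. It only uses `torch.load` and `torch.save` with an explicit format argument, as in the session above.

```lua
-- convert_checkpoint.lua (hypothetical name); run from the torch-rnn directory.
-- On the training machine:  th convert_checkpoint.lua to_ascii  checkpoint.t7   checkpoint_text
-- On the target machine:    th convert_checkpoint.lua to_binary checkpoint_text checkpoint.t7

-- Map a direction to the (read format, write format) pair
-- that is passed to torch.load and torch.save.
local function formats_for(direction)
  if direction == 'to_ascii' then
    return 'binary', 'ascii'
  elseif direction == 'to_binary' then
    return 'ascii', 'binary'
  end
  error('unknown direction: ' .. tostring(direction))
end

-- Load a checkpoint in one serialization format and write it back in the other.
local function convert(direction, src, dst)
  require 'torch'
  require 'nn'
  require 'LanguageModel'  -- registers the torch-rnn model class before loading
  local read_format, write_format = formats_for(direction)
  local checkpoint = torch.load(src, read_format)
  torch.save(dst, checkpoint, write_format)
end

-- Only convert when exactly three command-line arguments are given.
if arg and #arg == 3 then
  convert(arg[1], arg[2], arg[3])
end
```

The text file is the only thing that has to be copied between machines; each binary file is read or written on the platform where it is used.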

@saippuakauppias
Author

@antihutka Wow! This is a very simple and workable solution, just what I was looking for! Thank you very much!

But what causes this? Might it be worth the developers releasing a patch to fix it?

@antihutka
Contributor

I'm not sure there's a lot that can be done here. Torch's binary serialization formats are incompatible between platforms and writing a conversion script might be the best option.

@ChrisCummins
Collaborator

Marking this as solved. Platform-dependent binary serialization is documented in Torch. If other users encounter this problem, it may be worth documenting in this project too.
